cs.CL @ 2025-06-27: 660
-
00 06-26 (4) HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation HalluSegBench: Counterfactual Visual Reasoning for Segmentation Halluzination Evaluation HalluSegBench:分割幻觉评价的反事实视觉推理 2506.21546v1 -
01 06-26 Data Efficacy for Language Model Training Dateneffizienz für Sprachmodellschulungen 语文示范培训的数据效率 2506.21545v1 -
02 06-26 “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets “Was ist los, Doc?”: Analysieren, wie Nutzer Gesundheitsinformationen in groß angelegten KI-Datensätzen suchen “怎么了,医生?” :分析用户如何在大型对话的AI数据集中寻求健康信息。 2506.21532v1 -
03 06-26 OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages OpenNER 1.0: Standardisierte Open-Access-Datensätze für die Entity-Erkennung in 50+ Sprachen OpenNER 1.0:标准化的开放获取实体识别数据集,50+语言 2412.09587v2 -
04 06-26 Potemkin Understanding in Large Language Models Potemkin Verständnis in großen Sprachmodellen 大语言模型中的波坦金理解 2506.21521v1 -
05 06-26 skLEP: A Slovak General Language Understanding Benchmark skLEP: Ein slowakisches allgemeines Sprachverständnis Benchmark skLEP:斯洛伐克一般语言理解基准 2506.21508v1 -
06 06-26 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten Mind2Web 2: 与代理法官评估代理搜索 2506.21506v1 -
07 06-26 Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments Verbesserung des Nutzerengagements im sozial-gesteuerten Dialog durch interaktive LLM-Alignments 通过互动LLM调整,加强用户参与社会驱动对话 2506.21497v1 -
08 06-26 Bridging Offline and Online Reinforcement Learning for LLMs Überbrückung Offline- und Online-Verstärkungslernen für LLMs 为LLMs搭桥离线和在线强化学习 2506.21495v1 -
09 06-26 Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages Mit Phonemes: Mehrsprachigkeit von LLMs für nicht-lateinische Script-Sprachen verbessern 以音素提示:提高LLMs对非拉丁文字语言的多语言能力 2411.02398v3 -
10 06-26 From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v2 -
11 06-26 Logios : An open source Greek Polytonic Optical Character Recognition system Logios : Ein offenes griechisches Polytonisches optisches Zeichenerkennungssystem Logios: 开放源码希腊多元光学特征识别系统 2506.21474v1 -
12 06-26 TopK Language Models TopK-Sprachenmodelle 顶 K 语言模式 2506.21468v1 -
13 06-26 Aligning Spoken Dialogue Models from User Interactions Ausrichten von gesprochenen Dialogmodellen aus Benutzerinteraktionen 校对用户互动中的口语对话框模型 2506.21463v1 -
14 06-26 Spatial Mental Modeling from Limited Views Räumliche mentale Modellierung aus begrenzten Ansichten 根据有限观点进行空间精神建模 2506.21458v1 -
15 06-26 Text2Cypher Across Languages: Evaluating Foundational Models Beyond English Text2Cypher Across Sprachen: Bewertung von Grundmodellen jenseits des Englischen 跨语言文本:评价超越英语的基础模型 2506.21445v1 -
16 06-26 Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection Domänenwissen-verbesserte LLMs für Betrug und Konzept-Drift-Erkennung 用于欺诈和概念漂移检测的领域知识增强LLMs 2506.21443v1 -
17 06-26 Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v3 -
18 06-26 Rethinking LLM Training through Information Geometry and Quantum Metrics Rethinking LLM Training durch Informationsgeometrie und Quantenmetrics 通过信息几何和量度测量重新思考LLM培训 2506.15830v2 -
19 06-26 Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference Skalierbare Bayesische Low-Rank-Anpassung von großen Sprachmodellen über stochastische Variations-Subraum-Inferenz 通过Stochastic变异性子空间推断,对大语言模型进行可缩放的Bayesian低Rank 2506.21408v1 -
20 06-26 DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation DiffuCoder: Maskierte Diffusionsmodelle für die Codegenerierung verstehen und verbessern DiffuCoder:理解和改进用于代码生成的掩码扩散模型 2506.20639v2 -
21 06-26 Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings Hybrides Deep Learning und Signalverarbeitung für die arabische Dialekterkennung in Low-Resource-Einstellungen 低资源设置中阿拉伯语语音识别的混合深深学习和信号处理 2506.21386v1 -
22 06-26 Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation 利用LLM协助的对活检索一代人查询了解 2506.21384v1 -
23 06-26 Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models AI 文学批评主义的结构性方法:大语言模型利用Greimas半语言广场 2506.21360v1 -
24 06-26 Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts Latent Prototype Routing: Erzielen einer nahezu perfekten Lastabgleichung in Mixture-of-Experts 原型原型路由:在混合专家中实现近效果负载平衡 2506.21328v1 -
25 06-26 Exploring Adapter Design Tradeoffs for Low Resource Music Generation Erforschung von Adapter-Design-Tradeoffs für Low Resource Music Generation 探索用于低资源音乐制作的适应设计取舍 2506.21298v1 -
26 06-26 Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models Erkennung von Verweisen auf Ausdrücke im visuell begründeten Dialog mit autoregressiven Sprachmodellen 与自动递减语言模型进行视觉基础对话中检测引用表达式 2506.21294v1 -
27 06-26 Small Encoders Can Rival Large Decoders in Detecting Groundedness Kleine Encoder können große Decoder bei der Erkennung von Erdlichkeit rivalisieren 在地面探测中能够使大型分离器在探测地面时发生迭接 2506.21288v1 -
28 06-26 Thinkless: LLM Learns When to Think Denklos: LLM lernt, wann man denkt 无思想:LLM学习思考时间 2505.13379v2 -
29 06-26 Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung 双重检查者:通过自批评性微调,加强慢思考LLMs的推理 2506.21285v1 -
30 06-26 HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context HumanOmniV2: Vom Verständnis zur Omni-Modalen Vernunft mit Kontext HumanOmniV2:从理解到以上下文为根据的全方位模式 2506.21277v1 -
31 06-26 Cat and Mouse – Can Fake Text Generation Outpace Detector Systems? Katze und Maus – Kann die Textgenerierung ausfallende Detektorsysteme fälschen? 猫和老鼠 – – 假文本生成能否超越检测器系统? 2506.21274v1 -
32 06-26 A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns Ein Troublemaker mit ansteckenden Jailbreak macht Chaos in ehrlichen Städten 一个麻烦制造者 与贪婪的监狱破碎 制造混乱 在诚实的城镇 2410.16155v2 -
33 06-26 DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster DiLoCoX: Ein kommunikationsarmer groß angelegter Ausbildungsrahmen für dezentralisierte Cluster DILOCOX:权力下放小组的低通信大范围培训框架 2506.21263v1 -
34 06-26 Simulating Hard Attention Using Soft Attention Simulation der harten Aufmerksamkeit mit weicher Aufmerksamkeit 使用软关注模拟硬关注 2412.09925v2 -
35 06-26 Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents Agent-RewardBench: Auf dem Weg zu einem einheitlichen Benchmark für Prämienmodellierung über Wahrnehmung, Planung und Sicherheit in multimodalen Real-World-Agenten Agent-RewardBench:建立一个统一基准,用于在现实世界多式联运代理中建立跨认知、规划和安全概念、规划与安全的奖励模型 2506.21252v1 -
36 06-26 Capturing Style in Author and Document Representation Stil in der Autor- und Dokumentdarstellung erfassen 在作者和文件代表中获取样式 2407.13358v2 -
37 06-26 Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval Automatische Termextraktion mit großen Sprachmodellen durch syntactic Retrieval verbessern 通过同步检索增强使用大语言模型的自动定期抽取功能 2506.21222v1 -
38 06-26 Complexity-aware fine-tuning Komplexitätsbewusste Feinabstimmung 复杂度认知微调 2506.21220v1 -
39 06-26 Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? Kausale Vernunft in großen Sprachmodellen enthüllen: Realität oder Mirage? 大语言模型中未解的因果理由:现实还是幻影? 2506.21215v1 -
40 06-26 TAPS: Tool-Augmented Personalisation via Structured Tagging TAPS: Tool-Augmented Personalisierung durch strukturiertes Tagging TAPS: 通过结构拖网提高工具的个性化 2506.20409v2 -
41 06-26 LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey LLM-basierte human-agente Kooperations- und Interaktionssysteme: Eine Umfrage 以LLM为基础的人类-机构协作和互动系统:调查 2505.00753v4 -
42 06-26 Prompt-Guided Turn-Taking Prediction Prompt-geführte Turn-Taking-Vorhersage 即时指导的回转预测 2506.21191v1 -
43 06-26 Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks 维持MTEB:实现长期使用和可复制嵌入基准 2506.21182v1 -
44 06-26 Compressed and Smooth Latent Space for Text Diffusion Modeling Komprimierter und glatter Latent-Raum für Text-Diffusionsmodellierung 压缩和平滑的文本传播中缓流空间模型模型 2506.21170v1 -
45 06-26 CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v4 -
46 06-26 Do Large Language Models Advocate for Inferentialism? Befürworten große Sprachmodelle den Inferentialismus? 大语言模型是否为推定主义辩护? 2412.14501v2 -
47 06-26 Learning Evaluation Models from Large Language Models for Sequence Generation Learning Evaluation Models aus großen Sprachmodellen für die Sequenzgenerierung 序列生成大语言模式学习评价模式 2308.04386v3 -
48 06-26 Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models Progtuning: Progressives Fine-Tuning-Framework für transformerbasierte Sprachmodelle 改进:基于变换器的语文模式逐步微调框架 2506.21119v1 -
49 06-26 Learning to Skip the Middle Layers of Transformers Lernen, die mittleren Schichten der Transformer zu überspringen 学习跳过变换器的中层 2506.21103v1 -
50 06-26 ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry ComRAG: Retrieval-Augmented Generation mit dynamischen Vector Stores für Echtzeit-Community-Frageantworten in der Industrie ComRAG: 利用动态矢量储存库实时社区工业问题回答实时社区问题的回收-原始一代 2506.21098v1 -
51 06-26 HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics HERMES: zeitlich-zusammenhängendes lang-für-M Verständnis mit Episoden und Semantik HERMES: 与分数和语义学的理解 2408.17443v4 -
52 06-26 DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning DALR: Dual Level Alignment Learning für multimodales Sentence Representative Learning DALR: 双级统一学习促进多式判决代表制学习 2506.21096v1 -
53 06-26 Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph Verbesserung der LLM-Tool-Nutzung mit hochwertigen Instruktionsdaten aus Wissensgrafik 利用来自知识图的高质量教学数据加强LLM工具的使用 2506.21071v1 -
54 06-26 MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v1 -
55 06-26 Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach Bewertung der Diagnoseleistung bei seltenen Krankheiten bei Symptomen: Ein synthetischer Vignette-Simulationsansatz 评价症状检查器中的罕见疾病诊断性能: 合成Vignette模拟方法 2506.19750v3 -
56 06-26 Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v3 -
57 06-26 A Semi-supervised Scalable Unified Framework for E-commerce Query Classification Ein halbüberwachtes skalierbares Unified Framework für die E-Commerce Query Classification 半监督的电子商务查询分类可扩展统一框架 2506.21049v1 -
58 06-26 MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting MockLLM: Ein Multi-Agent-Behavior-Kooperationsrahmen für Online-Jobsuche und Recruiting MockLLM:网上求职和招聘多代理行为协作框架 2405.18113v2 -
59 06-26 SceneGenAgent: Precise Industrial Scene Generation with Coding Agent SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent SceneGenAgent: 利用编码代理进行精确的工业场景生成 2410.21909v3 -
60 06-26 Large Language Models Acing Chartered Accountancy Große Sprachmodelle meistern Chartered Accountancy 大语言模型在特许会计考试中取得优异成绩 2506.21031v1 -
61 06-26 SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control SAC: Ein Rahmen für die Messung und Induktion von Persönlichkeitseigenschaften in LLMs mit dynamischer Intensitätskontrolle SAC: 具有动态强度控制的LLMs中测量和诱导人格特质的框架 2506.20993v1 -
62 06-26 SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes SharpZO: Hybrid Sharpness-Aware Vision Sprachmodell Prompt Tuning via Forward-Only Passes SharpZO: 混合尖锐-敏锐视觉语言模型,通过前向-单行道快速调试 2506.20990v1 -
63 06-26 SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization SACL: Verständnis und Bekämpfung von Textbias im Code Retrieval mit semantisch-angereicherter Reranking und Lokalisierung SACL: 通过语义增强的重排序和定位理解并应对代码检索中的文本偏差 2506.20081v2 -
64 06-26 Can Gradient Descent Simulate Prompting? Kann Gradient Descent Simulate Prompting? 梯子源模拟能刺激吗? 2506.20989v1 -
65 06-26 Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models Vergleich von Retrieval-Augmentation und Parameter-Effizient Fine-Tuning für Datenschutz-Erhaltung Personalisierung von großen Sprachmodellen 比较大语言模型的检索增强和参数有效微量微量美化,促进保护隐私和保持个人特征化 2409.09510v2 -
66 06-26 Reward-Guided Speculative Decoding for Efficient LLM Reasoning Belohnungsgeführte spekulative Dekodierung für effiziente LLM-Reasoning 高效 LLM 理由说明的受奖励指导的投机性说明 2501.19324v3 -
67 06-26 Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization Ranken lernen für mehrere Retrieval-Augmented Modelle durch iterative Utility Maximierung 通过迭代功用最大化学习多重检索增强型号排名 2410.09942v2 -
68 06-26 Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation Jenseits der reaktiven Sicherheit: Risiko-Bewusst LLM-Ausrichtung über Long-Horizon Simulation 超越反应安全性:通过长休松模拟使风险-警用LLM对齐 2506.20949v1 -
69 06-26 Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters Bewertung großer Sprachmodelle für automatisierte klinische Abstraktion in pulmonalen Embolism Registries: Performance Across Modellgrößen, Versionen und Parameter 评价肺部新陈代谢登记簿自动临床抽象化的大型语言模型:不同模型大小、版本和参数的性能 2503.21004v2 -
70 06-26 PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks PP-DocBee: Multimodales Dokumentenverständnis durch Tricks verbessern PP-DocBee:通过一袋小把戏改进多式文件理解 2503.04065v3 -
71 06-26 KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell KaLM-Embedding-V2:高级培训技术和数据预报 2506.20923v1 -
72 06-26 FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language FineWeb2: Eine Pipeline, um sie alle zu skalieren – Anpassung der Vorschulungsdatenverarbeitung an jede Sprache FineWeb2: 将全部标准缩放的一条管道 – – 将培训前数据处理适应于每种语言 2506.20920v1 -
73 06-26 Optimising Language Models for Downstream Tasks: A Post-Training Perspective Sprachmodelle für Downstream-Aufgaben optimieren: Eine Perspektive nach dem Training 优化下游任务的语言模式:培训后展望 2506.20917v1 -
74 06-25 (3) Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues Erforschung von fünf großen Persönlichkeits- und KI-Kapazitätseffekten in LLM-simulierten Verhandlungsdialogen 探讨LLM模拟谈判对话中的五大个性和AI 2506.15928v2 -
75 06-25 GroundCap: A Visually Grounded Image Captioning Dataset GroundCap: Ein visuell geerdeter Bildbeschriftungs-Datensatz GroundCap:视觉定位图像说明数据集 2502.13898v3 -
76 06-25 A3 : an Analytical Low-Rank Approximation Framework for Attention A3: ein analytischer Rahmen für die Annäherung an den Low-Rank-Wert A3: 分析性低Rank接近度关注框架 2505.12942v3 -
77 06-25 Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine Entscheiden Sie weniger, kommunizieren Sie mehr: Auf dem Konstrukt Gültigkeit der End-to-End-Fact-Checking in der Medizin 少做决定,多做交流:论医学端到端事实核查的建构效度 2506.20876v1 -
78 06-25 Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA Leaner Training, Lower Leakage: Die Erinnerung an LLM Fine-Tuning mit LoRA 更精简的训练,更低的泄漏:重新研究使用LoRA进行LLM微调时的记忆 2506.20856v1 -
79 06-25 Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training Datenschutz Ripple Effekte von Hinzufügen oder Entfernen von persönlichen Informationen in Sprachmodell-Training 语言模式培训中增加或删除个人信息对隐私的波纹效应 2502.15680v2 -
80 06-25 Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes Enthüllen versteckter gewalttätiger Tendenzen in LLMs: Eine demografische Analyse über Verhaltensvignetten 《LLMs中隐蔽的隐藏暴力时期:通过行为征兆进行的人口分析》 2506.20822v1 -
81 06-25 MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering MultiFinRAG: Ein optimiertes multimodales Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering 多金融inRAG: 最佳多式联运回收-提款一代(RAG)金融问题回答框架 2506.20821v1 -
82 06-25 The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas Die Ideation-Execution-Gap: Ergebnisse der LLM-generierten gegen menschliche Forschungsideen 观察与执行差距:LLM-Genered与人类研究概念的执行结果 2506.20803v1 -
83 06-25 Multi-lingual Functional Evaluation for Large Language Models Multilinguale Funktionsbewertung für große Sprachmodelle 大语言模式多语文职能评价 2506.20793v1 -
84 06-25 CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement CodeLutra: Steigerung der LLM-Code-Generierung durch Präferenz-geführte Verfeinerung 代码Lutra:通过优先指导改进促进LLM代码生成 2411.05199v3 -
85 06-25 Towards Probabilistic Question Answering Over Tabular Data Auf dem Weg zu einer probabilistischen Fragestellung über tabellarische Daten 走向在表格数据上回答概率问题 2506.20747v1 -
86 06-25 MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation MAGPIE: Ein Datensatz für die multi-Agenten-Kontextbewertung von PrIvacy MAGPIE: 多智能体情境隐私评估数据集 2506.20737v1 -
87 06-25 MMSearch-R1: Incentivizing LMMs to Search MMSearch-R1: LMMs zur Suche anregen MMSearch- R1: 激励使用 LMM 搜索 2506.20670v1 -
88 06-25 Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren 使用认知模型来解释LLMs中的价值权衡 2506.20666v1 -
89 06-25 The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind Der Decrypto-Benchmark für multi-agente Vernunft und Theorie des Geistes 多种代理理由和思想理论的Decrypto Decrypto基准 2506.20664v1 -
90 06-25 OmniGen2: Exploration to Advanced Multimodal Generation OmniGen2: Exploration zur fortgeschrittenen multimodalen Generation OmniGen 2:探索先进的多式联运 2506.18871v2 -
91 06-25 Memento: Note-Taking for Your Future Self Memento: Notizen für Ihr zukünftiges Selbst 纪念:为未来的自我记记记 2506.20642v1 -
92 06-25 PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models PLoP: Präzise LoRA-Platzierung für effiziente Feinsteuerung großer Modelle PLoP: 高效微调大型模型的精确LoRA定位 2506.20629v1 -
93 06-25 Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models Recycling the Web: Eine Methode zur Verbesserung der Vorschulung von Daten Qualität und Menge für Sprachmodelle 网上再循环:提高语文模式培训前数据质量和数量的方法 2506.04689v2 -
94 06-25 Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm Modellbearbeitung als zweischneidiges Schwert: Lenker Ethisches Verhalten gegenüber Wohltätigkeit oder Schaden 模版编辑为双字箭:指导剂道德行为促进福利或伤害 2506.20606v1 -
95 06-25 Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models Ad-hoc-Konzept Formung in den Game-Codenamen als Mittel zur Bewertung großer Sprachmodelle 形成游戏代号作为评价大语言模式的一种手段的游戏代码概念 2502.11707v2 -
96 06-25 FluoroSAM: A Language-promptable Foundation Model for Flexible X-ray Image Segmentation FluoroSAM: Ein sprachförderndes Foundation-Modell für flexible Röntgenbild-Segmentierung FluoroSAM:灵活X射线图像分割语言快速基础模型 2403.08059v3 -
97 06-25 On the Role of Context in Reading Time Prediction Zur Rolle des Kontexts bei der Lesezeitvorhersage 关于在阅读时间预测方面背景作用的 2409.08160v4 -
98 06-25 Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling Entsperren des In-Context-Lernens für natürliche Datensätze jenseits der Sprachmodellierung 解锁超出语言建模之外的自然数据集的文中学习 2501.06256v2 -
99 06-25 When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs Wenn Leben gibt Ihnen Proben: Die Vorteile der Skalierung Inferenz berechnen für mehrsprachige LLMs 当生命赋予你样本时:扩大多语种LLMs的推论计算的益处 2506.20544v1 -
100 06-25 Attention with Trained Embeddings Provably Selects Important Tokens Aufmerksamkeit bei trainierten Einbettungen wählt wahrscheinlich wichtige Token aus 与经过训练的嵌入器的关注 2505.17282v3 -
101 06-25 Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers Zunge vom Gedanken trennen: Aktivierung Patching offenbart sprach-agnostische Konzeptdarstellungen in Transformern 与思想分离的舌语:在变形器中进行语言-不可知概念代表 2411.08745v4 -
102 06-25 Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards Asymmetrisches REINFORCE für Off-Policy-Verstärkung-Lernen: Ausgleich positiver und negativer Belohnungen 非政策加强学习的不对称REINFORCE对非政策加强学习的影响:平衡正与负的奖励 2506.20520v1 -
103 06-25 OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling OctoThinker: Mittleres Training fördert verstärktes Lernen Scaling OctoThinker: 中级培训鼓励加强学习 2506.20512v1 -
104 06-25 ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v1 -
105 06-25 Counterfactual Influence as a Distributional Quantity Gegenfaktischer Einfluss als Verteilungsmenge 分发量的反事实影响 2506.20481v1 -
106 06-25 GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching GPTailor: Large Language Model Pruning durch Layer Cutting und Stitching GPTAilor: 大语言模型 穿过层切切和缝合 2506.20480v1 -
107 06-25 Graph Linearization Methods for Reasoning on Graphs with Large Language Models Graphische Linearisierungsmethoden zur Begründung von Graphen mit großen Sprachmodellen 用于解释大语言模型图表的线性线性方法 2410.19494v3 -
108 06-25 Knowledge-Aware Diverse Reranking for Cross-Source Question Answering Knowledge-Aware Diverse Reranking für Cross-Source-Frageantworten 跨源问题解答知识-知识-知识-知识-知识多样化重新排名 2506.20476v1 -
109 06-25 Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations Zeit ist auf meiner Seite: Dynamik des Gesprächs-Zeit-Sharing in Video-Chat-Gesprächen 时间就在我身边:视频聊天中的谈话时间分享动态 2506.20474v1 -
110 06-25 Probing AI Safety with Source Code Nachweis der KI-Sicherheit mit Quellcode 以《源代码》检验AI安全 2506.20471v1 -
111 06-25 Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v3 -
112 06-25 CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models CogniBench: Ein gesetzlich inspirierter Rahmen und Datensatz zur Bewertung der kognitiven Treue großer Sprachmodelle CogniBench:评估大语言模型认知性信仰的受法律启发的框架和数据集 2505.20767v4 -
113 06-25 Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception Auf dem Weg zur vollen Nutzung der LLM-internen Staaten zur Verbesserung der Wissensgrenzenwahrnehmung 争取充分利用LLM 内部各国加强知识边界认识 2502.11677v2 -
114 06-25 An Agentic System for Rare Disease Diagnosis with Traceable Reasoning Ein Agentisches System für die Diagnose seltener Krankheiten mit rückverfolgbarer Begründung 利用可追踪理由进行罕见疾病诊断的制剂系统 2506.20430v1 -
115 06-25 SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities SMAR: Soft Modality-Aware Routing-Strategie für MoE-basierte multimodale große Sprachmodelle Erhaltung von Sprachfähigkeiten SMAR: 以教育部为基础的维护语言能力的多模式多语言模型软式模式-软件运行战略 2506.06406v2 -
116 06-25 Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content Biomed-angereichert: Ein biomedizinischer Datensatz mit LLMs für die Vorschulung und Extraktion seltener und versteckter Inhalte 生物医学富含生物医学:生物医学数据集,配有预培训和提取稀有和隐藏内容的LLMs 2506.20331v1 -
117 06-25 From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents Von der Kodikologie zum Code: Eine vergleichende Studie von Transformer und YOLO-basierten Detektoren zur Layoutanalyse in historischen Dokumenten 从术语到代码:用于历史文件布局分析的变异器和以YOLO为基地的YOLO探测器比较研究 2506.20326v1 -
118 06-25 VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback VICCA: Visuelle Interpretation und Verständlichkeit von Röntgenanomalien im generierten Bericht ohne menschliches Feedback VICCA:在没有人类反馈的情况下生成的报告对胸透X射线异常现象的视觉解读和理解 2501.17726v2 -
119 06-25 Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning Confucius3-Math: Leichtes Hochleistungs-LLM für das chinesische K-12 Mathematik-Lernen Confucius3-Math: 面向中国 K-12 数学学习的轻量级高性能推理LLM 2506.18330v2 -
120 06-25 VAQUUM: Are Vague Quantifiers Grounded in Visual Data? VAQUUM: Sind vage Quantifikatoren in visuellen Daten verankert? VAQUUM: 模糊量化器是否以视觉数据为基础? 2502.11874v3 -
121 06-25 FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment FundaQ-8: Ein klinisch inspiriertes Scoring-Framework für automatisierte Fundus-Bildqualitätsbewertung FundaQ-8:自动基金图像质量评估临床启发的分类框架 2506.20303v1 -
122 06-25 Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Ausbalancieren von Wahrhaftigkeit und Aufklärung mit unsicherer Anleitung Feintuning 平衡真实和知情与不确定性软件指示 2502.11962v3 -
123 06-25 LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems LR^2Bench: Bewertung von langkettigen Reflektierenden Fähigkeiten von großen Sprachmodellen durch Einschränkungen Zufriedenheitsprobleme LR^2Bench:通过限制满意度问题评价大语言模式的长链反思理由能力 2502.17848v4 -
124 06-25 LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs LADM: Langzeit-Kontext-Trainingsdatenauswahl mit aufmerksamkeitsbasierter Abhängigkeitsmessung für LLMs LADM: 长期培训数据选择,对LLMs进行以注意为基础的依赖性衡量 2503.02502v2 -
125 06-25 Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models Narrative Shift Detection: Ein hybrider Ansatz von dynamischen Themenmodellen und großen Sprachmodellen 叙述式转移探测:动态专题模型和大语言模型的混合方法 2506.20269v1 -
126 06-25 Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue Warum Roboter schlecht darin sind, ihre Fehler zu erkennen: Einschränkungen der Fehlkommunikationserkennung im Mensch-Roboter-Dialog 机器人为何在发现错误时不善:人类机器人对话中通信错误探测的限制 2506.20268v1 -
127 06-25 Language Modeling by Language Models Sprachmodellierung nach Sprachmodellen 按语文模式建模的语文 2506.20249v1 -
128 06-25 CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment CBF-AFA: Multi-SSL-Fusion auf Chunk-Basis für automatische Fluency Assessment CBF-AFA: 自动能力评估的基于整块的多SSL融合 2506.20243v1 -
129 06-25 Enhancing Large Language Models through Structured Reasoning Erweiterung großer Sprachmodelle durch strukturierte Vernunft 通过结构化理由加强大语言模式 2506.20241v1 -
130 06-25 LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models LLaVA-CMoE: Auf dem Weg zu einer kontinuierlichen Mischung von Experten für große Vision-Sprachenmodelle LLaVA-CMoE:建立大型视觉语言模型专家的连续混合体 2503.21227v3 -
131 06-25 Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems Perspektiven im Spiel: Ein multiperspektivischer Ansatz für integrativere NLP-Systeme 游戏中的视角:对更具包容性的NLP系统采取多视角办法 2506.20209v1 -
132 06-25 Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation Intrinsische vs. Extrinsische Bewertung tschechischer Satzeinbettungen: Semantische Relevanz hilft bei MT-Evaluierung nicht 捷克判决嵌入式:语义相关性无助于MT评价 2506.20203v1 -
133 06-25 How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models? Wie kann man Beispiele im In-Context-Lernen abrufen, um die Erkennung von Konversationsgefühlen mit großen Sprachmodellen zu verbessern? 如何利用大语言模式获取学习内文中的实例, 2506.20199v1 -
134 06-25 COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees COIN: Ungewissheitssicherung Selektive Frage-Beantwortung für Stiftungsmodelle mit wahrscheinlichen Risikogarantien COIN: 可靠风险保障基础模型的不确定性保护选择性问题选择性回答 2506.20178v1 -
135 06-25 Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation Conversational User-AI Intervention: Eine Studie zum Prompt Rewriting für verbesserte LLM Response Generation 相互交流的用户-AI干预:关于为改进LLM反应生成迅速改写的研究报告 2503.16789v2 -
136 06-25 SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs SEED: Ein struktureller Encoder für die Einbettung-getriebener Decodierung in Time Series Prediction mit LLMs SEED: 使用LLMs在时间序列预测中嵌入式分解编码结构编码器 2506.20167v1 -
137 06-25 AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control AALC: Großes Sprachmodell Effiziente Vernunft über adaptive Genauigkeits-Längen-Steuerung AALC:通过适应性准确度-语言控制进行大语言模型高效解释 2506.20160v1 -
138 06-25 Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners Belohnender Graph Reasoning Prozess macht LLMs mehr Generalized Reasoners 奖励图表说明程序使LLMs公司更普遍化理由 2503.00845v2 -
139 06-25 CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation CCRS: Ein Null-Shot LLM-as-a-Richter-Rahmen für eine umfassende RAG-Bewertung CCRS: 全面RAG综合评价框架 2506.20128v1 -
140 06-25 Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests Einsatz von KI-Gradern für fehlende Score-Imputation, um eine genaue Abschätzung der Fähigkeit in konstruierten Reaktionstests zu erreichen 利用AI 级数来计算缺失计分数,以在建构反应测试中实现准确能力估算 2506.20119v1 -
141 06-25 A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection Multi-Pass Large Language Model Framework für präzise und effiziente Radiologie Fehlererkennung melden 精确和高效放射学报告误差探测多版本大语言示范框架 2506.20112v1 -
142 06-25 A Global Context Mechanism for Sequence Labeling Ein globaler Kontextmechanismus für die Sequenzkennzeichnung 序列标签全球背景机制 2305.19928v5 -
143 06-25 What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning Was in LLM-generierten Daten zählt: Vielfalt und ihre Wirkung auf Modell Feintuning LLM产生的数据中哪些重要:多样性及其对模拟微调的影响 2506.19262v2 -
144 06-25 A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v2 -
145 06-25 Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition Fehlausrichtung des semantischen Beziehungswissens zwischen WordNet und menschlicher Intuition WordNet与人类直觉之间的语义关系知识失调 2412.02138v2 -
146 06-25 MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations MIRAGE: Benchmark für multimodale Informationssuche und -vernunft in sachverständigen Gesprächen in der Landwirtschaft MIRAGE:农业专家指导下的农业多模式信息查找和说明理由基准 2506.20100v1 -
147 06-25 PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models PSALM-V: Symbolische Planung in interaktiven visuellen Umgebungen mit großen Sprachmodellen automatisieren PSALM-V:在与大语言模型互动视觉环境中自动进行符号规划 2506.20097v1 -
148 06-25 PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding PP-DocBee2: Verbesserte Baselines mit effizienten Daten für multimodales Dokumentenverständnis PP-DocBee2:改进基线,提供高效数据,促进多模态文档理解 2506.18023v2 -
149 06-25 ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset ITFormer: Überbrückung von Zeitreihen und natürlicher Sprache für Multi-Modal QA mit großformatigem Multitask-Datensatz ITFormer:具有大型多任务数据集的多模式QA的连接时间序列和自然语言 2506.20093v1 -
150 06-25 Understanding World or Predicting Future? A Comprehensive Survey of World Models Welt verstehen oder Zukunft voraussagen? Eine umfassende Übersicht über Weltmodelle 了解世界或预测未来?世界模式综合概览 2411.14499v2 -
151 06-25 Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models Aufmerksamkeitsentropie ist ein Schlüsselfaktor: Eine Analyse des Parallelkontexts Kodierung mit Full-Attention-basierten vortrainierten Sprachmodellen 注意信封是一个关键因素:分析平行背景编码,采用以充分注意为基础的预先培训前语文模式 2412.16545v2 -
152 06-25 Therapy as an NLP Task: Psychologists’ Comparison of LLMs and Human Peers in CBT Therapie als NLP-Aufgabe: Psychologenvergleich von LLMs und menschlichen Peers im CBT 作为NLP任务的治疗:心理学家对CBT中LLM与人类同侪的比较 2409.02244v2 -
153 06-25 Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder Bridging Compositional and Distributional Semantics: Eine Umfrage zur latenten Semantischen Geometrie über AutoEncoder 搭桥构成和分布式语义学:通过自动 Encder 进行边端语义几何测量调查 2506.20083v1 -
154 06-25 Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective Quantifizierung von Fairness in LLMs jenseits von Tokens: Eine semantische und statistische Perspektive 超越词元量化LLM的公平性:语义和统计观点 2506.19028v2 -
155 06-25 mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks mSTEB: Massive mehrsprachige Bewertung von LLMs zu Sprach- und Textaufgaben mSTEB:对LLM在语音和文本任务上的大规模多语种评价 2506.08400v2 -
156 06-25 A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs Ein modulares Multitask-Reasoning-Framework Integrating Spatio-temporal Models und LLMs 纳入时空空间模型和LLMs的模块多任务解释框架 2506.20073v1 -
157 06-25 Computation Mechanism Behind LLM Position Generalization Berechnungsmechanismus hinter LLM-Positionsverallgemeinerung LLM 职位普遍化背后的计算机制 2503.13305v3 -
158 06-25 Thought Anchors: Which LLM Reasoning Steps Matter? Thought Anchors: Welche LLM-Reasoning-Schritte sind wichtig? Thought Anchors:哪些LLM推理步骤重要? 2506.19143v2 -
159 06-24 (2) Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models Lernen von Instruction-Following-Richtlinien durch offenes Instruction-Relabeling mit großen Sprachmodellen 通过不限名额指令与大语言模式重新标签 2506.20061v1 -
160 06-24 Cross-Layer Discrete Concept Discovery for Interpreting Language Models Cross-Layer Discrete Concept Discovery für Interpretationssprachmodelle 解释语言模型的跨语言监听概念发现 2506.20040v1 -
161 06-24 The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research Der lärmende Pfad von der Quelle zur Zitation: Wie Wissenschaftler sich mit vergangener Forschung auseinandersetzen 《从来源到引用的噪音路径:衡量学者如何参与过去的研究》 2502.20581v3 -
162 06-24 Evaluating Long Range Dependency Handling in Code Generation LLMs Bewertung der Langzeitabhängigkeitsbehandlung in LLMs der Code-Generation 评估代码生成中的长期依赖性处理 2407.21049v2 -
163 06-24 Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs Sprachmodelle lernen seltene Phänomene aus weniger seltenen Phänomenen: Der Fall der vermissten AANNs 语言模型从少见少见的神话中学习罕见现象:失踪的AANNs案例 2403.19827v3 -
164 06-24 Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning Persona-Assigned Große Sprachmodelle zeigen Menschen-wie motivierte Vernunft 人与人之间签署的大型语言模型 2506.20020v1 -
165 06-24 Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks Präzise und energieeffizient: Modelle der lokalen Retrieval-Augmented Generation übertreffen kommerzielle Großsprachenmodelle in medizinischen Aufgaben 准确性和能源效率:当地检索和推荐的一代模型在医疗任务中超效商业大语言模型 2506.20009v1 -
166 06-24 Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ Können Sprachmodelle Programmierer für Coding ersetzen? REPOCOD sagt ‘Noch nicht’ 语言模式能替换编码程序程序员吗? REPOCOD 说“ 还没有” 。 2410.21647v4 -
167 06-24 A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior Ein Spatio-Temporal-Punkt-Verfahren zur feinkörnigen Modellierung des Leseverhaltens 阅读行为精细模拟模型的斯帕迪奥时点进程 2506.19999v1 -
168 06-24 WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development WAFFLE: Feinsteuerungs-Multi-Modal-Modell für automatisierte Front-End-Entwicklung WAFFLE: 自动前端开发的微调多模式模型 2410.18362v2 -
169 06-24 Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation Doc2Agent: Skalierbare Generierung von Tool-Using Agents aus API-Dokumentation Doc2Agent: 从 API 文件中可缩放生成工具使用代理器 2506.19998v1 -
170 06-24 When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour Wenn große Sprachmodelle Menschen widersprechen? Das sykophantische Verhalten von Large Language Models 当大语言模型与人类相矛盾时? 2311.09410v4 -
171 06-24 FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs FactCheckmate: Vorbeugend Halluzinationen in LMs erkennen und abschwächen 事实:先发制人地探测和减轻LMM的幻觉 2410.02899v2 -
172 06-24 Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs Inferenz-Skalierte GraphRAG: Verbesserung der Multi-Hop-Fragebeantwortung auf Wissensgraphen 推推缩比例图RAG:改进知识图多位数问题的回答 2506.19967v1 -
173 06-24 CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation CycleDistill: Bootstrapping Maschine Übersetzung mit LLMs mit Cyclical Destillation 循环蒸馏:用具有周期蒸馏作用的LLMML 制动机械翻译 2506.19952v1 -
174 06-24 Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation Aug2Search: Verbesserung der Facebook-Marktplatzsuche mit LLM-generierter synthetischer Datenaugmentierung Aug2Search:利用LLM生成的合成数据增强,加强Facebook市场搜索 2505.16065v3 -
175 06-24 GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models GlyphPattern: Ein abstrakter Mustererkennungs-Benchmark für Vision-Language-Modelle GlyphPattern:视觉语言模型的抽象模式识别基准 2408.05894v2 -
176 06-24 ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing ScaleCap: Inferenzzeitskalierbare Bildunterschriften über Dual-Modality-Debiasing ScaleCap:通过双模态去偏实现推理时可扩展的图像描述 2506.19848v1 -
177 06-24 Orthogonal Finetuning Made Scalable Orthogonales Finetuning skalierbar gemacht 可扩展的正交微调 2506.19847v1 -
178 06-24 MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration MAM: Modulares Multi-Agent Framework für multi-Modale medizinische Diagnose über rollenspezialisierte Zusammenarbeit MAM:通过作用专业化协作进行多模式医学诊断的模块多机构框架 2506.19835v1 -
179 06-24 How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text? Wie effektiv können BERT-Modelle den Kontext interpretieren und Bengali gemeinschaftlichen gewalttätigen Text erkennen? BERT模型如何有效地解释背景和检测孟加拉社区暴力文本? 2506.19831v1 -
180 06-24 Scaling Speculative Decoding with Lookahead Reasoning Spekulative Dekodierung mit Blick auf die Vernunft skalieren 带有 “ 眼前 “ 理由的 投机替代 2506.19830v1 -
181 06-24 Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models Bewertung der Einhaltung der Visualisierungsrichtlinien in Diagrammen für wissenschaftliche Publikationen mit großen Visions-Sprachmodellen 评价使用大愿景语言模型的科学出版物图表中视觉化准则的遵守情况 2506.19825v1 -
182 06-24 Entropy and type-token ratio in gigaword corpora Entropie- und Typ-Token-Verhältnis in Gigaword-Korpora 千兆字词公司中的 英文比率和类型-吨比 2411.10227v3 -
183 06-24 KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität KnowRL:探索知识强化学习促进事实质量 2506.19807v1 -
184 06-24 LLM-Based Social Simulations Require a Boundary LLM-basierte soziale Simulationen erfordern eine Grenze 以LLM为基础的社会模拟需要边界 2506.19806v1 -
185 06-24 Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie 开放源码LLMs为何要与数据分析斗争?系统的经验研究 2506.19794v1 -
186 06-24 Words as Trigger Points in Social Media Discussions: A Large-Scale Case Study about UK Politics on Reddit Worte als Auslöser Punkte in Social Media Diskussionen: Eine groß angelegte Fallstudie über britische Politik auf Reddit 作为社会媒体讨论的触发点的词句:关于Reddit上英国政治的大规模案例研究 2405.10213v3 -
187 06-24 A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models Ein grundlegendes individuelles Mobilitätsvorhersagemodell basierend auf Open-Source großen Sprachmodellen 基于开放源码大语言模式的基础性个人流动预测模型 2503.16553v2 -
188 06-24 Large language models for automated scholarly paper review: A survey Große Sprachmodelle für automatisierte wissenschaftliche Papierrezension: Eine Umfrage 用于自动学术性纸质审查的大型语言模型:调查 2501.10326v2 -
189 06-24 Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation Kling-Foley: Multimodaler Diffusionstransformator für hochwertige Video-zu-Audio-Generierung Kling-Foley:用于高质量视频到音频生成的多模态扩散Transformer 2506.19774v1 -
190 06-24 SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning SRFT: Einstufige Methode mit überwachter und verstärkter Feinsteuerung für die Vernunft SRFT: 单一标准方法,以监督和加固为理由的罚款 2506.19767v1 -
191 06-24 Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation Sensitive Inhaltsklassifikation in den sozialen Medien: Eine ganzheitliche Ressource und Bewertung 社会媒体中的敏感内容分类:综合资源和评价 2411.19832v3 -
192 06-24 Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR Genau, schnell, günstig: Drei auswählen. Multi-Head-Attention durch bidirektionale rekurrente Attention für Long-Form-ASR ersetzen 准确、快速、廉价:三者兼得。以双向循环注意力取代多头注意力,用于长篇ASR 2506.19761v1 -
193 06-24 Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis Arabische Dialektklassifikation mit RNNs, Transformern und großen Sprachmodellen: Eine vergleichende Analyse 使用RNN、变换器和大语言模式的阿拉伯语方言分类:比较分析 2506.19753v1 -
194 06-24 “I know myself better, but not really greatly”: How Well Can LLMs Detect and Explain LLM-Generated Texts? “Ich kenne mich selbst besser, aber nicht wirklich sehr gut”: Wie gut können LLMs LLM-generierte Texte erkennen und erklären? “我更了解自己,但并不十分了解”:“LLMs”如何能探测和解释LLM创世纪的文字? 2502.12743v2 -
195 06-24 NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking NEAR$^2$: Ein verschachtelter Einbettungsansatz für effizientes Produkt-Retrieval und Ranking NEAR$^2$:高效产品检索和排序的嵌套嵌入方法 2506.19743v1 -
196 06-24 Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? Breaking Barriers: Gewinnt die Verstärkung von Posttrainings die Übertragung auf ungesehene Domains? 突破障碍:加强培训后收益是否转移到未知领域? 2506.19733v1 -
197 06-24 jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit jina-embeddings-v4:多语种多式联运回收通用嵌入式 2506.18902v2 -
198 06-24 Detecting Machine-Generated Texts: Not Just “AI vs Humans” and Explainability is Complicated Maschinengenerierte Texte erkennen: Nicht nur “AI vs Menschen” und Erklärbarkeit ist kompliziert 检测机器生成的文字: 不只是“ AI 和 人类 ” , 解释复杂 2406.18259v2 -
199 06-24 Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving Lokale Look-Ahead-Anleitung über Verifier-in-the-Loop für automatisierte Theorem-Proving 通过自动理论验证人在线验证人进行自动理论验证,指导当地目视中心 2503.09730v2 -
200 06-24 Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models Ausreißer-sicheres Pre-Training für robuste 4-Bit Quantisierung großer Sprachmodelle 大语言模式强力四比四比四的量化培训前培训 2506.19697v1 -
201 06-24 Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation Recurrent Visual Feature Extraction und Stereo-Attention für CT-Report-Generierung 用于CT报告生成的循环视觉特征提取与立体注意力 2506.19665v1 -
202 06-24 Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager 超出LLM的定制对话:基于 RL 的对话管理器 2506.19652v1 -
203 06-24 Language Model Re-rankers are Fooled by Lexical Similarities Sprachmodell-Reranker werden durch Lexikalische Ähnlichkeiten ausgeblendet 语言模式重新排名者被古典相似性所愚弄 2502.17036v2 -
204 06-24 Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning Recht ist nicht genug: Die Pitfalls von Outcome Supervision in der Ausbildung LLMs for Math Reasoning 权利还是不够的:数学原因培训优等生培训成果监督的空洞 2506.06877v2 -
205 06-24 Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge Halluzinationen in News-Zusammenfassungen korrigieren: Erforschung von selbstkorrekten LLM-Methoden mit externem Wissen 新闻摘要:探索利用外部知识自行校正的LLM方法 2506.19607v1 -
206 06-24 Social Hatred: Efficient Multimodal Detection of Hatemongers Sozialer Hass: Effiziente multimodale Erkennung von Hatemongern 社会仇恨:高效的多模态仇恨煽动者检测 2506.19603v1 -
207 06-24 PATCH! {P}sychometrics-{A}ssis{T}ed Ben{CH}marking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics PATCH! {P}sychometrics-{A}ssis{T}ed Ben{CH}Markierung von großen Sprachmodellen gegen die menschliche Bevölkerung: Eine Fallstudie der Kompetenz in der Mathematik der 8. Klasse 切奇! {P} sycyclogics- {A}ssis{T} Ben{CH} 标记针对人类人口的大语言模型: 八年级数学能力案例研究 2404.01799v3 -
208 06-24 Large Language Models as Span Annotators Große Sprachmodelle als Span-Annotatoren 大语言模型作为 Span 标注器 2504.08697v2 -
209 06-24 ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model ECCoT: Ein Rahmen für die Verbesserung der effektiven Kognition durch Gedankenkette im großen Sprachmodell ECCoT:通过大语言模型中的思维链加强有效认知的框架 2506.19599v1 -
210 06-24 ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation ConciseHint: Effiziente Reasonierung durch kontinuierliche ConciseHints während der Generation 简明提示: 生成期间通过连续压缩提示促进高效理性 2506.18810v2 -
211 06-24 KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation KAG-Thinker: Interactive Thinking und Deep Reasoning in LLMs über wissensbasierte Generation KAG- Thinker: 通过知识型一代在LLMs中互动思考和深智 2506.17728v2 -
212 06-24 Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects Fake oder Real, Können Roboter erzählen? Evaluieren von körpereigenen Vision-Sprachenmodellen auf realen und 3D-gedruckten Objekten 假的还是假的,机器人能告诉吗?评价关于真物和3D实用物的内嵌视觉语言模型 2506.19579v1 -
213 06-24 Benchmarking the Pedagogical Knowledge of Large Language Models Benchmarking der pädagogischen Kenntnisse großer Sprachmodelle 确定大语言模式教学知识基准 2506.18710v2 -
214 06-24 Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress Hat die maschinelle Übersetzungsbewertung die menschliche Parität erreicht? Die menschliche Referenz und die Grenzen des Fortschritts 机器翻译评价是否实现了人类平等?人类参考和进展的极限 2506.19571v1 -
215 06-24 GeistBERT: Breathing Life into German NLP GeistBERT: Der deutschen NLP Leben einhauchen GeistBERT:为德语NLP注入生命 2506.11903v3 -
216 06-24 ChatSR: Multimodal Large Language Models for Scientific Formula Discovery ChatSR: Multimodale große Sprachmodelle für die wissenschaftliche Formel-Discovery ChatSR: 科学公式发现多模式大语言模型 2406.05410v2 -
217 06-24 DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v2 -
218 06-24 RCStat: A Statistical Framework for using Relative Contextualization in Transformers RCStat: Statistischer Rahmen für die Verwendung der relativen Kontextualisierung in Transformern RCStat: 在变异器中使用相对环境化的统计框架 2506.19549v1 -
219 06-24 Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection Health Sentinel: Eine KI-Pipeline zur Erkennung von Krankheiten in Echtzeit 健康哨兵:AI 实时疾病爆发检测管道 2506.19548v1 -
220 06-24 KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs KnowMap: Effiziente, wissensgetriebene Aufgabenanpassung für LLMs KnowMap: 高效知识驱动任务适应LLMs 2506.19527v1 -
221 06-24 Automatic Posology Structuration : What role for LLMs? Automatische Posologie Structuration : Welche Rolle spielen LLMs? 自动人口结构结构:LLMS发挥什么作用? 2506.19525v1 -
222 06-24 heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation heiDS bei ArchEHR-QA 2025: Von Fixed-k zu Query-dependent-k für Retrieval Augmented Generation ArchEHR-QA 2025:从固定公里到求回增代的依靠查询公里 2506.19512v1 -
223 06-24 AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization für KV Cache in großen Sprachmodellen AnTKV: 大语言模型中 KV 缓存的 Anchor Token- Aware 子Bit 矢量定量 2506.19505v1 -
224 06-24 NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling NaviAgent: Bilevel-Planung auf Werkzeugabhängigkeitsgraphen für Funktionsaufruf NaviAgent: 功能调用工具依赖图双层规划 2506.19500v1 -
225 06-24 Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs Ist lange bis kurz ein kostenloses Mittagessen? Untersuchung von Inkonsistenz und Reasoning-Effizienz in LRMs 长到短是免费午餐吗?调查LRM中的不一致性和推理效率 2506.19492v1 -
226 06-24 Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning Dialogische Pädagogik für große Sprachmodelle: Konversationale KI mit bewährten Theorien des Lernens ausrichten 大语言模型对话教学法:使对话式AI与经过验证的学习理论相一致 2506.19484v1 -
227 06-24 Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models Commonsense-Erstellung und -Evaluierung für Dialogsysteme mit großen Sprachmodellen 使用大语言模式生成和评估对话系统 2506.19483v1 -
228 06-24 LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages LEVOS: Leveraging Vocabulary Overlap mit Sanskrit zur Generierung von technischen Lexikons in indischen Sprachen LEVOS:利用梵文的词汇重叠来生成印度语言的技术词汇 2407.06331v2 -
229 06-24 MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages MuBench: Bewertung der Mehrsprachigkeit von großen Sprachmodellen in 61 Sprachen MuBench:评估大语言模型在61种语言中的多语种能力 2506.19468v1 -
230 06-24 Can Large Language Models Capture Human Annotator Disagreements? Können große Sprachmodelle menschliche Annotatoren-Disagreements erfassen? 大语言模型能捕捉人类代言人分歧吗? 2506.19467v1 -
231 06-24 Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights Mehrsprachige Tokenisierung durch die Gläser indischer Sprachen: Herausforderungen und Einsichten 通过印度语言的镜头多语种教学:挑战和洞察 2506.17789v2 -
232 06-24 TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems TTSDS2: Ressourcen und Benchmark für die Bewertung von Human-Quality Text to Speech Systems TTSDS2:评价人类质量文本转语音系统的资源和基准 2506.19441v1 -
233 06-24 Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System Mem4Nav: Förderung der Vision-und-Sprache-Navigation in städtischen Umgebungen mit einem Hierarchischen Raum-Kognition Long-Short-Memory-System Mem4Nav:利用一个等级空间-定位长短记忆系统促进城市环境中的视觉和语言导航 2506.19433v1 -
234 06-24 Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study Lernen, mit Sprach-VAEs latente Regeln zu entwirren: Eine systematische Studie 学习解开语言 VAE 的边端理由解释规则:系统研究 2506.19418v1 -
235 06-24 Automated Detection of Pre-training Text in Black-box LLMs Automatisierte Erkennung von Vorschulungstexten in Black-Box LLMs 黑箱LMM 培训前文字自动检测 2506.19399v1 -
236 06-24 Statistical Multicriteria Evaluation of LLM-Generated Text Statistische Multikriterien Bewertung des LLM-generierten Textes 对LLLM-创用文字的多标准统计评价 2506.18082v2 -
237 06-24 Measuring and Guiding Monosemanticity Messen und Führen von Monosemantik 衡量和指导单项措施 2506.19382v1 -
238 06-24 ReDit: Reward Dithering for Improved LLM Policy Optimization ReDit: Belohnung für verbesserte LLM-Policy-Optimierung Redit:为改进LLM政策优化而向优利分差 2506.18631v2 -
239 06-24 SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents SpokenWOZ: Ein großformatiger Sprach-Text-Benchmark für gesprochene Task-Orientierte Dialog-Agenten SpokenWOZ:针对口语任务型对话代理的大型语音-文本基准 2305.13040v7 -
240 06-24 Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation Aus-von-Charakter-Behavior zu erkennen: Atom-Level-Bewertung von Persona Fidelity in Open-Ended Generation 《查哈拉彻查行为外的观察:对不限名额的一代人助人行为的原子层面评价》 2506.19352v1 -
241 06-24 In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly In-Context Occams Razor: Wie Transformer einfachere Hypothesen auf der Fliege bevorzugen 内文 Occam 的剃刀: 如何在飞行中发生变形人更倾向于简单易碎的假说 2506.19351v1 -
242 06-24 Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations Analyse der Wissensgrenzen-Kognition von LLMs über Sprachen hinweg durch die Linse interner Repräsentationen 从内部表征视角分析LLM跨语言的知识边界认知 2504.13816v3 -
243 06-24 RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung RAG+:以应用感知推理增强检索增强生成 2506.11555v2 -
244 06-24 FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换 2505.23966v2 -
245 06-24 JCAPT: A Joint Modeling Approach for CAPT JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT JCAPT: CAPT的联合示范方法 2506.19315v1 -
246 06-24 Long-Context Generalization with Sparse Attention Lang-Kontext Verallgemeinerung mit Sparse Achtung 以粗略的注意力概括化长文本 2506.16640v2 -
247 06-24 Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs Skywork-SWE: Enthüllen von Datenskalierungsgesetzen für Software-Engineering in LLMs Skywork-SWE:揭示LLM中软件工程的数据缩放法则 2506.19290v1 -
248 06-24 Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks Bewertung der transparenten Begründung in großen Sprachmodellen für buchhalterische kritische Aufgaben 评价可问责关键任务大语言模式的透明理由评估 2408.01933v5 -
249 06-24 Disentangling Reasoning and Knowledge in Medical Large Language Models Entwirren von Vernunft und Wissen in medizinischen großen Sprachmodellen 在医疗大语言模式中分离理由和知识 2505.11462v2 -
250 06-24 EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition EmoStage: Ein Rahmen für eine präzise Empathetic Response Generation über Perspective-Taking und Phasenerkennung 电站:通过透视和分阶段识别准确产生同情反应的框架 2506.19279v1 -
251 06-24 Process Reward Models That Think Prozess Belohnung Modelle, die denken 思考的流程奖励模式 2504.16828v3 -
252 06-24 Personality Prediction from Life Stories using Language Models Persönlichkeitsvorhersage aus Lebensgeschichten mit Sprachmodellen 使用语言模型对生活故事的个性预测 2506.19258v1 -
253 06-24 MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models MSR-Align: Policy-Grounded Multimodal Alignment für sicherheitsbewusste Begründung in Vision-Language-Modellen MSR-对称:在愿景-语言模型中安全意识理由的政策-全方位多式联运协调 2506.19257v1 -
254 06-24 Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v1 -
255 06-24 Augmenting Multi-Agent Communication with State Delta Trajectory Augmenting Multi-Agent Kommunikation mit State Delta Trajektorie 增强与国家三角三角洲轨迹的多代理通信 2506.19209v1 -
256 06-23 (1) Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition Bayesische evolutionäre Schwarmarchitektur: Ein formales epistemisches System, das im wahrheitsbasierten Wettbewerb verankert ist Bayesian 进化型蜂轮结构:以基于真相的竞争为基础的正式流行系统 2506.19191v1 -
257 06-23 Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages Prompt, Übersetzen, Fine-Tune, Re-Initialize, oder Instruction-Tune? Anpassung von LLMs für das In-Context-Lernen in Low-Resource-Sprachen 迅速、翻译、微调、微调、重新启动或指示-图? 2506.19187v1 -
258 06-23 Transferring Features Across Language Models With Model Stitching Übertragung von Funktionen über Sprachmodelle mit Modellstich 使用模型裁剪的跨语言模型传输功能 2506.06609v2 -
259 06-23 Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data Verbesserter Hybrid-Transducer und Aufmerksamkeits-Encoder-Decoder mit Textdaten 带有文本数据的增强混合转换器和注意编码器解码器 2506.19159v1 -
260 06-23 ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs ProxSparse: Regularisiertes Lernen von halbstrukturierten Sparsity Masken für vortrainierte LLMs ProxSparse:为预先培训的LMM 定期学习半结构化半结构化的顶罩 2502.00258v2 -
261 06-23 Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series Time-IMM: Ein Datensatz und Benchmark für irreguläre multimodale multivariate Zeitreihen Time-IMM:不规则多模态多变量时间序列的数据集和基准 2506.10412v2 -
262 06-23 TRAIL: Trace Reasoning and Agentic Issue Localization TRAIL: Nachvollziehen von Vernunft und Agentik Lokalisierung TRAIL:追踪理由和制剂问题地方化 2505.08638v3 -
263 06-23 Human-Aligned Faithfulness in Toxicity Explanations of LLMs Menschlich ausgerichtete Treue in der Toxizität Erklärungen von LLMs 人与人之间和谐的对LLMM 的毒性解释的信念 2506.19113v1 -
264 06-23 ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities ADVLLM: Iterative Selbst-Tuning LLMs für verbesserte Jailbreaking-Fähigkeiten ADVLLM: 强化破室能力自动自动自调LMs 2410.18469v4 -
265 06-23 Small Language Models in the Real World: Insights from Industrial Text Classification Kleine Sprachmodelle in der realen Welt: Einblicke aus der industriellen Textklassifikation 《现实世界中的小语言模式:对工业文本分类的洞察》 2505.16078v3 -
266 06-23 Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting Sprachmodelle verstehen dich vielleicht nicht: Theorie des Geistes über Story Prompting bewerten 语言模型可能无法理解你:通过故事提示评估心理理论 2506.19089v1 -
267 06-23 MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation MFTCXplain: Ein Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärung MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集 2506.19073v1 -
268 06-23 HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models HAWAII: Hierarchischer Wissenstransfer für effiziente Vision-Sprachenmodelle HAWAII:通过高层次视觉知识转让促进高效视觉语言模型 2506.19072v1 -
269 06-23 NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching NLPnorth @ TalentCLEF 2025: Vergleich diskriminativer, kontrastiver und prompt-basierter Methoden für Jobtitel und Skill Matching NLPnorth @ TalentCLEF 2025:比较判别式、对比式和基于提示的职称和技能匹配方法 2506.19058v1 -
270 06-23 Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages Auswirkungen des visuellen Kontextes auf geräuschreiche multimodale NMT: Eine empirische Studie für Englisch-Indisch-Sprachen 视觉背景对噪音、多式NMT:英语对印度语经验研究的影响 2308.16075v2 -
271 06-23 Rational Metareasoning for Large Language Models Rationales Metareasoning für große Sprachmodelle 大语言模型的理性元推理 2410.05563v3 -
272 06-23 Self-reflecting Large Language Models: A Hegelian Dialectical Approach Selbstreflektierende große Sprachmodelle: Ein hegelianischer dialektischer Ansatz 自我反映大语言模型:海格利人对立方法 2501.14917v6 -
273 06-23 Plan for Speed – Dilated Scheduling for Masked Diffusion Language Models Plan für Geschwindigkeit – Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle 速度计划 – – 蒙面传播语言模型的压缩排程计划 2506.19037v1 -
274 06-23 Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations Gebrochene Token? Ihr Sprachmodell kann geheim nicht-kanonische Tokenisierungen handhaben 断断的音调? 您的语言模式可以秘密处理非天体的音调 。 2506.19004v1 -
275 06-23 Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge Mirage of Mastery: Memorization Tricks LLMs in künstlich aufgeblasenes Selbstwissen 万能万能幻影:记忆的秘诀 人工膨胀的自我知识 2506.18998v1 -
276 06-23 Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Vision als Dialekt: Visuelle Verständigung und Generierung über textorientierte Repräsentationen vereinen 视觉透视:通过文本统一代表方式统一视觉理解和生成 2506.18898v1 -
277 06-23 ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs ReasonFlux-PRM: Trajektorienbewusste PRMs für langes Chain-of-Thought-Reasoning in LLMs ReasonFlux-PRM:LLM中用于长思维链推理的轨迹感知PRM 2506.18896v1 -
278 06-23 OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization OMEGA: Kann LLMs Vernunft außerhalb der Box in Mathe? Bewertung von exploratorischen, kompositorischen und transformativen Generalisierung OMEGA: 数学框外的理学理学优异人士能否评价探索、构成和变换的通用性? 2506.18880v1 -
279 06-23 CommVQ: Commutative Vector Quantization for KV Cache Compression CommVQ: Kommutative Vector Quantization für KV Cache Compression CommVQ:用于 KV 缓存压缩的可交换矢量量化 2506.18879v1 -
280 06-23 A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap Ein Kommentar zu “Die Illusion des Denkens”: Den vernünftigen Cliff als Agent-Gap zurückweisen 关于“思考的幻觉”的评论:将理性断裂重新定位为一种危险差距 2506.18957v1 -
281 06-23 Mechanistic Interpretability Needs Philosophy Mechanistische Interpretierbarkeit braucht Philosophie 机制可解释性需要哲学 2506.18852v1 -
282 06-23 USAD: Universal Speech and Audio Representation via Distillation USAD: Universelle Sprach- und Audiodarstellung über Destillation USAD:通过蒸馏实现普遍言论和音频代表 2506.18843v1 -
283 06-23 LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning LongWriter-Zero: Mastering Ultra-Long Text Generation via Verstärkungslernen LongWriter-Zero:通过强化学习掌握超大龙制文本 2506.18841v1 -
284 06-23 EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions EMULATE: Multi-Agenten-Rahmen für die Bestimmung der Veracity von atomaren Forderungen durch Emulation menschlicher Handlungen EMULATE:通过模拟人类行动确定原子索赔的多机构框架 2505.16576v2 -
285 06-23 STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning STU-PID: Steuerungs-Token-Verwendung über PID-Controller für effizientes Large Language Model Reasoning STU-PID:通过PID控制器用于高效大语言示范理由的指导使用 2506.18831v1 -
286 06-23 MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation Translation task MLLP-VRAIN UPV-System für die IWSLT 2025 Simultanübersetzung MLLP-VRAIN IWSLT 2025 IWSLT 双声翻译翻译任务MLLP-VRAIN UPV系统 2506.18828v1 -
287 06-23 A Survey on Data Selection for LLM Instruction Tuning Eine Umfrage zur Datenauswahl für LLM Instruction Tuning 关于LLM指示图示数据选择的调查 2402.05123v2 -
288 06-23 RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies RWEZusammenfassung: Ein Rahmen und Test zur Auswahl großer Sprachmodelle zur Zusammenfassung von Real-World Evidence (RWE) Studien RWE 摘要:为总结真实世界证据研究而选择大语言模型的框架和测试 2506.18819v1 -
289 06-23 Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models Schritt für Schritt Entlarvung für Parameter-Effizient Feinabstimmung von großen Sprachmodellen 大型语言模型的分步制式分步式微微调 2408.14470v3 -
290 06-23 Existing LLMs Are Not Self-Consistent For Simple Tasks Bestehende LLMs sind für einfache Aufgaben nicht selbstkonsistent 现有的LLM在简单任务上不具有自我一致性 2506.18781v1 -
291 06-23 Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training Programmierung durch Backprop: LLMs Erwerben wiederverwendbarer algorithmischer Abstraktionen während der Code-Schulung 通过反向传播编程:LLM在代码训练期间习得可复用的算法抽象 2506.18777v1 -
292 06-23 Neural Total Variation Distance Estimators for Changepoint Detection in News Data Neurale Gesamtvariationsdistanz-Schätzer für Changepoint Detection in News Daten 用于新闻数据中变化点探测变化点的神经总变化 2506.18764v1 -
293 06-23 SEAL: Scaling to Emphasize Attention for Long-Context Retrieval SEAL: Skalierung zur Betonung der Aufmerksamkeit für die Langzeitretrieval-Retrieval SEAL: 逐步强调对长期检索的重视 2501.15225v2 -
294 06-23 Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX Auge des Urteils: Die Bewertung der russischsprachigen LLMs mit POLLUX 判断之眼:用POLLUX对讲俄语的LLMs的评价进行分解 2505.24616v2 -
295 06-23 Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation Multimodaler Ankerverteiler mit Wissensdestillation zur Emotionserkennung im Gespräch 具有知识蒸馏的多式锁定器变异器,在对话中承认情感 2506.18716v1 -
296 06-23 Handling Numeric Expressions in Automatic Speech Recognition Umgang mit numerischen Ausdrücken bei der automatischen Spracherkennung 在自动语音识别中处理数字表达式 2408.00004v2 -
297 06-23 Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition Kontext Biasing für Aussprachen-Orthographie Missverhältnis in der automatischen Spracherkennung 自动语音识别中出现偏差以引发-正正对学误差的背景情况 2506.18703v1 -
298 06-23 Better Language Model Inversion by Compactly Representing Next-Token Distributions Bessere Sprachmodellumwandlung durch kompakte Darstellung von Next-Token-Distributionen 由代表下移分发的语法缩略语 2506.17090v2 -
299 06-23 HausaNLP at SemEval-2025 Task 11: Hausa Text Emotion Detection HausaNLP bei SemEval-2025 Aufgabe 11: Hausa Text Emotion Detection SemEval-2025任务11:Hausa短信情感检测 2506.16388v2 -
300 06-23 “I understand why I got this grade”: Automatic Short Answer Grading with Feedback “Ich verstehe, warum ich diese Note bekommen habe”: Automatisches Kurzbeantworten mit Feedback “我理解我为什么得到这个分级”: 自动简短的回答,加上反馈 2407.12818v2 -
301 06-23 Is There a Case for Conversation Optimized Tokenizers in Large Language Models? Gibt es einen Fall für Gespräche optimierte Tokenizer in großen Sprachmodellen? 在大语言模型中是否有“最优化调价器”对话的例子? 2506.18674v1 -
302 06-23 C-SEO Bench: Does Conversational SEO Work? C-SEO Bench: Funktioniert gesprächige SEO? C-SEO Bench:对话式SEO有效吗? 2506.11097v2 -
303 06-23 Alignment Helps Make the Most of Multimodal Data Alignment hilft, multimodale Daten optimal zu nutzen 对齐帮助大多数多模式数据 2405.08454v3 -
304 06-23 Pretraining Language Models to Ponder in Continuous Space Vorschulung von Sprachmodellen im kontinuierlichen Raum 连续空间Ponder语言模型培训前 2505.20674v2 -
305 06-23 ByteSpan: Information-Driven Subword Tokenisation ByteSpan: Informationsgetriebene Subwort-Tokenisierung ByteSpan: 信息驱动小词 2506.18639v1 -
306 06-23 AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs AggTruth: Kontextuelle Halluzination Detection mit aggregierten Aufmerksamkeits-Scores in LLMs AggTruth:利用LLM中的聚合注意力分数进行上下文幻觉检测 2506.18628v1 -
307 06-23 The Anatomy of Speech Persuasion: Linguistic Shifts in LLM-Modified Speeches Die Anatomie der Sprachüberzeugung: Linguistische Verschiebungen in LLM-modifizierten Reden 语音解剖分析:LLM-修改后的语音语言变化 2506.18621v1 -
308 06-23 Semantic similarity estimation for domain specific data using BERT and other techniques Semantische Ähnlichkeitsschätzung für bereichsspezifische Daten unter Verwendung von BERT und anderen Techniken 使用BERT和其他技术对具体领域数据进行语义相似性估计 2506.18602v1 -
309 06-23 Reply to “Emergent LLM behaviors are observationally equivalent to data leakage” Antwort auf “Emergente LLM-Verhalten sind Beobachtungsäquivalent zu Daten Leckage” 对“紧急LLM行为”的答复在观测上等同于数据泄漏” 2506.18600v1 -
310 06-23 No Training Wheels: Steering Vectors for Bias Correction at Inference Time Keine Trainingsräder: Lenk-Vektoren für Bias-Korrektur zur Inferenzzeit 无培训轮:推论时间比亚更正指导矢量 2506.18598v1 -
311 06-23 Airalogy: AI-empowered universal data digitization for research automation Airalogy: KI-fähige universelle Daten-Digitalisierung für die Forschungsautomatisierung 航空学:用于研究自动化的AI-动力通用数据数字化 2506.18586v1 -
312 06-23 Parallel Continuous Chain-of-Thought with Jacobi Iteration Parallele Kontinuierliche Kette von Gedanken mit Jacobi Iteration 基于Jacobi迭代的并行连续思维链 2506.18582v1 -
313 06-23 A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance Modulare Taxonomie für Hass-Sprachdefinitionen und deren Auswirkungen auf die Leistungsfähigkeit von Null-Schuss-LLM-Klassifikation 仇恨言论定义的模块分类学及其对零沙粒LLM分类绩效的影响 2506.18576v1 -
314 06-23 When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking Wenn das Feintuning fehlschlägt: Lektionen aus dem MS MARCO Passage Ranking 罚款出错时:从MS MARCO通过分级中汲取的教训 2506.18535v1 -
315 06-23 LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies LLMs Lost in Translation: M-ALERT entdeckt Cross-Linguistic Safety Inkonsistenzen LLMs 失落于翻译:M-ALERT发现跨语言安全不一致 2412.15035v3 -
316 06-23 Affordable AI Assistants with Knowledge Graph of Thoughts Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken 具有知识思想知识图的负担得起的AI助理 2504.02670v4 -
317 06-23 End-to-End Spoken Grammatical Error Correction End-to-End-Spoken Grammatical Error Correction 端端到端口语语语法错误校正 2506.18532v1 -
318 06-23 Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? Pilotieren von Copilot, Codex und StarCoder2: Heiße Temperatur, kalte Prompts oder schwarze Magie? 联合飞行员 代码代码和星际代码2: 热温、冷感或黑魔法? 2210.14699v3 -
319 06-23 ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping ASCenD-BDS: Anpassbares, stochastisches und kontextbewusstes Framework zur Erkennung von Bias, Diskriminierung und Stereotypisierung ASCenD-BDS:可适应、可储存和符合实际情况的发现偏见、歧视和陈规定型观念框架 2502.02072v2 -
320 06-23 HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge HiRAG: Retrieval-Augmented Generation mit Hierarchischem Wissen HIRAG: 具有等级知识的回收养代 2503.10150v2 -
321 06-23 MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis MedTVT-R1: Eine multimodale LLM-Empowering medizinischer Vernunft und Diagnose MedTVT-R1:增强医疗推理和诊断能力的多模态LLM 2506.18512v1 -
322 06-23 Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts Smooth Operators: LLMs übersetzen unvollkommene Hinweise in Disfluency-Rich-Transkriptionen 平滑运算符: LLMs 将不合格提示转换成 disfunency- Rich 轨迹 2506.18510v1 -
323 06-23 Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung ChatGPT与DeepSeek在关键NLP任务上的比较评估:优势、弱点和特定领域表现 2506.18501v1 -
324 06-23 AI-Generated Song Detection via Lyrics Transcripts AI-Generated Song Detection via Lyrics Transcripts AI 创名歌曲通过歌词谱状探测 2506.18488v1 -
325 06-23 MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models MeRF: Motivierungs-verstärkte Verstärkungs-Feinsteuerung für große Vernunftmodelle MERF: 大力加强大型理由模型的强化强化微调 2506.18485v1 -
326 06-23 MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v3 -
327 06-23 LLMs on a Budget? Say HOLA LLMs auf einem Budget? Sagen Sie HOLA 预算LLLM 预算? 2506.18952v1 -
328 06-23 Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset Richtige Substantive Diakritisierung für Arabisch Wikipedia: Ein Benchmark-Datensatz 阿拉伯维基百科:基准数据集 2505.02656v3 -
329 06-23 PlantDeBERTa: An Open Source Language Model for Plant Science PlantDeBERTa: Ein Open Source Sprachmodell für die Pflanzenwissenschaft PlantDeBERTa:植物科学开放源语言模型 2506.08897v3 -
330 06-23 OAgents: An Empirical Study of Building Effective Agents OAgents: Eine empirische Studie über den Aufbau effektiver Agenten 业务主管:建立有效代理的经验研究 2506.15741v2 -
331 06-23 Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models Circuit Compositions: Erforschen von modularen Strukturen in transformerbasierten Sprachmodellen 电路构成:探索以变换语言模式为基础的模块结构 2410.01434v3 -
332 06-23 Compromising Honesty and Harmlessness in Language Models via Deception Attacks Kompromisse zwischen Ehrlichkeit und Harmlosigkeit in Sprachmodellen durch Täuschungsangriffe 通过欺骗性攻击破坏语言模式的诚实和无弊 2502.08301v2 -
333 06-23 TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models TReB: Umfassender Benchmark für die Bewertung der Tabellen-Reasoning-Fähigkeiten großer Sprachmodelle TReB:评估大语言模型表格推理能力的综合基准 2506.18421v1 -
334 06-23 Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen Infi-MMR:通过在多模态小型语言模型中分阶段强化学习,以课程为基础解锁多模态推理 2505.23091v3 -
335 06-23 Lemmatization as a Classification Task: Results from Arabic across Multiple Genres Lemmatisierung als Klassifikationsaufgabe: Ergebnisse aus dem Arabischen über mehrere Genres hinweg 作为一项分类任务, Lemmatiz化: 阿拉伯语在多个流派中产生的结果 2506.18399v1 -
336 06-23 SLR: An Automated Synthesis Framework for Scalable Logical Reasoning SLR: Ein automatisiertes Synthese-Framework für skalierbare logische Vernunft SLR: 一个可缩放逻辑理由的自动合成框架 2506.15787v2 -
337 06-23 Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics Bewertung von Kausalerklärungen in medizinischen Berichten mit LLM-basierten und von Menschen ausgerichteten Metrics 与以LLM为基础和人与人之间比较的计量单位在医疗报告中评价因果解释 2506.18387v1 -
338 06-23 Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control Song Form-aware Full-Song Text-to-Lyrics Generation mit Multi-Level Granularity Syllable Count Control Song 具有多层次颗粒可调制计数控制功能的有形式觉醒的全声全文本到文字生成器 2411.13100v3 -
339 06-23 A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages Eine rigorose Bewertung von LLM-Datenerstellungsstrategien für ressourcenarme Sprachen 对LLLM低资源语言数据生成战略的严格评价 2506.12158v2 -
340 06-23 Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations Factual Knowledge in Language Models: Robustheit und Anomalien unter einfachen zeitlichen Kontextvariationen 语言模型中的事实知识:简单时间环境变化下的强力和异常现象 2502.01220v6 -
341 06-23 RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming RePST: Sprachmodell empowered Spatio-Temporal Forecasting via Semantisch-orientierte Reprogrammierung REPST:通过以语义为主的重新编制方案来进行语言模型增强能力SPA-时间预报 2408.14505v3 -
342 06-23 SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation SlimMoE: Strukturierte Kompression großer MoE-Modelle über Expert Slimming und Destillation SlimMoE:通过专家攀爬和蒸馏对大型MOE模型进行结构性压缩 2506.18349v1 -
343 06-23 Systematic Reward Gap Optimization for Mitigating VLM Hallucinations Systematische Belohnungslückenoptimierung zur Minderung von VLM-Halluzinationen 降低VLM幻觉效应的系统回升差距优化 2411.17265v3 -
344 06-23 Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs Weniger Daten weniger Tokens: Mehrsprachiges Einheitslernen für effiziente Test-Time-Reasoning in LLMs 减减数据减缩代号:以多种语文统一学习提高LLM中测试时间的高效理由 2506.18341v1 -
345 06-23 Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) 位置是电源:系统提示作为大语言模型比阿语机制(LLMs) 2505.21091v3 -
346 06-23 TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance TranslationCorrect: Ein einheitliches Framework für maschinelle Übersetzungs-Post-Editing mit Predictive Error Assistance 翻译更正:机器翻译统一框架 2506.18337v1 -
347 06-23 HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States HiddenDetect: Ermitteln von Jailbreak Attacken gegen große Vision-Sprache Modelle durch Überwachung Hidden States 通过监测隐藏国,侦查对大型视觉-语言模型的越狱袭击 2502.14744v4 -
348 06-23 Enhancing Entity Aware Machine Translation with Multi-task Learning Erweitern Entity Aware Machine Translation mit Multi-Task-Lernen 以多任务学习方式增强实体意识机器翻译 2506.18318v1 -
349 06-23 Team LA at SCIDOCA shared task 2025: Citation Discovery via relation-based zero-shot retrieval Team LA bei SCIDOCA geteilte Aufgabe 2025: Citation Discovery über relationsbasierte Null-Shot-Retrieval SCIDOCOCA的LA小组共同承担2025年任务:通过基于关系零发检索发现引文 2506.18316v1 -
350 06-23 Enhancing Document Retrieval in COVID-19 Research: Leveraging Large Language Models for Hidden Relation Extraction Verbesserung der Dokument-Retrieval in COVID-19 Forschung: Nutzung großer Sprachmodelle für versteckte Beziehungsextraktion 在COVID-19研究中加强文件检索:利用大语言模型进行隐藏关系采掘 2506.18311v1 -
351 06-23 PlanGenLLMs: A Modern Survey of LLM Planning Capabilities PlanGenLLMs: Eine moderne Erhebung der LLM-Planungskapazitäten PlanGenLLMs:LLM规划能力现代调查 2502.11221v3 -
352 06-23 AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles AlzheimerRAG: Multimodale Retrieval Augmented Generation für klinische Anwendungsfälle mit PubMed Artikeln AlzheimerRAG: 使用PubMed文章、面向临床用例的多模态检索增强生成 2412.16701v2 -
353 06-23 LoRA vs Full Fine-tuning: An Illusion of Equivalence LoRA vs. Full Fine-Tuning: Eine Illusion der Gleichwertigkeit LoRA 与 完全微调: 等同的幻象 2410.21228v2 -
354 06-23 When Large Language Models Meet Vector Databases: A Survey Wenn große Sprachmodelle Vektordatenbanken treffen: Eine Umfrage 当大语言模型与矢量数据库相匹配时:调查 2402.01763v4 -
355 06-23 FutureFill: Fast Generation from Convolutional Sequence Models FutureFill: Schnelle Generation aus konvolutionären Sequenzmodellen 未来金融危机:从变序模型中快速繁衍 2410.03766v3 -
356 06-23 AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining AdaLRS: Loss-Guided Adaptive Learning Rate Suche nach effizientem Foundation Model Pretraining AdaLRS: 为高效基础基础示范培训前而寻找学习率 2506.13274v2 -
357 06-23 RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding RAPID: Lang-Kontext-Schlussfolgerung mit retrieval-Augmented Spekulative Decodierung RAPID: 利用回溯性增强的投机性代谢法进行长文本推理 2502.20330v2 -
358 06-23 Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework Sykophanz in Vision-Language-Modellen: Eine systematische Analyse und ein Inferenz-Zeit-Mitigation-Framework 视觉-语言模型中的谄媚:系统分析和推理时缓解框架 2408.11261v2 -
359 06-23 RLPR: Extrapolating RLVR to General Domains without Verifiers RLPR: Extrapolieren von RLVR auf allgemeine Domains ohne Prüfer RLPR: 将RLVR外推至普通域域,无验证符 2506.18254v1 -
360 06-23 Craw4LLM: Efficient Web Crawling for LLM Pretraining Craw4LLM: Effizientes Web-Crawling für LLM Pretraining Craw4LLM: 面向LLM预训练的高效网络爬取 2502.13347v3 -
361 06-23 Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models Chain-of-Experts: Entsperren der Kommunikationskraft von Mixture-of-Experts-Modellen 专家链:解锁混合专家模型的通信能力 2506.18945v1 -
362 06-23 From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents Von der RAG zur Agentur: Validierung islamisch-medizinischer Reaktionen mit LLM-Agenten 从RAG到AG 剂:用LLM代理物验证伊斯兰药物反应 2506.15911v2 -
363 06-23 AdapThink: Adaptive Thinking Preferences for Reasoning Language Model AdapThink: Adaptive Denkeinstellungen für das Sprachmodell der Vernunft AdapThink:推理语言模型的自适应思维偏好 2506.18237v1 -
364 06-23 NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts NovelHopQA: Diagnose von Multi-Hop-Vernünftigkeitsfehlern in langen narrativen Kontexten 新创HopQA:在长期叙述背景下诊断多种原因的失败 2506.02000v2 -
365 06-23 Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalisable ASR Models 推进非洲集中的语音识别:为可通用的 ASR 模型选择突发性不确定性驱动数据 2306.02105v7 -
366 06-22 (7) Shrinking the Generation-Verification Gap with Weak Verifiers Schrumpfung der Generation-Verifikationslücke mit schwachen Prüfern 利用薄弱的验证器缩小代代核查差距 2506.18203v1 -
367 06-22 Supernova Event Dataset: Interpreting Large Language Models’ Personality through Critical Event Analysis Supernova Event Dataset: Die Persönlichkeit großer Sprachmodelle durch kritische Ereignisanalyse interpretieren 超新星事件数据集:通过重大事件分析解释大语言模型的个性 2506.12189v2 -
368 06-22 Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications Entschlüsselung von Emotionen in Kindergeschichten: Eine vergleichende Analyse multimodaler LLMs in pädagogischen Anwendungen 儿童故事书中的破除情感:教育应用中多种模式LMs的比较分析 2506.18201v1 -
369 06-22 Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Moslems in Large Language Models: A Systematic Review 减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查 2506.18199v1 -
370 06-22 ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists ExpertLongBench: Benchmarking-Sprachmodelle auf Expertenebene Langform-Erstellungsaufgaben mit strukturierten Checklisten 专家关系:专家级长期世代任务的语言模式基准与结构化核对清单 2506.01241v2 -
371 06-22 CareLab at #SMM4H-HeaRD 2025: Insomnia Detection and Food Safety Event Extraction with Domain-Aware Transformers CareLab auf der #SMM4H-HeaRD 2025: Schlaflosigkeitserkennung und Lebensmittelsicherheitsveranstaltung Extraktion mit Domain-Aware Transformern #SMM4H-HeARD 2025:与域-软件变换器一起的失眠检测和食品安全事件提取 2506.18185v1 -
372 06-22 Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v1 -
373 06-22 Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness Mehrsprachige retrieval Augmented Generation für kulturell-sensitive Aufgaben: Ein Benchmark für übergreifende Robustheit 文化敏感性任务多语种检索增强一代人:跨语种强力基准 2410.01171v3 -
374 06-22 QuranMorph: Morphologically Annotated Quranic Corpus KoranMorph: Morphologisch kommentierter Koran Corpus 《古兰经》: 体质说明 2506.18148v1 -
375 06-22 Enhancing LLM Knowledge Learning through Generalization Verbesserung des LLM-Wissenslernens durch Verallgemeinerung 通过普遍化加强LLM知识学习 2503.03705v2 -
376 06-22 Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models Sparse-Feature-Koaktivierung zeigt kombinierbare semantische Module in großen Sprachmodellen 大语言模型中可合成的语义模块 2506.18141v1 -
377 06-22 Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs Collage: Ablösbares Rapid Prototyping für Informationsextraktion auf wissenschaftlichen PDFs 结合:可分解的用于科学文件格式信息提取的快速原型 2410.23478v2 -
378 06-22 SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging SE-Merging: Ein selbsterweiterter Ansatz für dynamisches Modellverschmelzen SE-Merging:动态模型合并的自我加强办法 2506.18135v1 -
379 06-22 $φ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models $φ^{\infty}$: Clause Purification, Einbettung von Neuausrichtung und die totale Unterdrückung des Em Dash in autoregressiven Sprachmodellen $φ^{\infty}$: 条款净化、嵌入的调整和在自回归语言模型中全面抑制Em Dash 2506.18129v1 -
380 06-22 The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English The Syntactic Acceptability Dataset (Preview): Eine Ressource für maschinelles Lernen und sprachliche Analyse des Englischen 同步可接受数据集(预评):英语机器学习和语言分析资源 2506.18120v1 -
381 06-22 Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives Psychische Gesundheit in LLMs: Multi-Hop-Frage beantworten, um verstärkte und stillschweigende Perspektiven zu erkennen LLMs中的心理健康平等:利用多种渠道问题回答发现放大和沉默的视角 2506.18116v1 -
382 06-22 Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use Chengyu-Bench: Benchmarking von großen Sprachmodellen für das Verständnis und die Verwendung chinesischer Redewendungen Chengyu-Bench:面向中文成语理解与使用的大语言模型基准 2506.18105v1 -
383 06-22 InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating InspireDebatte: Multi-Dimensional Subjektive-Zielive Evaluation-geführte Begründung und Optimierung zur Debattierung 预测性辩论:多维可主观性-客观评价-指导性逻辑和最佳评议优化 2506.18102v1 -
384 06-22 Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution Bewertung prompt-basierter und feinverstellter Ansätze zur tschechischen Anaphora-Resolution 评估对捷克阿原光病决议的即时和精确推荐方法 2506.18091v1 -
385 06-22 RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation RoboTwin 2.0: Ein skalierbarer Datengenerator und Benchmark mit starker Domain Randomisierung für robuste bimanuelle Robotermanipulation RoboTwin 2.0: 面向稳健双臂机器人操作、具有强域随机化的可扩展数据生成器和基准 2506.18088v1 -
386 06-22 TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking TrumorGPT: Graph-basiertes Retrieval-Augmented Large Language Modell für die Fact-Checking TrumorGPT: 用于实况调查的基于图表的检索增强型大语言模型 2505.07891v2 -
387 06-22 Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity 1568 Tokens in einen einzigen Vektor und wieder zurück krammen: Die Grenzen der Einbettung von Raumkapazität erkunden 将1568吨撞成单一矢量和后向:探索嵌入空间能力的极限 2502.13063v3 -
388 06-22 FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs FinGPT: Verbesserung der Sentiment-Based Stock Movement Prediction mit Verbreitungs-Bewusst und Kontext-angereicherten LLMs FINGPT:利用传播软件和内容丰富的LMs,加强基于情绪的库存流动预测 2412.10823v2 -
389 06-22 The Democratic Paradox in Large Language Models’ Underestimation of Press Freedom Das demokratische Paradox in großen Sprachmodellen’ Unterschätzung der Pressefreiheit 《大语言模式中的民主悖论》对新闻自由的低估 2506.18045v1 -
390 06-22 Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation Kreuz von links nach rechts Gehirn: Adaptiver Texttraum für Vision-und-Sprachen-Navigation 从左脑到右脑交叉:愿景和语言导航的适应性文本梦想者 2505.20897v2 -
391 06-22 MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval MM-R5: MultiModal reasoning-enhanced ReRanker über Verstärkungs-Lernen für Dokument-Retrieval MM-R5:通过文件检索强化学习加强文件检索,多模式合理改进Reanker 2506.12364v2 -
392 06-22 Markov-Enhanced Clustering for Long Document Summarization: Tackling the ‘Lost in the Middle’ Challenge with Large Language Models Markov-erweiterte Clustering für lange Dokumentzusammenfassung: Die Herausforderung ‘Verloren in der Mitte’ mit großen Sprachmodellen bewältigen Markov-加强长文档摘要集群:用大语言模式应对“中中途迷”挑战 2506.18036v1 -
393 06-22 Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices Splitformer: Eine verbesserte Early-Exit-Architektur für die automatische Spracherkennung an Kantengeräten 分解器:改进了边缘设备自动语音识别的提前退出结构 2506.18035v1 -
394 06-22 PDF Retrieval Augmented Question Answering PDF Retrieval Augmented Frage beantworten PDF 检索增强问答 2506.18027v1 -
395 06-22 LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs LongLLaDA: Entsperren langer Kontextkapazitäten in Diffusions-LLMs LongLLaDA:释放扩散LLM的长上下文能力 2506.14429v2 -
396 06-22 Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data Lernen aus Referenzantworten: Vielseitige Sprachmodellausrichtung ohne Binäre menschliche Präferenzdaten 从参考资料解答中学习:通用语言模型调整,无二元人类优先数据 2504.09895v2 -
397 06-22 AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs AlphaDecay: Modulweises Gewichtsdecay für schweres Balancing in LLMs AlphaDecay:LLMM中重帆平衡的舱型偏重衰减 2506.14562v2 -
398 06-22 GeAR: Graph-enhanced Agent for Retrieval-augmented Generation GeAR: Graphenverstärkter Agent für retrieval-augmentierte Generation GeAR: 回收后再生一代的图形增强剂 2412.18431v2 -
399 06-22 Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures Cross-Entropy-Spiele für Sprachmodelle: Vom Impliziten Wissen bis hin zu allgemeinen Fähigkeiten 语文模式跨英语运动会:从隐隐知识到一般能力措施 2506.06832v2 -
400 06-22 Reinforcement Learning Teachers of Test Time Scaling Verstärktes Lernen von Lehrern der Testzeitskalierung 测试时间尺度强化学习教师 2506.08388v2 -
401 06-22 A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment Umfassendes Graphen-Framework für die Beantwortung von Fragen mit Modesuche-Präferenzausrichtung 以模式寻找优先调整方式回答问题的综合图表框架 2506.17951v1 -
402 06-22 Scatter-Based Innovation Propagation in Large Language Models for Multi-Stage Process Adaptation Scatter-based Innovation Propagation in großen Sprachmodellen für Multi-Stage-Prozessanpassung 多阶段程序适应大语言模型中基于散散的革新创新推广 2506.17949v1 -
403 06-22 Tutorial: $\varphi$-Transductions in OpenFst via the Gallic Semiring Tutorial: $\varphi$-Übersetzungen in OpenFst über den Gallischen Semiring 教程: 通过Gallic半环在OpenFst中实现$\varphi$-转导 2506.17942v1 -
404 06-22 Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model Stream-Omni: Gleichzeitige multimodale Interaktionen mit großem Sprach-Vision-Sprachmodell 流流-奥米尼:与大语言-视觉-语音模型同时使用的多模式互动 2506.13642v2 -
405 06-22 Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective Evolving Prompts In-Context: Eine offene, sich selbst replizierende Perspektive 不断演变的加速:一个开放的、自我复制的视角 2506.17930v1 -
406 06-22 Improving the Efficiency of Long Document Classification using Sentence Ranking Approach Verbesserung der Effizienz der Langdokumentklassifikation mittels Sentence-Ranking-Ansatz 采用判决分级办法提高长文件分类的效率 2506.07248v2 -
407 06-22 LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference LightRetriever: Eine LLM-basierte Hybrid-Retrieval-Architektur mit 1000x schnellerer Query-Inferenz 光探索光: 基于 LLM 的混合回收结构, 具有 1000x 快速查询推断 2505.12260v2 -
408 06-22 Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset Perle: Ein multimodaler kulturbewusster arabischer Unterrichtsdatensatz 珍珠:多式文化-知识阿拉伯文教学数据集 2505.21979v2 -
409 06-22 Effective Red-Teaming of Policy-Adherent Agents Effektives Red-Teaming von Policy-Adherent Agents 有效的政策协调代理人红队 2506.09600v2 -
410 06-22 DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models DSGram: Dynamische Gewichtung von Sub-Metriken zur Korrektur von Grammatikfehlern im Zeitalter großer Sprachmodelle DSGram:大语言模型时代外貌错误校正动态加权子计量法 2412.12832v2 -
411 06-22 LGAI-EMBEDDING-Preview Technical Report LGAI-EMBEDDING-Preview Technischer Bericht LGAI-EMBEDDING-Preview 技术报告 2506.07438v2 -
412 06-22 SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback SIPDO: Closed-Loop Prompt Optimierung über Synthetic Data Feedback SIPDO:通过合成数据反馈,通过闭闭电话快速优化 2505.19514v2 -
413 06-22 Large Language Models for Disease Diagnosis: A Scoping Review Große Sprachmodelle für Krankheitsdiagnosen: Eine Bewertung 疾病诊断大语言模型:范围审查 2409.00097v3 -
414 06-22 ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training ECHO-LLaMA: Effizientes Caching für Hochleistungs-LLaMA-Training ECHO-LLaMA: 面向高性能LLaMA训练的高效缓存 2505.17331v2 -
415 06-22 Multi-turn Jailbreaking via Global Refinement and Active Fabrication Multiturn Jailbreaking über globale Veredelung und aktive Fabrikation 通过全球精炼和积极制造 2506.17881v1 -
416 06-22 How Alignment Shrinks the Generative Horizon Wie Alignment den generativen Horizont schrumpft 协同一致如何缩小生成地平线 2506.17871v1 -
417 06-22 QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs QueueEDIT: Strukturelle Selbstkorrektur für sequentielle Modellbearbeitung in LLMs QueueEDIT: LLM中序列模型编辑的结构化自我校正 2506.17864v1 -
418 06-22 LLMs for Customized Marketing Content Generation and Evaluation at Scale LLMs für maßgeschneiderte Marketing Content Generierung und Evaluation auf Scale 定制营销内容生成和评估规模评估LLM 2506.17863v1 -
419 06-22 Learning to Reason under Off-Policy Guidance Unter außerpolitischer Anleitung zur Vernunft lernen 根据非政策指导学习理由 2504.14945v5 -
420 06-21 (6) Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild Prototypische Mensch-AI-Kollaboration-Verhalten von LLM-Assisted Writing in the Wild 野外LLM协助协助写作者的合作行为 2505.16023v3 -
421 06-21 THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction THCM-CAL: Zeitlich-hierarchische Kausalmodellierung mit konformer Kalibrierung für klinische Risikovorhersage THCM-CAL: 临床风险预测与临床风险预测常规校准相结合的时高等级因果关系模型 2506.17844v1 -
422 06-21 Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iterativer Reweight-then-Optimize-Ansatz 通过强化学习对齐冻结的LLM:一种迭代的先重加权再优化方法 2506.17828v1 -
423 06-21 Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring Effiziente Multi-Task-Inferenzierung mit einem geteilten Backbone und leichten Task-Spezifischen Adaptern für die automatische Bewertung 与共享的后骨和轻型任务特定适应器进行自动 Scorting 2412.21065v2 -
424 06-21 Evaluating LLMs with Multiple Problems at once Bewertung von LLMs mit mehreren Problemen auf einmal 立即评价有多重问题的LLMs 2406.10786v3 -
425 06-21 Bayesian Social Deduction with Graph-Informed Language Models Bayesische soziale Deduktion mit Graphen-informierten Sprachmodellen 采用图形化语言模型的巴伊斯社会衰退 2506.17788v1 -
426 06-21 Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models Über die Instruktionskonditionierung hinaus, MoTE: Mischung von Task-Experten für Multi-Task-Einbettungsmodelle 超越教学-调控,MOTE:多任务嵌入模型任务专家混合 2506.17781v1 -
427 06-21 Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 Benchmarking und Aufbau eines Null-Shot Hindi Retrieval Modells mit Hindi-BEIR und NLLB-E5 使用Hindi-BEIR和NLLB-E5进行基准测试并构建零样本印地语检索模型 2409.05401v3 -
428 06-21 DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training DUMP: Automatisiertes Lehrplanlernen auf Verteilungsebene für RL-basiertes LLM-Post-Training DUMP: 面向基于RL的LLM后训练的自动化分布级课程学习 2504.09710v2 -
429 06-21 HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations HIDE und Suche: Halluzinationen in Sprachmodellen über entkoppelte Repräsentationen erkennen HIDE and Seek:通过解耦表示检测语言模型中的幻觉 2506.17748v1 -
430 06-21 Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages Enthüllungsfaktoren für ein verbessertes POS-Tagging: Eine Studie über ressourcenarme mittelalterliche romanische Sprachen 强化POS贴标签的难解因素:低资源中世纪罗姆语言研究 2506.17715v1 -
431 06-21 Aged to Perfection: Machine-Learning Maps of Age in Conversational English Gealtert bis zur Perfektion: Machine-Learning Alterskarten im Conversational English 成熟至完美:计算机学习时代图,英语 2506.17708v1 -
432 06-21 The Evolution of Natural Language Processing: How Prompt Optimization and Language Models are Shaping the Future Die Evolution der natürlichen Sprachverarbeitung: Wie Prompt-Optimierung und Sprachmodelle die Zukunft gestalten 自然语言处理的演变:提示优化和语言模型如何塑造未来 2506.17700v1 -
433 06-21 Zero-Shot Conversational Stance Detection: Dataset and Approaches Zero-Shot Conversational Stance Detection: Datensatz und Ansätze 零热对调调检测:数据集和方法 2506.17693v1 -
434 06-21 Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering Ressourcenfreundliche dynamische Verbesserungskette für Multi-Hop-Fragebeantwortung 多种问题解答资源友好型动态增强链 2506.17692v1 -
435 06-21 Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models Verbesserung der Few-Shot-Keyword-Spotting-Leistung durch vortrainierte selbstüberwachte Sprachmodelle 通过预训练自监督语音模型提升少样本关键词检测性能 2506.17686v1 -
436 06-21 Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference Vernunftschaltungen in Sprachmodellen: Eine mechanistische Interpretation der syllogistischen Inferenz 语言模型中的推理电路:对三段论推理的机制性解释 2408.08590v3 -
437 06-21 Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v2 -
438 06-21 FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies FaithfulSAE: Auf dem Weg zur Erfassung treuer Funktionen mit Sparse Autoencodern ohne externe Datensatzabhängigkeiten 忠实的SAE:在没有外部数据集依赖性的情况下, 与粗略自动解析器一起获取忠实的特征 2506.17673v1 -
439 06-21 TPTT: Transforming Pretrained Transformer into Titans TPTT: Transformieren des vortrainierten Transformers in Titanen TPTT: 将预训练变形器转换成巨人 2506.17671v1 -
440 06-21 Stop Overvaluing Multi-Agent Debate – We Must Rethink Evaluation and Embrace Model Heterogeneity Mehr-Agenten-Debatte stoppen – Wir müssen Bewertung neu denken und Modell Heterogenität umarmen 停止高估多机构辩论 – – 我们必须重新思考评价和拥抱模型多样性 2502.08788v3 -
441 06-21 How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs Wie numerische Präzision die Fähigkeit von LLMs zur Arithmetik beeinflusst 数字精确度如何影响LLM 的理理原因能力 2410.13857v2 -
442 06-21 Comba: Improving Bilinear RNNs with Closed-loop Control Comba: Bilineare RNNs mit Closed-Loop-Steuerung verbessern Comba: 改进有闭环控制的双线区域网网 2506.02475v3 -
443 06-21 Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation Step-Opt: Steigerung der Optimierungsmodellierung in LLMs durch iterative Datensynthese und strukturierte Validierung Step-Opt:通过迭代数据合成和结构化校验,提升LLMs中的优化建模 2506.17637v1 -
444 06-21 Self-Preference Bias in LLM-as-a-Judge Selbstpräferenz-Bias in LLM-as-a-Judge LLM-as-a-Judge中的自我偏好偏差 2410.21819v2 -
445 06-21 Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs Antwort-Centric oder Reasoning-Driven? Enthüllen des Latent Memory Ankers in LLMs 解答中心或理由驱动? 解开 LLMS 中隐藏的内存点锚 2506.17630v1 -
446 06-21 UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation UniMoT: Unified Molecule-Text Language Model mit diskreter Token-Darstellung UniMoT: 具有分立调制调制解析器表示式的统一分子文字语言模式 2408.00863v2 -
447 06-21 CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning CLiViS: Unleashing Kognitive Karte durch Linguistisch-Visuelle Synergie für eingedickte visuelle Vernunft CLiViS:通过视觉机能理性的语言-视觉协同法解析认知图 2506.17629v1 -
448 06-21 Dual Debiasing for Noisy In-Context Learning for Text Generation Dual Debiasing für lautes In-Context-Lernen für Textgenerierung 为产生文本进行有噪音的内文学习双向偏差 2506.00418v2 -
449 06-21 A Closer Look into Mixture-of-Experts in Large Language Models Ein genauerer Blick in Mixture-of-Experts in großen Sprachmodellen 更密切地研究大语言模型混合专家 2406.18219v3 -
450 06-21 Anthropocentric bias in language model evaluation Anthropozentrische Voreingenommenheit in der Sprachmodellbewertung 语言模式评价中的人文中心偏见 2407.03859v2 -
451 06-21 A Dual-Directional Context-Aware Test-Time Learning for Text Classification Ein Dual-Directional Context-Aware Test-Time Learning für die Textklassifikation 文本分类双调背景-软件-测试-时间学习 2503.15469v5 -
452 06-21 OpusLM: A Family of Open Unified Speech Language Models OpusLM: Eine Familie von offenen, einheitlichen Sprachmodellen OpusLM: “ 开放统一语言模式 “ 家庭 2506.17611v1 -
453 06-21 TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting TyphoFormer: Sprachgesteigerter Transformer für präzise Typhoon-Track-Prognose 台风前台风:用于准确预报台风轨道的语文增强变换器 2506.17609v1 -
454 06-21 Mind the Gap: Assessing Wiktionary’s Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages Mind the Gap: Bewertung von Wiktionarys massenhaft linguistischem Wissen über morphologische Lücken in zwei verwandten Sprachen 认识差距:评估维基多里人对两种相关语言的病理差距的密集语言知识 2506.17603v1 -
455 06-21 Steering LLMs for Formal Theorem Proving Lenkung LLMs für formale Theorem Proving 正式理论证明指导LLMs 2502.15507v4 -
456 06-21 Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models Cite Pretrain: Retrieval-freie Wissenszuweisung für große Sprachmodelle Cite Pretrain: 大语言模型的检索-无知识归属 2506.17585v1 -
457 06-21 AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition AgriCHN: Eine umfassende Cross-Domain-Ressource für die Anerkennung der chinesischen Landwirtschaftseinheit AgriCHN:中国农业命名实体认证综合跨部门资源 2506.17578v1 -
458 06-21 Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v3 -
459 06-21 SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning SRPO: Verbesserung der multimodalen LLM-Reasoning durch Reflection-Aware-Verstärkung SRPO: 通过反射-软件强化学习,加强多式LLM 2506.01713v2 -
460 06-21 LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning LLM-getriebene medizinische Report Generierung über kommunikationseffizientes Heterogenes Federated Learning LLM 驱动的通过通信效率高的异质联邦学习编写医学报告 2506.17562v1 -
461 06-21 ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation ParamMute: Unterdrückung wissenskritischer FFNs für treue Retrieval-Augmented Generation ParamMute:抑制知识关键FFNs,以实现忠实的检索增强生成 2502.15543v3 -
462 06-21 Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception Probing for Phonology in Self-Supervised Speech Representations: Eine Fallstudie zur Akzentwahrnehmung 自监督语音表示中的音系探测:关于口音感知的案例研究 2506.17542v1 -
463 06-21 DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning DuaShepherd: Integration von Schrittweiser Korrektheit und potenziellen Belohnungen für mathematische Vernunft DuaShepherd: 整合数学理由的逐步纠正和潜在奖励 2506.17533v1 -
464 06-21 Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning Datenqualitätsfragen in mehrsprachigen Sprachdatensätzen: Der Bedarf an soziolinguistischer Sensibilisierung und proaktiver Sprachplanung 多语言语言数据集的数据质量问题:社会语言意识和前瞻性语言规划的必要性 2506.17525v1 -
465 06-20 (5) Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards Agent-RLVR: Training Software Engineering Agents über Beratung und Umwelt Belohnungen Agent-RLVR: 通过指导和环境奖励培训软件工程代理 2506.11425v2 -
466 06-20 $L^*LM$: Learning Automata from Examples using Natural Language Oracles $L^*LM$: Automata lernen aus Beispielen mit natürlichen Sprach-Orakeln $L^*LM$:从使用自然语言甲骨文的例子中学习自动地图 2402.07051v2 -
467 06-20 VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM VeriLocc: End-to-End Cross-Architektur Register Zuordnung über LLM VeriLocc:通过LLM进行端至端跨建筑登记册分配 2506.17506v1 -
468 06-20 When can isotropy help adapt LLMs’ next word prediction to numerical domains? Wann kann Isotropie helfen, die nächste Wortvorhersage von LLMs an numerische Domänen anzupassen? 何时才能帮助LLMS的下一个字词预测适应数字域? 2505.17135v4 -
469 06-20 Language Models Grow Less Humanlike beyond Phase Transition Sprachenmodelle wachsen weniger menschlich jenseits des Phasenübergangs 超越阶段过渡后,语言模式人文化程度逐渐降低 2502.18802v2 -
470 06-20 Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems 理解大语言模型对书写和信息生态系统的影响的计算方法 2506.17467v1 -
471 06-20 Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages Überbrückung des Transkriptions-Bottlenecks: Feinabstimmungs-ASR-Modelle für extrem ressourcenarme Feldarbeitssprachen 打破剪裁瓶颈:极低资源外地工作语言的微调 ASR 模型 2506.17459v1 -
472 06-20 Directional Gradient Projection for Robust Fine-Tuning of Foundation Models Richtgradientenprojektion für robuste Feinsteuerung von Fundamentmodellen 基金会模型硬性精美调整方向梯度预测 2502.15895v2 -
473 06-20 FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering FRAMES-VQA: Benchmarking Fine-Tuning Robustheit über Multi-Modal Shifts in der visuellen Fragestellung FRAMES-VQA:确定视觉问题解答中多模式变化的精确调整强度基准 2505.21755v2 -
474 06-20 Beyond the Link: Assessing LLMs’ ability to Classify Political Content across Global Media Beyond the Link: Bewertung der Fähigkeit von LLMs, politische Inhalte in globalen Medien zu klassifizieren 超越链接:评估LLMs在全球媒体对政治内容进行分类的能力 2506.17435v1 -
475 06-20 UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making UProp: Untersuchung der Unsicherheitsausbreitung von LLMs in mehrstufiger agentischer Entscheidungsfindung UProp:调查多级制剂决策中LLMs的不确定性传播情况 2506.17419v1 -
476 06-20 Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study LLMs zur Bewertung von Tutorenbewegungen in Real-Life-Dialogen zu nutzen: Eine Machbarkeitsstudie 利用LLMs来评估现实生活对话中的导师动向:可行性研究 2506.17410v1 -
477 06-20 Exploring the Potential of Encoder-free Architectures in 3D LMMs Erforschung des Potenzials von encoderfreien Architekturen in 3D-LMMs 探索3DLMMs中无编码器建筑的潜力 2502.09620v3 -
478 06-20 AQA-Bench: An Interactive Benchmark for Evaluating LLMs’ Sequential Reasoning Ability AQA-Bench: Ein interaktiver Benchmark für die Bewertung der sequenziellen Begründungsfähigkeit von LLMs AQA-Bench:评估LLMs按顺序推理能力的互动基准 2402.09404v2 -
479 06-20 Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency Feintuning senkt die Sicherheit und beeinträchtigt die Bewertungskonsistenz 微调降低安全性并破坏评估一致性 2506.17209v1 -
480 06-20 Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems SWE-Bench排行榜拆解:LLM-和代理修理系统的分析提交者和结构 2506.17208v1 -
481 06-20 High-Dimensional Interlingual Representations of Large Language Models Hochdimensionale interlinguale Darstellungen großer Sprachmodelle 大语言模式的多种语文间高语言代表 2503.11280v4 -
482 06-20 Towards AI Search Paradigm Auf dem Weg zur KI-Suche Paradigma 走向 AI 搜索范式 2506.17188v1 -
483 06-20 CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models CLEAR-3K: Bewertung von Kausalerklärbarkeiten in Sprachmodellen CLEAR-3K:评估语文模式中的原因解释能力 2506.17180v1 -
484 06-20 TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models TALE: Ein Tool-Augmented Framework zur referenzfreien Bewertung großer Sprachmodelle TALE:无参考资料大语言模式评估工具增强框架 2504.07385v2 -
485 06-20 LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning LaRS: 研究连锁理据的后备理据技能 2312.04684v4 -
486 06-20 Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM Zuschauen und hören: Audio-Visual-Sprechmomente mit multimodaler LLM verstehen 观看和收听: 了解多式LM的视听话语道 2505.18110v2 -
487 06-20 Watermarking Language Models through Language Models Wasserzeichen von Sprachmodellen durch Sprachmodelle 通过语言模型建立语言模型 2411.05091v2 -
488 06-20 Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement Kalibrierung vortrainierter Sprachklassifikatoren auf LLM-generierten Noisy-Labels über iterative Veredelung 通过迭代精炼校准LLM产生的噪音标签上的训练前语言分类校准 2505.19675v2 -
489 06-20 Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? Cache mich, wenn Sie können: Wie viele KVs benötigen Sie für effektive Lang-Kontext LMs? 如果可以的话, 缓存我 : 您需要多少 KV 才能有效长文本 LM ? 2506.17121v1 -
490 06-20 MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation MEXA: Auf dem Weg zu einer allgemeinen multimodalen Vernunft mit dynamischer Multi-Expert-Aggregation MEXA:争取采用动态多专家聚合的通用多模式理由 2506.17113v1 -
491 06-20 Are Bias Evaluation Methods Biased ? Sind Bias Evaluation Methoden Biased ? 是否对评估方法有偏见? 2506.17111v1 -
492 06-20 Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving Auf dem Weg zu einer fortgeschrittenen mathematischen Begründung für LLMs über Logic Theorem-Proving erster Ordnung 争取通过一阶逻辑理论验证,为LLMs提供高级数学理由 2506.17104v1 -
493 06-20 Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung 引导大语言模型中传译锥体:经验评价 2506.17088v1 -
494 06-20 ScholarSearch: Benchmarking Scholar Searching Ability of LLMs ScholarSearch: Benchmarking Scholar Suche Fähigkeit von LLMs 搜索学者:参照基准,学者搜索LLMs的能力 2506.13784v2 -
495 06-20 Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs Tower+: Überbrückung Allgemeinheit und Übersetzung Spezialisierung in mehrsprachigen LLMs 塔+:在多语种LMM中连接通俗和翻译专业 2506.17080v1 -
496 06-20 Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 Simultane Übersetzung mit Offline-Speech- und LLM-Modellen in CUNI-Einreichung bei IWSLT 2025 CUNI提交IWSLT 2025:使用离线语音和LLM模型的同声传译 2506.17077v1 -
497 06-20 Contextual modulation of language comprehension in a dynamic neural model of lexical meaning Kontextuelle Modulation des Sprachverständnisses in einem dynamischen neuronalen Modell der lexikalischen Bedeutung 以动态的神经模式对语言理解进行上下文调整 2407.14701v2 -
498 06-20 From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers Von Konzepten zu Komponenten: Konzept-agnostische Aufmerksamkeit Modul Entdeckung in Transformatoren 从概念到组成部分:在变异器中发现概念 – – 不可接受注意模块 2506.17052v1 -
499 06-20 Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models Geopolitische Voreingenommenheiten in LLMs: Was sind die “guten” und die “schlechten” Länder nach zeitgenössischen Sprachmodellen LLMM中的地缘政治偏见:根据当代语言模式,什么是“好”和“坏”国家? 2506.06751v2 -
500 06-20 MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models MUCAR: Benchmarking der multilingualen Cross-Modal-Ambiguitätsauflösung für multimodale große Sprachmodelle MUCAR:为多模态大语言模型的多语言跨模态歧义消解制定基准 2506.17046v1 -
501 06-20 COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework COS-DPO: Bedingtes eins-shot Multi-Objective Fine-Tuning Framework COS-DPO: 有条件的单片多目标微调框架 2410.08316v3 -
502 06-20 Principles of semantic and functional efficiency in grammatical patterning Grundsätze der semantischen und funktionalen Effizienz bei der grammatischen Musterung 语义和功能效率原则 2410.15865v2 -
503 06-20 Cash or Comfort? How LLMs Value Your Inconvenience Bargeld oder Komfort? Wie LLMs Ihre Unannehmlichkeiten bewerten 现金还是舒适?LLMs如何评估您的不便 2506.17367v1 -
504 06-20 Incivility and Rigidity: The Risks of Fine-Tuning LLMs for Political Argumentation Unhöflichkeit und Starrheit: Die Risiken des Feintunings von LLMs für politische Argumentation 不文明和僵硬:为政治论辩微调LLMs的风险 2411.16813v3 -
505 06-20 ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization ErsetzenMe: Netzwerkvereinfachung durch Tiefenkorrektur und Transformer Block Linearisierung 替换Me:通过深度推进和变换器块条线化简化网络 2505.02819v3 -
506 06-20 Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning Instituto de Telecomunicações auf der IWSLT 2025: Ausrichtung von Sprach- und Sprachmodellen für das Sprach-zu-Text-Lernen IWSLT 2025年国际电信电信研究所:调和小型话语和语音到文字学习语言模式 2506.17019v1 -
507 06-20 Can Large Language Models Replace Human Subjects? A Large-Scale Replication of Scenario-Based Experiments in Psychology and Management Können große Sprachmodelle menschliche Subjekte ersetzen? Eine groß angelegte Replikation von szenariobasierten Experimenten in Psychologie und Management 大语言模型能够取代人类课题吗?在心理学和管理中大规模重复基于设想的实验 2409.00128v3 -
508 06-20 LLM-Generated Feedback Supports Learning If Learners Choose to Use It LLM-generated Feedback unterstützt Lernen, wenn Lernende wählen, es zu verwenden 如果学习者选择使用LLM-创用LLM反馈支持学习 2506.17006v1 -
509 06-20 PersonalAI: Towards digital twins in the graph form PersonalAI: Auf dem Weg zu digitalen Zwillingen in der Grafikform 个人AAI:走向图示形式的数字双胞胎 2506.17001v1 -
510 06-20 Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling Think&Cite: Verbesserung der zugeschriebenen Textgenerierung durch selbst geführte Baumsuche und Fortschrittsprämienmodellierung Think&Cite: 改进自导树搜索和进步奖励模型的属性文本生成 2412.14860v2 -
511 06-20 TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs TeXpert: Ein Multi-Level-Benchmark zur Bewertung der LaTeX-Code-Generierung durch LLMs TeXpert:由LLMs评估LaTeX代码生成的多层次基准 2506.16990v1 -
512 06-20 SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments SHAKTI: Ein 2,5 Milliarden Parameter kleines Sprachmodell optimiert für Edge-KI und Low-Resource-Umgebungen SHAKTI:为边缘AI和低资源环境优化的25亿参数小语言模型 2410.11331v2 -
513 06-20 Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation Knapsack Optimization-based Schema Linking für die LLM-basierte Text-zu-SQL-Generierung Knapsack 基于LLM的基于LLM的文本到SQL生成的基于优化的气相连接 2502.12911v2 -
514 06-20 Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond Sprachengpässe-Modelle: Ein Rahmen für interpretierbares Wissen auf Tracing und darüber hinaus 语言瓶颈模式:可解释知识追踪框架及以后 2506.16982v1 -
515 06-20 Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework Polysemantik mit PRISM erfassen: Ein Multi-Konzept-Feature Beschreibung Framework 利用PRISM获得多边性能:多概念特征描述框架 2506.15538v2 -
516 06-20 Latent Concept Disentanglement in Transformer-based Language Models Latent Concept Disentanglement in Transformer-basierten Sprachmodellen 以变换器为基础的语言模型中的边端概念分解 2506.16975v1 -
517 06-20 PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval PromptDSI: Prompt-basiert Probefrei Instance-wise Incremental Learning for Document Retrieval 快速DSI:为文件检索进行基于即时的无排练-不重复式递增学习 2406.12593v3 -
518 06-20 Coreference as an indicator of context scope in multimodal narrative Koreferenz als Indikator für den Kontextumfang in multimodaler Erzählung 共同参照作为多式联运说明中背景范围的一项指标 2503.05298v2 -
519 06-20 LogProber: Disentangling confidence from contamination in LLM responses LogProber: Entwirren von Vertrauen und Kontamination in LLM-Antworten LogProber:区分LLM回答中的置信度与数据污染 2408.14352v3 -
520 06-20 Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs Schritt-für-Schritt-Verbesserung und überprüfbare medizinische Vernunft in MLLMs 加强微低LLMs的逐步和可核实医疗理由 2506.16962v1 -
521 06-20 From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts Von Daten zu Wissen: Bewertung, wie effizient Sprachmodelle Fakten lernen 从数据到知识:评价如何高效语言模式学习事实 2506.16912v1 -
522 06-20 On Almost Surely Safe Alignment of Large Language Models at Inference-Time Zur fast sicher sicheren Ausrichtung großer Sprachmodelle bei Inferenz-Time 在推断时几乎可以安全地统一大语言模型 2502.01208v3 -
523 06-20 Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models Dynamische Wissensintegration für evidenzgetriebene Counter-Argument-Generation mit großen Sprachmodellen 利用大语言模型进行证据驱动的反论点生成的动态知识整合 2503.05328v2 -
524 06-20 Deep Learning based Visually Rich Document Content Understanding: A Survey Deep Learning based Visually Rich Document Content Understanding: Eine Umfrage 基于深层学习的视觉丰富文件内容理解:调查 2408.01287v2 -
525 06-20 Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation Anpassung beim Lernen: LLMs für wissenschaftliche Probleme mit intelligenter Werkzeugverwendung anpassen 边学习边适应:通过智能工具使用适应让LLMs扎根于科学问题 2411.00412v4 -
526 06-20 More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models Mehr denken, weniger sehen? Bewertung verstärkter Halluzinationen in multimodalen Vernunftmodellen 更多思考,少见? 评估多模式理由模型中放大的幻觉 2505.21523v3 -
527 06-20 Cost-effective Instruction Learning for Pathology Vision and Language Analysis Kostengünstiges Instruktionslernen für Pathologie Vision und Sprachanalyse 具有成本效益的病理学愿景和语言分析教学 2407.17734v2 -
528 06-20 Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs’ Multi-turn Instruction-Following Ability Fragen, Scheitern, Wiederholen: Meeseeks, ein iterativer Feedback-Benchmark für die Multiturn-Instruction-Following-Fähigkeit von LLMs 问、失败、重复:Meeseeks,LLMs多轮指令遵循能力的迭代反馈基准 2504.21625v4 -
529 06-20 MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning MIST: Jailbreaking Black-Box Large Language Models über iterative Semantic Tuning MIST: 通过迭代性语义图纸的黑盒黑盒大语言模型 2506.16792v1 -
530 06-20 Reimagining Urban Science: Scaling Causal Inference with Large Language Models Reimagining Urban Science: Skalierung von Kausalität mit großen Sprachmodellen 重新想象城市科学:与大语言模型的大规模因果推断 2504.12345v3 -
531 06-20 Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v2 -
532 06-20 DistillNote: LLM-based clinical note summaries improve heart failure diagnosis DistillNote: Zusammenfassungen auf LLM-Basis verbessern die Diagnose der Herzinsuffizienz 蒸馏注:基于LLM的临床说明摘要改善心脏病诊断 2506.16777v1 -
533 06-20 SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation SSR-Zero: Einfaches Selbstveredelungslernen für maschinelle Übersetzung 机械翻译简单自评强化学习 2505.16637v3 -
534 06-20 Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models Cross-Modal Obfuskation für Jailbreak Attacken auf große Vision-Sprache Modelle 对大型视觉-语言模型进行越狱袭击的跨模式阻断 2506.16760v1 -
535 06-20 SocialSim: Towards Socialized Simulation of Emotional Support Conversation SocialSim: Auf dem Weg zu einer sozialisierten Simulation emotionaler Unterstützungsgespräche 社会观:社会化模拟情感支持对话 2506.16756v1 -
536 06-20 Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly Sprachinformierte Synthese von Rational Agent-Modellen für geerdete Theorie-von-Mind-Gründung On-The-Fly 基于语言信息的理性智能体模型即时合成,用于有依据的心智理论推理 2506.16755v1 -
537 06-20 A Structured Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy Ein strukturierter Datensatz von Krankheit-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit 改善诊断准确性疾病 – – 症状协会结构数据集 2506.13610v2 -
538 06-20 Group-Level Data Selection for Efficient Pretraining Gruppen-Level-Datenauswahl für effizientes Vortraining 高效预科培训的集团一级数据选择 2502.14709v2 -
539 06-20 LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization LM-SPT: LM-basierte semantische Destillation für Sprachtokenisierung LM-SPT: LM 统一语法的语义蒸馏 2506.16738v1 -
540 06-20 The Role of Model Confidence on Bias Effects in Measured Uncertainties Die Rolle des Modellvertrauens auf Bias-Effekte bei gemessenen Unsicherheiten 模型置信度对测量不确定性中偏差效应的作用 2506.16724v1 -
541 06-20 ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models ReasonGRM: Generative Reward-Modelle durch große Reasoning-Modelle verbessern 理由GRM:通过大理由模型加强奖励奖励模式 2506.16712v1 -
542 06-20 Techniques for supercharging academic writing with generative AI Techniken zur Aufladung akademischer Schriften mit generativer KI 利用生成式AI增强学术写作的技术 2310.17143v4 -
543 06-20 MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension MaPPER: Multimodaler vorgeführter Parameter Effizientes Tuning für die Referenzierung von Expression-Verständnis MaPPER: 参考表达式理解的多式前制导参数效率计图 2409.13609v4 -
544 06-20 Large Language Models as Psychological Simulators: A Methodological Guide Große Sprachmodelle als Psychologische Simulatoren: Ein methodischer Leitfaden 《作为心理模拟器的大型语言模式:方法指南》 2506.16702v1 -
545 06-20 GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation GraphRAG-Bench: Herausfordernde Domain-spezifische Begründung für die Auswertung der Graph Retrieval-Augmented Generation 图图RAG-Bench:评估图回收-提款一代的有挑战性域特定原因 2506.02404v3 -
546 06-20 From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology Von Prompts zu Constructs: Ein Dual-Validity-Rahmenwerk für LLM-Forschung in der Psychologie 从提示到构造:心理学法学硕士研究的双重价值框架 2506.16697v1 -
547 06-20 LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions LLMs in der Krankheitsdiagnose: Eine vergleichende Studie von DeepSeek-R1 und O3 Mini Across Chronic Health Conditions 疾病诊断中的LLMs:在整个慢性健康状况中深海Seek-R1和O3 Mini的比较研究 2503.10486v2 -
548 06-20 Theoretical Guarantees for Minimum Bayes Risk Decoding Theoretische Garantien für Minimum-Bayes-Risk-Dekodierung 最小贝叶斯风险解码的理论保证 2502.12685v3 -
549 06-20 LegiGPT: Party Politics and Transport Policy with Large Language Model LegiGPT: Parteipolitik und Verkehrspolitik mit großem Sprachmodell LegiGPT:具有大语言模式的党政治和交通政策 2506.16692v1 -
550 06-20 Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v2 -
551 06-20 Towards Safety Evaluations of Theory of Mind in Large Language Models Zu Sicherheitsbewertungen der Geistestheorie in großen Sprachmodellen 争取对大语言模式中思想理论进行安全评价 2506.17352v1 -
552 06-20 Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations Mechanismen vs. Ergebnisse: Testen für Syntax kann die Leistung bei gezielten syntaktischen Bewertungen nicht erklären 机制与结果:检验语法无法解释定向同步评估的绩效 2506.16678v1 -
553 06-20 Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning Med-U1: Förderung der einheitlichen medizinischen Vernunft in LLMs durch großangelegtes Verstärkungslernen Med-U1:通过大规模加强学习在LLMs中鼓励统一医疗理由 2506.12307v2 -
554 06-20 Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM Zero-Shot Kognitive Impairment Erkennung von Sprache mit AudioLLM 从使用音频LLM的演讲中检测出零热感知损伤 2506.17351v1 -
555 06-20 Kinetics: Rethinking Test-Time Scaling Laws Kinetik: Überdenken von Test-Zeit-Skalierungsgesetzen 动因:重新思考试验时间扩增法 2506.05333v3 -
556 06-20 Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models Adaptive Anleitung beschleunigt die Stärkung des Lernens von Vernunftmodellen 适应性指导加速加速强化理性模型学习 2506.13923v2 -
557 06-19 (4) Arch-Router: Aligning LLM Routing with Human Preferences Arch-Router: LLM-Routing mit menschlichen Präferenzen ausrichten Arch- Router: 与人类首选对齐 LLM Routing 2506.16655v1 -
558 06-19 Learning to Route LLMs with Confidence Tokens Lernen, LLMs mit vertrauensvollen Token zu routen 学习使用充满信心的LLMs路线 2410.13284v3 -
559 06-19 Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models Layer-Wise Alignment: Prüfung der Sicherheitsausrichtung über Bild-Encoder-Ebenen in Vision-Sprachenmodellen 图层对齐: 检查视觉语言模型中图像编码图层的安全对齐情况 2411.04291v2 -
560 06-19 GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View GeoGuess: Multimodale Begründung auf Basis der Hierarchie visueller Informationen in der Straßenansicht GeoGuess:基于街景视觉信息等级的多式联运理由 2506.16633v1 -
561 06-19 Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System Erste Untersuchung der LLM-Assistenten Entwicklung eines regelbasierten klinischen NLP-Systems 利用LLM协助开发有章可循的临床NLP系统的初步调查 2506.16628v1 -
562 06-19 Voice of a Continent: Mapping Africa’s Speech Technology Frontier Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v2 -
563 06-19 Modeling Public Perceptions of Science in Media Modellierung öffentlicher Wahrnehmungen von Wissenschaft in Medien 模拟公众对媒体科学的看法 2506.16622v1 -
564 06-19 From RAG to Memory: Non-Parametric Continual Learning for Large Language Models Vom RAG zum Speicher: Nicht parametrisches kontinuierliches Lernen für große Sprachmodelle 从RAG到内存:为大语言模型进行非计量连续学习 2502.14802v2 -
565 06-19 A Survey of Automatic Hallucination Evaluation on Natural Language Generation Eine Übersicht über die automatische Halluzination der natürlichen Sprachgenerierung 自然语言生成自动幻觉评价调查 2404.12041v3 -
566 06-19 Learning to Refine with Fine-Grained Natural Language Feedback Lernen, mit feinkörnigen natürlichen Sprachfeedback zu verfeinern 学习精细自然语言反馈 2407.02397v3 -
567 06-19 Using Natural Language Explanations to Rescale Human Judgments Natürliche Spracherklärungen verwenden, um menschliche Urteile neu zu skalieren 使用自然语言解释来调整人类判决书的规模 2305.14770v6 -
568 06-19 A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications Eine Bewertung der synthetischen Datengenerierung für biomedizinische Forschung und Anwendungen 生物医学研究和应用合成数据生成范围审查 2506.16594v1 -
569 06-19 Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework Messung eines (ausreichenden) Weltmodells in LLMs: Ein Rahmen für die Abweichungszersetzung 计量(足够)LLMM世界模型:差异分解框架 2506.16584v1 -
570 06-19 A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning A Impliziert B: Schaltungsanalyse in LLMs für propositionelle logische Vernunft A Implies B: 用于推定逻辑理由的LLMLM的电路分析 2411.04105v4 -
571 06-19 Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement Streaming nicht-autoregressives Modell für Akzentumwandlung und Ausspracheverbesserung 用于口音转换和发音改进的流式非自回归模型 2506.16580v1 -
572 06-19 Advancing Harmful Content Detection in Organizational Research: Integrating Large Language Models with Elo Rating System Förderung schädlicher Inhaltserkennung in der Organisationsforschung: Integration großer Sprachmodelle mit Elo-Bewertungssystem 在组织研究中推动有害内容的探测:将大语言模型与Elo评分系统相结合 2506.16575v1 -
573 06-19 Weight Factorization and Centralization for Continual Learning in Speech Recognition Gewichtsfaktorisierung und Zentralisierung für kontinuierliches Lernen in der Spracherkennung 语音识别中持续学习的加权因素化和集中化 2506.16574v1 -
574 06-19 Capturing Visualization Design Rationale Capturing Visualization Design Rationale 模拟可视化设计 2506.16571v1 -
575 06-19 MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation MultiFinBen: Ein multilingualer, multimodaler und problemorientierter Benchmark für die finanzielle LLM-Bewertung MultiFinBen: 财务LLM评价的多种语言、多种模式和困难软件基准 2506.14028v2 -
576 06-19 AutoPresent: Designing Structured Visuals from Scratch AutoPresent: Designing Structured Visuals from Scratch 自动提交: 设计来自 Scratch 的结构化视觉 2501.00912v2 -
577 06-19 Automatic Speech Recognition Biases in Newcastle English: an Error Analysis Automatische Spracherkennung in Newcastle English: eine Fehleranalyse Newcastle英语的自动语音识别比数:错误分析 2506.16558v1 -
578 06-19 Revela: Dense Retriever Learning via Language Modeling Revela: Dense Retriever Lernen über Sprachmodellierung Revela:通过语言建模进行密集检索学习 2506.16552v1 -
579 06-19 Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements Feintuning große Audio-Sprachen-Modelle mit LoRA für die präzise zeitliche Lokalisierung von langanhaltenden Expositionstherapieelementen 与LORA一道精细设计大型音频语言模型,用于长期接触治疗元素的精确时间定位 2506.09707v2 -
580 06-19 Essential-Web v1.0: 24T tokens of organized web data Essential-Web v1.0: 24T Token organisierter Web-Daten Essential-Web v1.0:24T词元的有组织网络数据 2506.14111v2 -
581 06-19 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models xGen-MM (BLIP-3): Eine Familie offener großer multimodaler Modelle xGen-MM(BLIP-3): “ 开放型大型多式联运模型大家庭 “ 2408.08872v3 -
582 06-19 Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples Relic: Verallgemeinerung des Prämienmodells für Low-Resource-Indic-Sprachen mit wenigen scharfen Beispielen Relic:加强低资源印度语言的奖赏示范性概括化,只有很少的热实例 2506.16502v1 -
583 06-19 SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development SWE-Dev: Bewertung und Schulung autonomer Feature-getriebener Software-Entwicklung SWE-Dev: 评估和培训自主开发地物-驱动软件开发 2505.16975v2 -
584 06-19 QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation QG-SMS: Verbesserung der Testobjektanalyse durch Studentenmodellierung und Simulation QG-SMS:通过学生建模和模拟加强测试物品分析 2503.05888v2 -
585 06-19 Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection Auf dem Weg zu allgemeingültigen allgemeinen schädlichen Sprachdatensätzen für Implizite Hass-Spracherkennung 争取建立通用通用通用有害言论数据集,用于隐含仇恨言论探测 2506.16476v1 -
586 06-19 Do We Talk to Robots Like Therapists, and Do They Respond Accordingly? Language Alignment in AI Emotional Support Sprechen wir mit Robotern wie mit Therapeuten, und reagieren sie entsprechend? Sprachliche Angleichung in der emotionalen Unterstützung durch KI 我们是否像对治疗师一样与机器人交谈,它们是否做出相应的回应?AI情感支持中的语言对齐 2506.16473v1 -
587 06-19 Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models Sonde vor Ihnen sprechen: Auf dem Weg zur Black-Box Verteidigung gegen Hintertür Unausrichtung für große Sprachmodelle 在你发言前的探贝之前,你先谈:争取防止大语言模型的后门不匹配的黑箱防御 2506.16447v1 -
588 06-19 StoryWriter: A Multi-Agent Framework for Long Story Generation StoryWriter: Ein Multi-Agenten-Framework für lange Story-Generationen 故事文字:长代代多方行为者框架 2506.16445v1 -
589 06-19 REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing REIS: Ein leistungsstarkes und energieeffizientes Retrieval-System mit In-Storage-Verarbeitung REIS:具有在系统内处理的高效能和节能检索系统 2506.16444v1 -
590 06-19 Quantifying artificial intelligence through algorithmic generalization Quantifizierung künstlicher Intelligenz durch algorithmische Verallgemeinerung 通过算法一般化对人工智能进行量化 2411.05943v2 -
591 06-19 ALTA: Compiler-Based Analysis of Transformers ALTA: Compiler-basierte Analyse von Transformatoren ALTA:以汇编者为基础对变形器的分析 2410.18077v2 -
592 06-19 Unpacking Generative AI in Education: Computational Modeling of Teacher and Student Perspectives in Social Media Discourse Entpacken generativer KI in der Bildung: Computational Modeling von Lehrer- und Studentenperspektiven im Social Media Diskurs 《教育:在社会媒体讨论中教师和学生观点的计算模型》 2506.16412v1 -
593 06-19 When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework Wann funktioniert Trennen und Erobern für den langen Kontext LLM? Ein Lärmzersetzungsrahmen 何时分化和征服工作为长期LLM服务? 噪音分解框架 2506.16411v1 -
594 06-19 On Path to Multimodal Historical Reasoning: HistBench and HistAgent Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v3 -
595 06-19 IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks IS-Bench: Bewertung der interaktiven Sicherheit von VLM-getriebenen Körpermitteln bei täglichen Haushaltsaufgaben IS-Bench:评估每日家务任务中VLM-Driven 充装代理人的互动安全 2506.16402v1 -
596 06-19 NepaliGPT: A Generative Language Model for the Nepali Language NepaliGPT: Ein generatives Sprachmodell für die nepalesische Sprache 尼泊尔语:尼泊尔语创作语言模式 2506.16399v1 -
597 06-19 OJBench: A Competition Level Code Benchmark For Large Language Models OJBench: Ein Benchmark für Wettbewerbsebenencodes für große Sprachmodelle OJBench:大语言模式竞争法基准 2506.16395v1 -
598 06-19 From LLM-annotator to LLM-orchestrator: Coordinating Small Models for Data Labeling Von LLM-Annotator zu LLM-Orchestrator: Koordinieren kleiner Modelle für die Datenkennzeichnung 从LLM标注器到LLM协调器:协调小型模型进行数据标注 2506.16393v1 -
599 06-19 RiOT: Efficient Prompt Refinement with Residual Optimization Tree RiOT: Effiziente Prompt-Verfeinerung mit Residual Optimization Tree RiOT: 高效快速提炼剩余优化树 2506.16389v1 -
600 06-19 Large Language Models in Argument Mining: A Survey Große Sprachmodelle im Argumentbergbau: Eine Umfrage 争议采矿大语言模型:调查 2506.16383v1 -
601 06-19 InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems InstructTTSEval: Benchmarking komplexe natursprachliche Anleitung im Anschluss an Text-zu-Sprach-Systeme 指令TTSEval: 以文字到语音系统为基准的复杂自然语言教学 2506.16381v1 -
602 06-19 Can structural correspondences ground real world representational content in Large Language Models? Können Strukturkorrespondenzen reale Repräsentationsinhalte in großen Sprachmodellen begründen? 结构通信能否在大语言模型中建立真实的世界代表性内容? 2506.16370v1 -
603 06-19 DISCIE – Discriminative Closed Information Extraction DISCIE – Diskriminative Closed Information Extraction DISCIE - 质疑性封闭信息提取 2506.16348v1 -
604 06-19 Analyzing the Influence of Knowledge Graph Information on Relation Extraction Analyse des Einflusses von Wissensgrapheninformationen auf die Beziehungsextraktion 分析知识图表信息对采掘关系的影响 2506.16343v1 -
605 06-19 Generalizability of Media Frames: Corpus creation and analysis across countries Generalisierbarkeit von Medienrahmen: Corpus-Erstellung und -Analyse über Länder hinweg 媒体框架的通用性:公司创建和各国的分析 2506.16337v1 -
606 06-19 Explainable Rule Application via Structured Prompting: A Neural-Symbolic Approach Erklärbare Regel-Anwendung über strukturierte Prompting: Ein neural-symbolischer Ansatz 通过结构化推动可解释的规则应用:神经-循环方法 2506.16335v1 -
607 06-19 PL-Guard: Benchmarking Language Model Safety for Polish PL-Guard: Benchmarking Sprachmodellsicherheit für Polnisch PL-Guard:波兰语言安全模式基准 2506.16322v1 -
608 06-19 AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation AlignDistil: Token-Level-Sprachmodell Alignment als Adaptive Policy Destillation Aligndistil: 作为适应性政策蒸馏的调整级语言模式模型对齐 2503.02832v2 -
609 06-19 LLM-Guided Indoor Navigation with Multimodal Map Understanding LLM-geführte Indoor-Navigation mit multimodalem Kartenverständnis 具有多式地图理解的LLM-引导式室内导航 2503.11702v4 -
610 06-19 Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information Verbesserung der automatisierten Sprechbewertung durch Nutzung facettenreicher Relevanz- und Grammatikinformationen 利用多方面相关性和语法信息推进自动化口语评估 2506.16285v1 -
611 06-19 Uncertainty Quantification in Retrieval Augmented Question Answering Unsicherheit Quantifizierung in Retrieval Augmented Question Answering 检索增强回答问题时的不确定性定量 2502.18108v2 -
612 06-19 End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data End-to-End-Sprachübersetzung für Low-Resource-Sprachen mit schwach beschrifteten Daten 使用微弱标签数据翻译低资源语言端对端语音 2506.16251v1 -
613 06-19 Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports Vergleichende Analyse abstraktiver Zusammenfassungsmodelle für klinische Radiologieberichte 临床放射学报告抽象式摘要模型的比较分析 2506.16247v1 -
614 06-19 BEADs: Bias Evaluation Across Domains BEADs: Bias-Evaluierung über Domains hinweg BEADs: 跨领域偏见评价 2406.04220v5 -
615 06-19 Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts Multi-Preference-Optimierung: Verallgemeinern von DPO über Set-Level-Kontrast 多优先优化:通过定点对比度普及残疾人组织 2412.04628v4 -
616 06-19 Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models Bewertung und Abmilderung medizinischer Kenntnisse Drift und Konflikte in großen Sprachmodellen 评估和减少大语言模式中的医学知识疏漏和冲突 2505.07968v2 -
617 06-19 Learning Dynamics in Continual Pre-Training for Large Language Models Dynamisches Lernen im kontinuierlichen Pre-Training für große Sprachmodelle 大语言模式持续培训前培训中的学习动态 2505.07796v2 -
618 06-19 Web(er) of Hate: A Survey on How Hate Speech Is Typed Web(er) of Hate: Eine Umfrage über die Art und Weise, wie Hate Speech eingegeben wird Web(er) “ 仇恨:关于仇恨言论如何打字的调查 “ 2506.16190v1 -
619 06-19 JETHICS: Japanese Ethics Understanding Evaluation Dataset JETHICS: Japanische Ethik verstehen Evaluierungsdatensatz JETHICS:日本道德理解评价数据集 2506.16187v1 -
620 06-19 AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation AUTOLAW: Verbesserung der rechtlichen Compliance in großen Sprachmodellen durch Fallrechtgenerierung und Jury-inspirierte Beratung AUTOLAW:通过判例法的产生和陪审团的启发审议,加强大语言模式中的法律合规性 2505.14015v2 -
621 06-19 SGIC: A Self-Guided Iterative Calibration Framework for RAG SGIC: Ein selbstgesteuerter iterativer Kalibrierrahmen für RAG SGIC: RAG自行指导的迭代校准框架 2506.16172v1 -
622 06-19 DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products DeltaProduct: Verbesserung des State-Tracking in linearen RNNs über Householder-Produkte DeltaProduct:通过Householder乘积改进线性RNN的状态跟踪 2502.10297v6 -
623 06-19 Under the Shadow of Babel: How Language Shapes Reasoning in LLMs Unter dem Schatten von Babel: Wie sich Sprache in LLMs begründet Babel的阴影之下:LLMM中语言形状如何解释 2506.16151v1 -
624 06-19 PRISON: Unmasking the Criminal Potential of Large Language Models PRISON: Entlarvung des kriminellen Potenzials großer Sprachmodelle PRISON:揭露大语言模型的犯罪潜力 2506.16150v1 -
625 06-19 Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems Beyond Self-Talk: Eine kommunikationszentrische Untersuchung von LLM-basierten Multiagentensystemen 超越自言自语:以LLM为基础的多种机构系统的通信中心调查 2502.14321v2 -
626 06-19 GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning GRPO-CARE: Konsistenzbewusstes Verstärkungslernen für multimodales Schlussfolgern GRPO-CARE:面向多模态推理的一致性感知强化学习 2506.16141v1 -
627 06-19 Batayan: A Filipino NLP benchmark for evaluating Large Language Models Batayan: Ein philippinischer NLP-Benchmark für die Bewertung großer Sprachmodelle Batayan:菲律宾国家语言方案评估大语言模型基准 2502.14911v2 -
628 06-19 FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning FinCoT:以专家金融推理为基础的思维链 2506.16123v1 -
629 06-19 DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents DrunkAgent: Stealthy Memory Korruption in LLM-Powered Recommender Agents DrunkAgent:LLM授权建议代理人的隐性记忆腐败 2503.23804v2 -
630 06-19 On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse Über die Grenzen der Sprachgenerierung: Trade-Offs zwischen Halluzination und Modekollaps 语言产生限制:幻觉与模式崩溃之间的取舍 2411.09642v2 -
631 06-19 CIVET: Systematic Evaluation of Understanding in VLMs CIVET: Systematische Bewertung des Verständnisses in VLMs CIVET:对VLM理解能力的系统评估 2506.05146v2 -
632 06-19 Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning Rethinking External Slow-Thinking: Von Schneeballfehlern zur Wahrscheinlichkeit einer korrekten Begründung 重新思考外部缓慢思考:从雪球错误到正确理由的概率 2501.15602v3 -
633 06-19 Probing the Robustness of Large Language Models Safety to Latent Perturbations Untersuchung der Robustheit der Sicherheit großer Sprachmodelle gegenüber latenten Störungen 探究大语言模型安全性对潜在扰动的鲁棒性 2506.16078v1 -
634 06-19 Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content Täuschender Humor: Ein synthetischer Mehrsprachiger Benchmark-Datensatz zur Überbrückung von fabrizierten Claims mit humorvollem Inhalt 欺骗性幽默:一个合成多语种基准数据集,用于将制造索赔与幽默内容连接起来 2503.16031v3 -
635 06-19 Cyberbullying Detection in Hinglish Text Using MURIL and Explainable AI Cyberbullying-Erkennung in Hinglish-Texten mit MURIL und erklärbarer KI 使用 MURIL 和可解释的 AI 的 Hinglish 文本中的网络欺凌探测 2506.16066v1 -
636 06-19 Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning Selbstkritische Kuriositätsverfeinerung: Ehrlichkeit und Hilfsbereitschaft in großen Sprachmodellen durch In-Context Learning verbessern 自我批评、指导的精炼好奇力改进:通过内文学习加强大语言模式的诚实性和帮助性 2506.16064v1 -
637 06-19 BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v3 -
638 06-19 Knee-Deep in C-RASP: A Transformer Depth Hierarchy Knie-Tief in C-RASP: Eine Transformer-Tiefe Hierarchie C-RASP:变异深度分层 2506.16055v1 -
639 06-19 A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text Ein hybrides DeBERTa- und Gated-Breites Lernsystem für Cyberbullying Detection im englischen Text 混合德贝塔和通用广学系统,用于网络欺凌探测的英文文本 2506.16052v1 -
640 06-19 Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping Leiter-Residual: Parallelismus-bewusste Architektur zur Beschleunigung großer Modellinferenz mit Kommunikationsüberlappung 云梯-残余:加速大型模型推断与通信重叠的平行意识结构 2501.06589v5 -
641 06-19 DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling DynScaling: Effizientes Verifier-freies Inferenzscaling über dynamische und integrierte Sampling DynScaling:通过动态和综合采样实现高效的无验证器推理扩展 2506.16043v1 -
642 06-19 Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3 Verbesserung der Dokumenten-Fragebeantwortung mittels Multi-Hop Retrieval-Augmented Generation mit LLaMA 3 通过多层检索-提法一代加强文件层面的回答问题,LLAMA 3 2506.16037v1 -
643 06-19 Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives Re-TASK: LLM-Aufgaben aus Capability, Skill und Knowledge Perspectives überarbeiten 重新研究TASK:从能力、技能和知识角度重新研究LLM任务 2408.06904v3 -
644 06-19 EvoLM: In Search of Lost Language Model Training Dynamics EvoLM: Auf der Suche nach verlorenen Sprachmodellen EvoLM: 寻找失传语言培训模式 2506.16029v1 -
645 06-19 From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation Von General zu Targeted Rewards: Übertreffen von GPT-4 in Open-Ended Long-Context Generation 从一般奖励到有目标的奖励:在不限期长长一代中取代GPT-4 2506.16024v1 -
646 06-19 Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models Min-p, Max Übertreibung: Eine kritische Analyse der Min-p-Sampling in Sprachmodellen Min-p, Max Explation: 对语言模型的 Min-p 抽样的批判性分析 2506.13681v2 -
647 06-19 Detecting Prefix Bias in LLM-based Reward Models Präfix Bias in LLM-basierten Prämienmodellen erkennen 在以LLM为基础的奖励模型中检测前功ix Bias 2505.13487v2 -
648 06-19 Bayesian Epistemology with Weighted Authority: A Formal Architecture for Truth-Promoting Autonomous Scientific Reasoning Bayesische Epistemologie mit gewichteter Autorität: Eine formale Architektur für wahrheitsfördernde autonome wissenschaftliche Vernunft 带加权权威的贝叶斯认识论:促进真理的自主科学推理的形式化架构 2506.16015v1 -
649 06-19 Beyond Prediction – Structuring Epistemic Integrity in Artificial Reasoning Systems Jenseits der Vorhersage – Strukturierung epistemischer Integrität in künstlichen Vernunftsystemen 超越预测 – – 构建人造理由说明系统中的宇宙完整性 2506.17331v1 -
650 06-19 Large-Scale Data Selection for Instruction Tuning Großformatige Datenauswahl für die Instruction Tuning 用于教学图示的大型数据选择 2503.01807v2 -
651 06-19 SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes SDE-SQL: Verbesserung der Text-zu-SQL-Generierung in großen Sprachmodellen durch selbstgesteuerte Exploration mit SQL-Probes SDE-SQL:通过自发探索SQL勘探,在大语言模型中加强从文字到SQL的生成 2506.07245v2 -
652 06-19 On Domain-Adaptive Post-Training for Multimodal Large Language Models Zum Domain-Adaptive Post-Training für multimodale große Sprachmodelle 关于多模式大语言模式的多模式后培训 2411.19930v3 -
653 06-19 Core Knowledge Deficits in Multi-Modal Language Models Kernwissensdefizite in multimodalen Sprachmodellen 多模式语言模型中的核心知识缺陷 2410.10855v4 -
654 06-19 Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion Doppelentendre: Robuste audiobasierte KI-generierte Lyrics-Erkennung über Multi-View Fusion 双向内容: 强力音频根据 AI 生成的音频通过多视图组合探测 2506.15981v1 -
655 06-19 A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension Ein vietnamesischer Datensatz für Textsegmentierung und Multiple Choices Leseverständnis 用于文字分割和多重选择的越南数据集阅读理解 2506.15978v1 -
656 06-19 FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation FEA-Bench: Ein Benchmark für die Bewertung der Code-Generierung auf Repository-Ebene für die Feature-Implementierung FEA-Bench:评估存储器一级实施地物代码生成的基准 2503.06680v2 -
657 06-19 Multi-use LLM Watermarking and the False Detection Problem Multi-Use LLM Watermarking und das Problem der falschen Erkennung 多用途LLM LLM 水标志和假探测问题 2506.15975v1 -
658 06-19 LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning LazyEviction: Verzögerte KV-Eviction mit Aufmerksamkeitsmusterbeobachtung für effizientes Long Reasoning LazyEviction:基于注意力模式观察的滞后KV驱逐,用于高效长推理 2506.15969v1 -
659 06-19 Reranking-based Generation for Unbiased Perspective Summarization Reranking-basierte Generation für unvoreingenommene Perspektive Zusammenfassung 重新排名的无偏见观点概述一代 2506.15925v1
Article 0
Title@2025-06-26 (4): HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Title: HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | HalluSegBench: Counterfactual Visual Reasoning for Segmentation Halluzination Evaluation | HalluSegBench:面向分割幻觉评估的反事实视觉推理 2506.21546v1 |
Authors (6): Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
然而,这些模型往往通过为不以图像内容为根据的物体制作分离面罩或错误地标出不相干区域而产生幻觉。现有的分解幻觉评价程序主要侧重于标签或文字幻觉,而没有操纵视觉环境,限制了它们诊断重大故障的能力。作为回应,我们引入了HalluSegeBench,这是第一个专门设计通过反事实视觉推理角度评价视觉地面幻觉的基准。我们的基准包括一套新颖的数据集,共有1340对反事实实例,涵盖281个独特的对象类别,以及一套新推出的计量指标,在视觉一致的场景编辑中量化幻觉敏感性。HalluSege-Bench的实验与最先进的视觉语言分解模型显示,由视觉驱动的幻觉比由标签驱动的幻觉更加普遍,模型往往以假分解为主,强调需要反事实推论来判断真真性。
Article 1
Title@2025-06-26 (4): Data Efficacy for Language Model Training
Title: Data Efficacy for Language Model Training | Dateneffizienz für Sprachmodellschulungen | 语文示范培训的数据效率 2506.21545v1 |
Authors (9): Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
数据是语言模型(LM)训练的基础。近期研究致力于提高数据效率,目的是通过选择最小或最优的训练数据子集来最大限度地提高性能。数据过滤、采样和选择等技术在这一领域发挥着关键作用。作为补充,我们定义了数据效能(Data Efficacy),其重点是通过优化训练数据的组织方式来最大限度地提高性能,这一方向仍相对缺乏探索。本文提出了一个在LM训练中考虑数据效能的通用范式DELT,突出了训练数据组织的重要性。DELT由三个部分组成:数据评分、数据选择和数据排序。在这些组成部分中,我们设计了可学习性-质量评分(LQS)作为数据评分的新实例,从梯度一致性的角度同时考虑每个数据样本的可学习性和质量。我们还设计了折叠排序(FO)作为数据排序的新实例,用于解决模型遗忘和数据分布偏差等问题。全面的实验验证了LM训练中的数据效能,并表明:第一,所提出的DELT的各种实例在不增加数据规模和模型规模的情况下,不同程度地提升了LM性能;第二,在这些实例中,将我们提出的LQS用于数据评分、Folding用于数据排序的组合取得了最显著的提升;最后,通过应用数据选择,数据效能可以与数据效率同时实现。因此,我们认为数据效能是LM训练中一个有前景的基础性方向。
Article 2
Title@2025-06-26 (4): “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
Title: “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets | “Was ist los, Doc?”: Analysieren, wie Nutzer Gesundheitsinformationen in groß angelegten KI-Datensätzen suchen | “怎么了,医生?” :分析用户如何在大型对话的AI数据集中寻求健康信息。 2506.21532v1 |
Authors (8): Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
人们越来越多地通过互动式聊天室寻求大型语言模型(LLMs)的保健信息,然而,这些对话的性质和内在风险基本上仍未探讨。在本文中,我们过滤了大规模对话性AI数据集,以达到HealChat-11K,这是由25K用户信息组成的11K真实世界对话的集成数据集。我们使用HealthChat-11K和诊所驱动的分类,以了解用户在寻求保健信息时如何与LLMs互动,以便系统研究用户在21个不同健康专业的相互作用。我们的分析揭示了用户如何和为什么寻求健康信息的性质,例如常见的互动、不完全的背景、情感行为和互动(例如,主要问题),从而可以引起对症,强调作为谈话AI部署的LLMS的保健支持能力需要改进。我们检索我们的分析并将其整合到一个整理数据集的代码和工艺品:https://github.com/yskapar/HehrChat。
Article 3
Title@2025-06-26 (4): OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Title: OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages | OpenNER 1.0: Standardisierte Open-Access-Datensätze für die Entity-Erkennung in 50+ Sprachen | OpenNER 1.0:标准化的开放获取实体识别数据集,50+语言 2412.09587v2 |
Authors (5): Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos
We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task.
我们介绍了公开提供的名称实体识别(NER)数据集的标准化OpenNER 1.0。OpenNER包含涵盖52种语言的36个NER公司,对不同名称实体的内涵作人注解。我们纠正了注解格式问题,将原始数据集标准化为统一格式,在整个公司统一实体类型名称,并在一个能够进行多语种和多神学净入学率研究的结构中提供这种收集。我们利用三个经过预先培训的多语言模型和两个大型语言模型提供基线结果,以比较最近模型的性能,促进未来NER的研究。我们发现,没有一种单一模式是所有语文的最佳模式,仍然需要大量工作才能从LLMM公司获得关于NER任务的高绩效。
Article 4
Title@2025-06-26 (4): Potemkin Understanding in Large Language Models
Title: Potemkin Understanding in Large Language Models | Potemkin Verständnis in großen Sprachmodellen | 大语言模型中的波坦金理解 2506.21521v1 |
Authors (4): Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM’s capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs – such as AP exams – are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
大型语言模型(LLMS)定期使用基准数据集进行评估。 但是,根据对一组成熟问题的答复对LLM的能力作出推断的理由何在?本文件首先提出了一个解决这一问题的正式框架。关键在于指出用于测试LLMS的基准 – – 如AP考试 – – 也是用来测试人的基准。然而,这产生了一种含义:这些基准只有在LLMS以反映人类误解的方式误解概念时才有效。否则,在基准上的成功只能表明对问题的理解:答案所驱使的理解幻觉与任何人类如何解释一个概念是无法调和的。我们提出了两个量化Poteemkins存在的程序:一个在三个领域使用专门设计的基准,另一个使用一般程序,提供较低的其流行程度。我们发现,Potemkins在各种模型、任务和领域之间都是普遍的。我们还发现,这些失败不仅反映了错误的理解,而且反映了概念表达中更深的内部不连贯。
Article 5
Title@2025-06-26 (4): skLEP: A Slovak General Language Understanding Benchmark
Title: skLEP: A Slovak General Language Understanding Benchmark | skLEP: Ein slowakischer Benchmark für allgemeines Sprachverständnis | skLEP:斯洛伐克通用语言理解基准 2506.21508v1 |
Authors (8): Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
在这项工作中,我们引入了SkLEP,这是专门为评价斯洛伐克自然语言理解(NLU)模式而设计的第一个全面基准;我们汇编了SkLEP, 涵盖九种不同的任务,包括象征性、句式和文件层面的挑战,从而对模型能力进行彻底评估;为建立这一基准,我们为斯洛伐克人专门设计了新的原始数据集,并仔细翻译了已经建立的英文NLU资源;在本文件中,我们还利用SkLEP任务,对一系列斯洛伐克特有、多语种和经过培训的英语语言模型进行了首次系统和广泛的评价;最后,我们还发布了完整的基准数据、一个开放源工具包,便利了模型的微调和评估,并在https://github.com/slovak-nlp/sklep上发布了一个公共领导板,希望促进斯洛伐克国家语言联盟的再生化和推动未来研究。
Article 6
Title@2025-06-26 (4): Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten | Mind2Web 2: 与代理法官评估代理搜索 2506.21506v1 |
Authors (26): Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing a great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
以Deep Research系统为代表的代理式搜索,由大型语言模型自主浏览网络、综合信息并返回有引文支持的全面答案,代表了用户与网络规模信息交互方式的重大转变。代理式搜索虽然有望带来更高的效率和认知减负,但其日益增长的复杂性和开放性已经超出了现有评测基准和方法的能力,后者大多假定搜索跨度短、答案静态。在本文中,我们提出了Mind2Web 2,这是一个包含130个真实、高质量、长程任务的基准,这些任务需要实时网页浏览和大量信息综合,其构建耗费了1000多小时的人工。为了应对评估随时间变化的复杂答案这一挑战,我们提出了一个新的Agent-as-a-Judge框架。我们的方法基于树状结构的评分细则设计来构建针对具体任务的评判代理,以自动评估答案的正确性和来源归属。我们对九个前沿代理式搜索系统和人类表现进行了全面评估,并进行了详细的错误分析,为未来的发展提供启示。表现最好的系统OpenAI Deep Research在只花费一半时间的情况下已经能够达到人类表现的50-70%,显示出巨大的潜力。总之,Mind2Web 2为开发和评测下一代代理式搜索系统提供了严格的基础。
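The abstract above mentions a tree-structured rubric for judging answers, but does not say how node scores are combined. The following is only a minimal sketch under stated assumptions: the `RubricNode` class, the node names, and the weighted-mean aggregation are all hypothetical illustrations, not the Mind2Web 2 implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (all names are illustrative)."""
    name: str
    weight: float = 1.0
    score: Optional[float] = None  # set on leaves only, expected in [0, 1]
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Return the leaf score, or the weighted mean of the children's scores.

    The weighted mean is an assumption made for this sketch; a real judge
    could use a different rule (for example, gating on critical leaves).
    """
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * aggregate(child) for child in node.children) / total_weight


# Hypothetical rubric covering answer correctness and source attribution.
rubric = RubricNode("task", children=[
    RubricNode("correctness", children=[
        RubricNode("item_found", score=1.0),
        RubricNode("price_matches", score=0.0),
    ]),
    RubricNode("attribution", children=[
        RubricNode("url_supports_claim", score=1.0),
    ]),
])
overall = aggregate(rubric)
```

Here the leaf scores would come from a judge model; in this sketch they are hard-coded so the aggregation step can be inspected on its own.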
Article 7
Title@2025-06-26 (4): Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
Title: Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments | Verbesserung des Nutzerengagements im sozial-gesteuerten Dialog durch interaktive LLM-Alignments | 通过互动LLM调整,加强用户参与社会驱动对话 2506.21497v1 |
Authors (8): Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user’s reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
通过互动提升用户参与度在社会驱动的对话中起着至关重要的作用。以往的工作优化模型以对相关知识进行推理或规划对话行为流,但用户参与度与知识或对话行为之间的关系十分微妙,并不能保证用户在社会驱动对话中的参与。为此,我们利用对话未来发展中的信号,使交互式LLM能够学习用户参与度。具体而言,我们采用一个更直接、更相关的用户参与度指标,即交互之后用户与对话意图相关的反应,作为对齐交互式LLM的奖励。为此,我们开发了一个用户模拟器与目标交互式LLM进行交互,并通过i×MCTS(面向交互的蒙特卡洛树搜索)探索用户与交互式LLM系统之间的交互。由此,我们利用i×MCTS收集了一个包含较高质量与较低质量体验配对的数据集,并据此通过直接偏好优化(DPO)对齐交互式LLM,以实现高水平的用户参与。在两个社会驱动的对话场景(情感支持对话和公益劝说)上进行的实验表明,我们的方法有效提升了交互式LLM中的用户参与度。
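The alignment step in this abstract applies direct preference optimization (DPO) to pairs of higher- and lower-quality experiences. As a rough illustration, the sketch below computes the standard DPO loss for one such pair, assuming the sequence log-probabilities under the policy and a frozen reference model are already available. The function names, the `beta` default, and the idea of scoring a single pair in isolation are illustrative assumptions, not details taken from the paper.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are summed token log-probabilities of a full response;
    beta=0.1 is an arbitrary illustrative value.
    """
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference model, both margins are zero and the loss is log 2. It falls as the policy shifts probability toward the preferred response.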
Article 8
Title@2025-06-26 (4): Bridging Offline and Online Reinforcement Learning for LLMs
Title: Bridging Offline and Online Reinforcement Learning for LLMs | Überbrückung Offline- und Online-Verstärkungslernen für LLMs | 为LLMMs搭桥离线和在线加强学习 2506.21495v1 |
Authors (12): Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
我们调查了在从离线到半离线到完全在线的可核查和不可核实任务制度时,改进大语言模式的强化学习方法的有效性。我们的实验内容包括经过一套基准评估后进行可核查数学和不可核实教学的培训。在这些环境中,我们广泛比较了在线和半在线直接优惠优化和小组奖励政策优化目标,并令人惊讶地发现这些变式的类似业绩和趋同,它们都大大优于离线方法。我们详细分析了培训动态和超参数选择战略,以取得最佳结果。最后,我们表明,与可核查和不可核实的奖励相结合的多任务共同提高了这两类任务的业绩。
Article 9
Title@2025-06-26 (4): Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages
Title: Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages | Mit Phonemes: Mehrsprachigkeit von LLMs für nicht-lateinische Script-Sprachen verbessern | 以电话提示:提高LLMS的非拉丁文拼写语言多重语言质量 2411.02398v3 |
Authors (7): Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
Although multilingual LLMs have achieved remarkable performance across benchmarks, we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation from both leads to our significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
虽然多语种LLM在基准方面取得了显著的成绩,但我们发现,在当代LLM家族中,这些LLM在非拉丁文字语言上的表现仍然不尽如人意。这一差异源于LLM在接受正拼写文字学培训之前就已接受过拼写文字学的训练,这些文字主要是拉丁字符,这些拉丁字符模糊了他们与非拉丁文字的同声文字学。我们建议利用电话抄录作为辅助信号,以诱导脚本和拉丁文字表达。我们的研究显示,结合语音信号可以改善非拉丁文字和拉丁文字语言的成绩,对缩小两种文字的成绩差距产生特别显著的影响。我们通过详细实验发现,语音和文字文字文字学(ICL)可以找到不同的例子。这激励了我们拟议的混合ICL检索战略,在这两个战略中,从中进一步整合导致我们对拉丁文字语言(高达12.6%)和非拉丁文字语言(高达15.1%)和非拉丁文字语言(高达15.1%)的成绩与随机的ICL检索相比都有显著的改进。
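The abstract says that orthographic and phonemic views retrieve distinct in-context examples, and that Mixed-ICL aggregates from both. The sketch below shows one way such aggregation could look: rank a pool under each view, then interleave the two top-k lists without duplicates. The bigram-overlap similarity, the interleaving rule, the field names, and the toy strings are all assumptions made for illustration; the paper's actual retriever and aggregation rule are not given in the abstract.

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard overlap of character bigrams (toy similarity measure)."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B) if (A | B) else 0.0


def top_k(query, pool, key, k):
    """Indices of the k pool items most similar to the query under one view."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: jaccard(query[key], pool[i][key]),
                    reverse=True)
    return ranked[:k]


def mixed_icl(query, pool, k=2):
    """Interleave orthographic and phonemic top-k lists, dropping duplicates."""
    ortho = top_k(query, pool, "text", k)
    phon = top_k(query, pool, "phonemes", k)
    merged = []
    for o, p in zip(ortho, phon):
        for idx in (o, p):
            if idx not in merged:
                merged.append(idx)
    return [pool[i] for i in merged]


# Made-up placeholder strings, not real transcriptions.
pool = [
    {"text": "abcd", "phonemes": "wxyz"},
    {"text": "qrst", "phonemes": "kato"},
    {"text": "mnop", "phonemes": "efgh"},
]
query = {"text": "abce", "phonemes": "katu"}
examples = mixed_icl(query, pool, k=1)
```

In this toy pool each view picks a different example, so the merged list contains both.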
Article 10
Title@2025-06-26 (4): From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents | Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents | 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v2 |
Authors (23): Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
信息检索是现代知识获取的基石,每天在各个领域支撑着数十亿次查询。然而,传统的基于关键词的搜索引擎越来越难以满足复杂的多步骤信息需求。我们的立场是,具备推理和智能体能力的大型语言模型(LLM)正在开创一种称为Agentic Deep Research(智能体深度研究)的新范式。这些系统将自主推理、迭代检索和信息综合紧密整合为动态反馈循环,从而超越传统的信息搜索技术。我们追溯了从静态网络搜索到能够规划、探索和学习的交互式智能体系统的演进。我们还引入了一条测试时扩展定律,以形式化计算深度对推理和搜索的影响。在基准结果和开源实现兴起的支持下,我们证明Agentic Deep Research不仅显著优于现有方法,而且有望成为未来信息获取的主导范式。所有相关资源,包括工业产品、研究论文、基准数据集和开源实现,均已为社区汇总于 https://github.com/DavidZWZ/Awesome-Deep-Research。
Article 11
Title@2025-06-26 (4): Logios : An open source Greek Polytonic Optical Character Recognition system
Title: Logios : An open source Greek Polytonic Optical Character Recognition system | Logios : Ein offenes griechisches Polytonisches optisches Zeichenerkennungssystem | Logios: 开放源码希腊多元光学特征识别系统 2506.21474v1 |
Authors (2): Perifanos Konstantinos, Goutsos Dionisis
In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.
在本文中,我们展示了专门为准确识别希腊多元文字并将其数字化而设计的光学特征识别系统(OCR),通过利用地物提取和序列学习的复数层的结合优势,我们的系统应对希腊多元文字构成的独特挑战,这一方法旨在克服传统的光学特征识别方法的局限性,在准确性和效率方面提供显著改进,我们将基本模型作为开放源图书馆发布,并使我们的光学特征识别平台可供学术使用。
Article 12
Title@2025-06-26 (4): TopK Language Models
Title: TopK Language Models | TopK-Sprachenmodelle | 顶 K 语言模式 2506.21468v1 |
Authors (4): Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
稀疏自编码器(SAE)已成为分析和解释基于Transformer的语言模型(LM)激活空间的重要工具。然而,SAE存在若干缺陷,削弱了其实用性和内部有效性。由于SAE是事后训练的,未能发现某一特定概念究竟是SAE自身的失败,还是因为底层LM并未表示该概念,尚不清楚。训练条件和架构选择会影响SAE学到哪些特征,这使问题更加严重。在追踪LM于训练过程中如何学习概念时,特征稳定性的缺乏也使得跨不同检查点比较SAE特征变得困难。为解决这些局限,我们对Transformer架构进行了修改,在选定的层中引入TopK激活函数,使模型的隐藏状态等价于TopK SAE的潜在特征。这种方法无需事后训练,同时提供与SAE相当的可解释性。由此得到的TopK LM在模型规模、计算效率和可解释性之间实现了良好的权衡。尽管只是简单的架构改动,TopK LM仍保持了原有能力,同时带来稳健的可解释性收益。我们的实验表明,TopK LM学到的稀疏表示能够通过有针对性的神经元干预实现成功的引导,并便于对跨检查点和跨层的神经元形成过程进行细致分析。这些特性使TopK LM成为理解语言模型如何学习和表示概念的稳定可靠的工具,我们相信这将显著推动未来关于模型可解释性和可控性的研究。
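The core mechanism named in this abstract is a TopK activation applied to hidden states at chosen layers. A minimal NumPy sketch of such an activation follows: it keeps the k largest entries of each hidden vector and zeroes the rest. Selecting by raw value (rather than by magnitude, or after a ReLU) is an assumption here, and the function name is hypothetical; the abstract does not specify these details.

```python
import numpy as np


def topk_activation(h, k):
    """Keep the k largest entries in each row of h; set all others to zero."""
    h = np.asarray(h, dtype=float)
    # Indices of the k largest values per row (order within the top k is arbitrary).
    idx = np.argpartition(h, -k, axis=-1)[..., -k:]
    out = np.zeros_like(h)
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return out


hidden = np.array([[0.1, 2.0, -1.0, 0.5]])
sparse = topk_activation(hidden, k=2)
```

Each row of the result has at most k non-zero entries. That sparsity is the property that makes a hidden state comparable to a TopK SAE latent vector.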
Article 13
Title@2025-06-26 (4): Aligning Spoken Dialogue Models from User Interactions
Title: Aligning Spoken Dialogue Models from User Interactions | Ausrichten von gesprochenen Dialogmodellen aus Benutzerinteraktionen | 校对用户互动中的口语对话框模型 2506.21463v1 |
Authors (4): Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
我们提出一个新的优惠调整框架,以改善用户互动实时对话的口语对话模式; 目前优惠的学习方法主要侧重于基于文本的语言模式,不直接适应实时语音互动的复杂性,其动态更丰富(例如中断、互接),发言者旋转之间没有明显的分化; 我们创建了大型数据集,由来自原始多端语音对话的150 000多对特惠组合组成,并附有AI反馈说明,以涵盖语言内容和时间背景变化的偏好; 我们利用离线调整方法微调一个全不易自定义的自动语音对语音模式。 广泛的实验表明,对通用对话的反馈可以始终有效地改进口语对话模式,产生更实际、更安全和更符合背景的交互作用。 我们运用微调模型,进行整体的人类评价,以评估单点话之外的影响。 我们的调查结果揭示了各种动态之间平衡兼顾的重要性,这对自然实时语音对话系统至关重要。
Article 14
Title@2025-06-26 (4): Spatial Mental Modeling from Limited Views
Title: Spatial Mental Modeling from Limited Views | Räumliche mentale Modellierung aus begrenzten Ansichten | 根据有限观点进行空间精神建模 2506.21458v1 |
Authors (14): Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for “what-if” movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, “map-then-reason”, that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
视觉语言模型(VLM)能否像人类一样,仅凭少数几个视角想象出完整场景?人类会形成空间心理模型,即对未见空间的内部表征,用以推理布局、视角和运动。我们新提出的MindCube基准包含3,268张图像上的21,154个问题,揭示了这一关键差距:现有VLM的表现接近随机。借助MindCube,我们系统评估了VLM通过表示位置(认知映射)、朝向(视角采择)和动态(“假设”运动的心理模拟)来构建稳健空间心理模型的能力。随后,我们探索了三种帮助VLM近似空间心理模型的方法,包括未见的中间视角、自然语言推理链和认知地图。显著的提升来自一种协同方法“先绘图后推理”(map-then-reason),它联合训练模型先生成认知地图,再在其上进行推理。通过训练模型在这些内部地图上进行推理,我们将准确率从37.8%提升到60.8%(+23.0%)。加入强化学习后,性能进一步提升至70.7%(+32.9%)。我们的关键见解是,这种对空间心理模型的支架式构建,即主动构建并利用内部结构化空间表征并配合灵活的推理过程,能显著改善对不可观测空间的理解。
Article 15
Title@2025-06-26 (4): Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
Title: Text2Cypher Across Languages: Evaluating Foundational Models Beyond English | Text2Cypher Across Sprachen: Bewertung von Grundmodellen jenseits des Englischen | 跨语言文本:评价超越英语的基础模型 2506.21445v1 |
Authors (2): Makbule Gulcin Ozsoy, William Tai
Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.
大型语言模型的最近进展使自然语言界面能够将用户问题转换成数据库查询,如Text2SQL、Text2SPARQL和Text2Cypher。这些界面加强了数据库的可访问性,但今天大多数研究只侧重于英语,而其他语言的评价则有限。本文调查了文本2Cypher任务的多种语言基础LMs的绩效。我们创建并发布了一套多语种测试,将英语问题翻译成西班牙文和土耳其文,同时保留原始的Cypher查询,从而能够进行公平的跨语种比较。我们利用标准化的提示和衡量标准对多个基础模型进行评估。我们的成果显示了一种一致的性能模式:英语、西班牙语和德语最高,土耳其语最低。我们将此归因于培训数据提供和语言特点的差异。此外,我们探索了将任务提示翻译成西班牙语和土耳其文的影响。结果显示,评价指标没有多大变化,表明迅速翻译影响很小。我们的调查结果强调,在多语种查询生成方面需要进行更具包容性的评价和发展。未来的工作包括对多种语言的本地化和微调。
Article 16
Title@2025-06-26 (4): Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
Title: Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection | Domänenwissen-verbesserte LLMs für Betrug und Konzept-Drift-Erkennung | 防止欺诈和概念漂流探测的有知识增强的有限LMs 2506.21443v1 |
Authors (3): Ali Şenol, Garima Agrawal, Huan Liu
Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.
由于语言模式不断演变以及概念漂移(CD),即随时间改变交互语境或意图的语义或话题变化,在动态平台上检测欺骗性对话日益困难。这些变化可能掩盖恶意意图或模仿正常对话,使准确分类变得困难。尽管大型语言模型(LLM)在自然语言任务中表现出色,但在风险敏感场景中常常受困于语境歧义和幻觉。为应对这些挑战,我们提出了一个领域知识(DK)增强的LLM框架,将预训练LLM与结构化的、任务特定的洞见相结合,以执行欺诈检测和概念漂移检测。所提出的架构由三个主要部分组成:(1)用于检测虚假或欺骗性对话的DK-LLM模块;(2)用于判断是否发生语义变化的漂移检测单元(OCDD);(3)用于将漂移分类为良性或欺诈性的第二个DK-LLM模块。我们首先使用一个虚假评论数据集验证领域知识的价值,然后将完整框架应用于SEConvo,这是一个包含多种欺诈和垃圾信息攻击的多轮对话数据集。结果表明,我们的系统能够高精度地检测虚假对话,并有效地对漂移的性质进行分类。在结构化提示的引导下,基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明,引入领域知识和漂移感知能显著提升高风险NLP应用中的性能、可解释性和鲁棒性。
Article 17
Title@2025-06-26 (4): Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations
Title: Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations | Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen | 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v3 |
Authors (4): Zeinab Dehghani, Mohammed Naveed Akram, Koorosh Aslansefat, Adil Khan
Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It creates simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
GPT、LLAMA和Claude等大型语言模型在生成文本方面已经变得非常强大,但它们仍然是黑盒,所以很难理解它们如何决定要说什么。缺乏透明度可能会有问题,特别是在信任和问责很重要的领域。为了对此有所帮助,我们引入了SMILE,这是一个解释这些模型如何对快速的不同部分作出反应的新方法。SMILE是模型的不可知性,通过略微改变输入、测量产出变化,然后突出哪些词具有最大影响来开展工作。创建简单的直观热图,显示一个最迅速的事物的哪些部分。我们用几个主要的LLMMMS测试了SMILE,并使用了精确性、一致性、稳定性和忠诚性等指标来表明它提供了清晰可靠的解释。通过让这些模型更容易理解,SMILE让我们更接近于使AI更加透明和可信。
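The abstract describes a perturb-and-measure loop: change the input slightly, measure how the output changes, and highlight the words with the most impact. The sketch below is a deliberately simplified leave-one-word-out version of that idea, not SMILE itself. The toy keyword-based `toy_model` stands in for an LLM, and the absolute-difference distance is an assumption; any statistical distance measures or surrogate model the actual method may use are not reproduced.

```python
def word_importance(prompt, model, distance):
    """Score each word by how far the output moves when that word is removed."""
    words = prompt.split()
    baseline = model(prompt)
    scores = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores.append((word, distance(baseline, model(perturbed))))
    return scores


def toy_model(text):
    """Hypothetical stand-in for an LLM: a keyword-based sentiment score."""
    lexicon = {"good": 1.0, "bad": -1.0}
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())


scores = word_importance("the movie was good", toy_model, lambda a, b: abs(a - b))
top_word = max(scores, key=lambda s: s[1])[0]
```

The per-word scores are the kind of values a heat map over the prompt would display. Because the function only calls `model` as a black box, the loop stays model-agnostic.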
Article 18
Title@2025-06-26 (4): Rethinking LLM Training through Information Geometry and Quantum Metrics
Title: Rethinking LLM Training through Information Geometry and Quantum Metrics | Rethinking LLM Training durch Informationsgeometrie und Quantenmetrics | 通过信息几何和量度测量重新思考LLM培训 2506.15830v2 |
Authors (1): Riccardo Di Sipio
Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-aware approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
大型语言模型(LLMs)的最佳化(LLMs)在具有非欧洲语言结构的高维参数空间上展开。信息几何用Fisher信息度量来测量这一景观,从而能够通过自然梯度下降进行更有原则的学习。虽然这种几何透镜通常不切实际,但澄清了尖锐迷你、一般化和观察到的测量法等现象。我们争辩说,曲线认知法的方法加深了我们对LLM培训的理解。 最后,我们根据Fubini-Study 度量子和Quantum Fisher信息对量子类比进行推测,暗示了量子强化系统中的高效优化。
Article 19
Title@2025-06-26 (4): Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference
Title: Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference | Skalierbare Bayesische Low-Rank-Anpassung von großen Sprachmodellen über stochastische Variations-Subraum-Inferenz | 通过Stochastic变异性子空间推断,对大语言模型进行可缩放的Bayesian低Rank 2506.21408v1 |
Authors (5): Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs because they require additional parameters compared to LoRA. In this work we present Scalable Bayesian Low-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an r-dimensional subspace, for LoRA rank r. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ~1000 additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as many base parameters as prior work.
尽管大型语言模型(LLM)得到广泛使用,但众所周知它们会幻觉出错误信息,且校准较差。这使得对这些模型进行不确定性量化至关重要,尤其是在自主系统和医疗等高风险领域。以往的工作通过对微调模型的低秩适应(LoRA)参数进行推断,使基于贝叶斯深度学习的方法在该问题上更易处理。这些方法虽然有效,但由于相比LoRA需要更多额外参数,难以扩展到更大的LLM。在这项工作中,我们提出了基于随机变分子空间推断的可扩展贝叶斯低秩适应(ScalaBL)。对于秩为r的LoRA,我们在一个r维子空间中进行贝叶斯推断。通过将LoRA参数重新用作投影矩阵,我们能够将该子空间中的样本映射到LLM的完整权重空间。这使我们能够使用随机变分推断来学习方法中的所有参数。尽管子空间维度很低,我们仍能取得与最先进方法相当的性能,而只需约1000个额外参数。此外,该方法使我们能够扩展到迄今最大的贝叶斯LLM,其基础参数量是以往工作的四倍。
Article 20
Title@2025-06-26 (4): DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Title: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation | DiffuCoder: Maskierte Difffusionsmodelle für die Codegenerierung verstehen und verbessern | DiffuCoder:理解和改进代代码生成的蒙面传播模式 2506.20639v2 |
Authors (7): Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
扩散大型语言模型(dLLM)是自回归(AR)模型的有力替代方案,因为其去噪模型作用于整个序列。dLLM的全局规划和迭代细化特性对代码生成尤其有用。然而,目前面向编码的dLLM训练和推理机制仍未得到充分探索。为了揭示dLLM的解码行为并释放其编码潜力,我们系统地研究了其去噪过程和强化学习(RL)方法。我们在1300亿个代码词元上训练了一个7B的dLLM,即DiffuCoder。以该模型为试验平台,我们分析了其解码行为,揭示了它与AR模型的不同之处:(1)dLLM无需依赖半自回归解码即可决定其生成的因果程度;(2)提高采样温度不仅使词元选择多样化,也使其生成顺序多样化。这种多样性为RL采样(rollout)创造了丰富的搜索空间。在RL训练方面,为了降低词元对数似然估计的方差并保持训练效率,我们提出了coupled-GRPO,这是一种新的采样方案,为训练中使用的补全构造互补的掩码噪声。在我们的实验中,coupled-GRPO显著提升了DiffuCoder在代码生成基准上的性能(在EvalPlus上+4.4%),并减少了解码过程中对AR偏置的依赖。我们的工作为理解dLLM的生成机制提供了更深入的见解,并提供了一个有效的、扩散原生的RL训练框架。https://github.com/apple/ml-diffucoder。
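The abstract describes coupled-GRPO as constructing complementary mask noise for the completions used in training. The sketch below illustrates only that complementarity idea: it draws one random mask over a completion and pairs it with its complement, so every token position is masked in exactly one of the two passes. The function name, the `mask_ratio` parameter, and the uniform sampling are assumptions for illustration; the paper's actual noise schedule and its use inside the GRPO objective are not reproduced.

```python
import numpy as np


def complementary_masks(length, rng, mask_ratio=0.5):
    """Return two boolean masks over token positions that partition the sequence.

    True means the token is masked in that pass. mask_ratio is an
    illustrative parameter, not a value from the paper.
    """
    mask_a = rng.random(length) < mask_ratio
    mask_b = ~mask_a  # the complement covers exactly the tokens mask_a leaves visible
    return mask_a, mask_b


rng = np.random.default_rng(0)
mask_a, mask_b = complementary_masks(8, rng)
```

The pair covers every position exactly once, so each token receives a log-likelihood estimate in one of the two forward passes. This is one plausible reading of how complementary masks could reduce estimator variance.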
Article 21
Title@2025-06-26 (4): Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings
Title: Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings | Hybrides Deep Learning und Signalverarbeitung für die arabische Dialekterkennung in Low-Resource-Einstellungen | 低资源设置中阿拉伯语语音识别的混合深深学习和信号处理 2506.21386v1 |
Authors (2): Ghazal Al-Shwayyat, Omer Nezih Gerek
Arabic dialect recognition presents a significant challenge in speech technology due to the linguistic diversity of Arabic and the scarcity of large annotated datasets, particularly for underrepresented dialects. This research investigates hybrid modeling strategies that integrate classical signal processing techniques with deep learning architectures to address this problem in low-resource scenarios. Two hybrid models were developed and evaluated: (1) Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered subset of the Common Voice Arabic dataset, with dialect labels assigned based on speaker metadata. Experimental results demonstrate that the MFCC + CNN architecture achieved superior performance, with an accuracy of 91.2% and strong precision, recall, and F1-scores, significantly outperforming the Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These findings highlight the effectiveness of leveraging spectral features with convolutional models for Arabic dialect recognition, especially when working with limited labeled data. The study also identifies limitations related to dataset size, potential regional overlaps in labeling, and model optimization, providing a roadmap for future research. Recommendations for further improvement include the adoption of larger annotated corpora, integration of self-supervised learning techniques, and exploration of advanced neural architectures such as Transformers. Overall, this research establishes a strong baseline for future developments in Arabic dialect recognition within resource-constrained environments.
由于阿拉伯语的语言多样性以及大型标注数据集的稀缺(尤其是代表性不足的方言),阿拉伯语方言识别是语音技术中的一项重大挑战。本研究考察了将经典信号处理技术与深度学习架构相结合的混合建模策略,以在低资源场景下解决这一问题。我们开发并评估了两种混合模型:(1)梅尔频率倒谱系数(MFCC)结合卷积神经网络(CNN);(2)离散小波变换(DWT)特征结合循环神经网络(RNN)。模型在Common Voice阿拉伯语数据集经方言筛选的子集上训练,方言标签依据说话人元数据分配。实验结果表明,MFCC + CNN架构表现更优,准确率达到91.2%,精确率、召回率和F1分数也较高,显著优于准确率为66.5%的小波 + RNN配置。这些发现凸显了将频谱特征与卷积模型结合用于阿拉伯语方言识别的有效性,尤其是在标注数据有限的情况下。研究还指出了与数据集规模、标注中潜在的地区重叠以及模型优化相关的局限,为未来研究提供了路线图。进一步改进的建议包括采用更大的标注语料库、整合自监督学习技术,以及探索Transformer等先进神经架构。总体而言,本研究为资源受限环境下阿拉伯语方言识别的未来发展建立了一个有力的基线。
Article 22
Title@2025-06-26 (4): Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation
Title: Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation | Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation | 利用LLM协助的对活检索一代人查询了解 2506.21384v1 |
Authors (4): Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.
现实世界中的实时检索增强生成(RAG)系统在处理用户查询时面临重大挑战,这些查询往往带有噪声、含义模糊并包含多个意图。虽然RAG利用外部知识增强了大型语言模型(LLM),但现有系统通常难以应对此类复杂输入,因为它们往往是在更干净的数据上训练或评估的。本文介绍了Omni-RAG,这是一个旨在提高RAG系统在实时、开放域环境中的鲁棒性和有效性的新框架。Omni-RAG利用LLM辅助的查询理解,通过三个关键模块对用户输入进行预处理:(1)深度查询理解与分解,利用带有定制提示的LLM对查询去噪(例如纠正拼写错误),并将多意图查询分解为结构化的子查询;(2)意图感知的知识检索,针对每个子查询从语料库中检索(即使用OpenSearch检索FineWeb)并汇总结果;(3)重排序与生成,先由重排序器(即BGE)精选文档,再由LLM(即Falcon-10B)使用思维链提示生成最终回答。Omni-RAG旨在通过稳健地处理复杂且带噪声的查询,弥合当前RAG能力与现实应用需求(例如SIGIR 2025 LiveRAG挑战赛所强调的需求)之间的差距。
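The three modules in this abstract (query decomposition, per-sub-query retrieval with aggregation, then reranking before generation) form a pipeline whose control flow can be sketched without any real models. Everything below is a toy stand-in: `decompose` splits on the word "and" instead of prompting an LLM, `retrieve` and `rerank` use word overlap instead of OpenSearch and a BGE reranker, and generation is omitted. All function names are hypothetical and none of this is the authors' implementation.

```python
def tokens(text):
    """Lowercased word set with basic punctuation stripped."""
    return set(text.lower().replace("?", " ").replace(",", " ").split())


def decompose(query):
    """Stand-in for LLM-based decomposition: split a multi-intent query on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def retrieve(sub_query, corpus, k=1):
    """Stand-in retriever: rank documents by word overlap with the sub-query."""
    scored = [(len(tokens(sub_query) & tokens(doc)), doc) for doc in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def rerank(query, docs, k=2):
    """Stand-in reranker: order aggregated candidates by overlap with the full query."""
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]


def select_context(query, corpus):
    """Decompose, retrieve per sub-query, aggregate without duplicates, then rerank."""
    candidates = []
    for sub_query in decompose(query):
        for doc in retrieve(sub_query, corpus):
            if doc not in candidates:
                candidates.append(doc)
    return rerank(query, candidates)


corpus = [
    "Paris is the capital of France",
    "The Seine flows through Paris",
    "Berlin is the capital of Germany",
    "Bananas are yellow",
]
context = select_context(
    "what is the capital of France and what is the capital of Germany", corpus
)
```

In this toy run each sub-query contributes its own best document. That is the point of retrieving per intent rather than once for the whole query.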
Article 23
Title@2025-06-26 (4): Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
Title: Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models | Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models | AI 文学批评主义的结构性方法:大语言模型利用Greimas半语言广场 2506.21360v1 |
Authors (4): Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework’s results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.
大型语言模型(LLM)擅长理解和生成文本,但难以为思想深刻、叙事复杂的作品提供专业的文学批评。本文提出GLASS(Greimas Literary Analysis via Semiotic Square),这是一个基于格雷马斯符号矩阵(Greimas Semiotic Square,GSS)的结构化分析框架,用于增强LLM进行深入文学分析的能力。GLASS有助于快速剖析叙事作品的叙事结构和深层含义。我们提出了首个基于GSS的文学批评数据集,其中包含对48部作品的详细分析。随后,我们采用LLM-as-a-judge范式,为基于GSS的文学批评提出了量化指标。在多部作品和多个LLM上与专家批评相比,我们框架的结果表现出色。最后,我们将GLASS应用于39部经典作品,产出了原创且高质量的分析,填补了现有研究空白。本研究为文学研究和教育提供了一种基于AI的工具,并为理解文学参与背后的认知机制提供了启示。
Article 24
Title@2025-06-26 (4): Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
Title: Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts | Latent Prototype Routing: Erzielen einer nahezu perfekten Lastabgleichung in Mixture-of-Experts | 原型原型路由:在混合专家中实现近效果负载平衡 2506.21328v1 |
Authors (1): Jiajie Yang
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models – including DeepSeek-V3, Qwen3-MoE, and Mixtral – demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average and improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.
专家混合(Mixture-of-Experts,MoE)架构已成为高效扩展大型语言模型(LLM)的关键策略。然而,当前的MoE系统存在严重的负载不平衡:在训练和推理过程中,只有一小部分专家被持续激活,导致模型容量和计算资源严重利用不足。在这项工作中,我们从聚类的视角重新审视专家路由,并提出潜在原型路由(Latent Prototype Routing,LPR),这是一个新的路由框架,它在推广现有方法的同时促进专家的均衡利用,且不损害下游性能。在多个开源MoE模型(包括DeepSeek-V3、Qwen3-MoE和Mixtral)上的大量实验表明,LPR将专家负载的基尼系数平均从0.70降至0.035,并将最小与最大专家负载之比从1e-6提高到0.70,实现了近乎完美的负载均衡。
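The two load-balance metrics quoted in the LPR abstract, the Gini coefficient of expert load and the min-max load ratio, are easy to compute from per-expert token counts. The sketch below uses the standard sorted-value formula for the Gini coefficient; the example load vectors are invented for illustration and are not taken from the paper, and the paper's exact measurement protocol is not given in the abstract.

```python
def gini(loads):
    """Gini coefficient of non-negative loads: 0 means perfectly even."""
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-value formula: G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n,
    # with i running from 1 to n over the ascending values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def min_max_ratio(loads):
    """Ratio of the least-loaded to the most-loaded expert (1.0 is ideal)."""
    highest = max(loads)
    return min(loads) / highest if highest > 0 else 0.0

# Illustrative token counts per expert (made up for this example).
balanced = [100, 100, 100, 100]
skewed = [0, 0, 0, 400]

print(gini(balanced), min_max_ratio(balanced))
print(gini(skewed), min_max_ratio(skewed))
```

With n experts, the Gini coefficient of a fully collapsed router (one expert receives everything) is (n - 1) / n, so values near 0.70 indicate heavy concentration while values near 0.035 indicate an almost uniform load.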
Article 25
Title@2025-06-26 (4): Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Title: Exploring Adapter Design Tradeoffs for Low Resource Music Generation | Erforschung von Adapter-Design-Tradeoffs für Low Resource Music Generation | 探索用于低资源音乐制作的适应设计取舍 2506.21298v1 |
Authors (3): Atharva Mehta, Shivam Chauhan, Monojit Choudhury
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given case of low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt while lacking in providing stability in notes, rhythm alignment, and aesthetics. Also, it is computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training and are more efficient, and can produce better quality output in comparison, but have slightly higher redundancy in their generations.
微调MusicGen和Mustango等大规模音乐生成模型的计算成本高昂,通常需要更新数十亿个参数,因而需要大量硬件资源。参数高效微调(PEFT)技术,尤其是基于适配器的方法,已成为一种有前景的替代方案,能够以极少的可训练参数实现适配,同时保持模型性能。然而,适配器的设计选择(包括架构、位置和规模)众多,对于给定的低资源音乐流派,尚不清楚哪些组合能产生最优的适配器以及原因。本文通过研究两种AI音乐模型(MusicGen和Mustango)在两种流派(印度斯坦古典音乐和土耳其马卡姆音乐)上的多种适配器配置,尝试回答这一问题。我们的发现揭示了明显的权衡:基于卷积的适配器擅长捕捉细粒度的局部音乐细节,如装饰音和短旋律乐句;而基于Transformer的适配器能更好地保留对结构化即兴演奏至关重要的长程依赖。此外,我们分析了不同适配器规模下的计算资源需求,表明中等规模的适配器(4000万参数)能在表现力与质量之间取得最佳平衡。我们还发现,基于扩散的模型Mustango能生成更多样化的输出,且更贴合输入提示中的描述,但在音符稳定性、节奏对齐和美感方面有所欠缺;同时,它计算量大,训练所需时间显著更长。相比之下,MusicGen这类自回归模型训练更快、效率更高,生成质量也相对更好,但其生成结果的冗余度略高。
Article 26
Title@2025-06-26 (4): Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Title: Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models | Erkennung von Verweisen auf Ausdrücke im visuell begründeten Dialog mit autoregressiven Sprachmodellen | 与自动递减语言模型进行视觉基础对话中检测引用表达式 2506.21294v1 |
Authors (2): Bram Willemsen, Gabriel Skantze
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
在本文中,我们探索了一种只使用文本、自动递减语言模型的方法,从视觉对话中提取参考表达方式,更具体地说,目的是调查仅语言背景就能够在多大程度上为探测在对话的视觉背景中具有(可视的)可参考的提及提供参考信息;为此,我们调整了预先训练的大型语言模式(LLM),以在对话中进行相对由过程决定的提及范围,方法是通过下方预测在文本中标出提及界限。我们的调查结果表明,即使使用中等大小的LLM、相对小的数据集和有参数效率的微调,只使用文本的方法也可能有效,突出语言背景对这项任务的相对重要性。然而,我们争辩说,这项任务是一个固有的多式问题,并讨论了单式方法的根本限制。
Article 27
Title@2025-06-26 (4): Small Encoders Can Rival Large Decoders in Detecting Groundedness
Title: Small Encoders Can Rival Large Decoders in Detecting Groundedness | Kleine Encoder können große Decoder bei der Erkennung von Erdlichkeit rivalisieren | 在地面探测中能够使大型分离器在探测地面时发生迭接 2506.21288v1 |
Authors (7): Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
利用外部环境增强大型语言模型(LLMS),大大改善了其在自然语言处理(NLP)任务方面的表现;然而,LLMS在所提供背景缺乏信息、往往诉诸无根据的猜测或内部知识时,努力可靠地回答询问;基础(产生得到背景的大力支持的答复)对于确保事实的一致性和可信赖性至关重要;本研究的重点是查明某一询问是否以LLMS在昂贵的答案生成之前提供的背景文件为依据;这种检测机制可以大大减少推断时间和资源消耗。我们显示,轻量、特定任务编码模型,如RoBERTA和NomicBERT, 微调整理数据集,能够达到与Llama3 8B和GPT4o等最新高水平的LLMMS的精确度,同时减少数量级的推断力。该代码可在以下网址查阅:https://github.com/chandarlab/Hallucincate-less。
Article 28
Title@2025-06-26 (4): Thinkless: LLM Learns When to Think
Title: Thinkless: LLM Learns When to Think | Denklos: LLM lernt, wann man denkt | 无思想:LLM学习思考时间 2505.13379v2 |
Authors (3): Gongfan Fang, Xinyin Ma, Xinchao Wang
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens,
理性语言模型(LLMs)具有扩展思维链推理的能力,在需要复杂逻辑推理的任务上表现出了非凡的成绩。然而,对所有查询应用精细推理往往导致计算效率极低,特别是当许多问题都承认直截了当的解决办法。这促使了一个未决问题:LLMs能够何时思考?为了回答这个问题,我们建议了一个可以学习的框架,使LLM能够根据任务的复杂性和模型的能力,在短形式和长形式推理之间作出适应性选择。无思想者在强化学习范式下接受培训,并使用两个控制牌子,即简明回答的
Article 29
Title@2025-06-26 (4): Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning | Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung | 双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v1 |
Authors (14): Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
虽然思维缓慢的大型语言模型(LLMs)的反映式推理(通常被称为“aha moment ” ) , 通常被称为“aha moment : ” : 其生成信息化评论和完善先前解决方案的能力仍然有限。 在本文中,我们引入了双检查器,这是一个原则性框架,旨在通过促进对以往解决方案进行明确的自我精细和迭接完善来提高慢思考的LLMs的推理能力。通过微调我们整理的1 730个自我临界实例,双检查器使长的CoT LLMs在推断中能够反复批评和完善其产出,直到他们根据自我生成的批评来评价其解决方案正确无误。 我们验证了双检查器在一套综合推理基准中的功效,表明反复式自我精度极大地增强了长期CoT LLMs的推理能力。 值得注意的是,我们的双检查器将挑战AIME基准的成绩从4.4%提高到18.2%,而最初的CoT LLMsLMs。这些结果突显了发展更可信和有效的方向。
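The inference-time behaviour described in the Double-Checker abstract, critiquing and refining an output until the model's own critique judges it correct, amounts to a bounded loop. The sketch below is an assumption-laden illustration rather than the paper's procedure: `solve`, `critique`, and `refine` are hypothetical callables standing in for model calls, the `max_rounds` cap is an added safeguard not mentioned in the abstract, and the toy integer task exists only to make the loop runnable.

```python
def iterative_refine(problem, solve, critique, refine, max_rounds=3):
    """Critique and refine a solution until the critique says 'correct'.

    Returns (solution, number_of_refinements_performed).
    """
    solution = solve(problem)
    for round_idx in range(max_rounds):
        verdict, feedback = critique(problem, solution)
        if verdict == "correct":
            return solution, round_idx
        solution = refine(problem, solution, feedback)
    return solution, max_rounds  # budget exhausted without a 'correct' verdict

# Toy stand-ins (hypothetical): the "problem" is a target integer.
def toy_solve(problem):
    return 0  # deliberately poor first attempt

def toy_critique(problem, solution):
    if solution == problem:
        return "correct", ""
    return "incorrect", "too low" if solution < problem else "too high"

def toy_refine(problem, solution, feedback):
    return solution + 1 if feedback == "too low" else solution - 1

print(iterative_refine(2, toy_solve, toy_critique, toy_refine, max_rounds=5))
```

Because the stopping signal comes from the model's own critique, the quality of that critique bounds the quality of the loop, which is consistent with the abstract's emphasis on fine-tuning for self-critique.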
Article 30
Title@2025-06-26 (4): HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Title: HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | HumanOmniV2: Vom Verständnis zur Omni-Modalen Vernunft mit Kontext | HumanOmniV2:从理解到以上下文为根据的全方位模式 2506.21277v1 |
Authors (10): Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.
随着多式联运大型语言模式的迅速演变,深入理解和解释人类意图的能力已成为一种关键能力,需要详细和深思熟虑的推理。在最近的研究中,加强学习(RL)显示在加强大语言模型的推理能力方面具有潜力。然而,与使RL适应多式联运数据和格式有关的挑战仍然基本上得不到解决。在本文件中,我们确定了现有多式联运推理模型的两个问题:全球背景理解不足和捷径问题。当模型错误地解释多式联运背景时,背景理解不足,从而导致错误的答案。当模型忽略多式联运投入的关键线索,直接在不考虑多式联运信息的情况下处理问题时,出现捷径问题。为了解决这些问题,我们强调模型必须合理理解大语言模型在多式联运投入中对全球背景背景的了解。这种全球背景理解可以有效地防止模型忽略关键的多式联运提示,确保透彻的推理过程。为确保对多式联运背景信息的准确解释,我们用一个大语言模式、格式和准确的奖赏来评判背景情况。此外,为了提高复杂的推理能力,我们利用LM系统来评估逻辑上的推理学,同时确定整个多式联运的推理学过程。
Article 31
Title@2025-06-26 (4): Cat and Mouse – Can Fake Text Generation Outpace Detector Systems?
Title: Cat and Mouse – Can Fake Text Generation Outpace Detector Systems? | Katze und Maus – Kann die Textgenerierung ausfallende Detektorsysteme fälschen? | 猫和老鼠 – – 假文本生成能否超越检测器系统? 2506.21274v1 |
Authors (2): Andrea McGlinchey, Peter J Barclay
Large language models can produce convincing “fake text” in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless “arms race”, we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models’ ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify “fake text” in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.
大型语言模型可以在学术写作、产品评论和政治新闻等领域产生令人信服的“ 假文本 ” 。 已经调查了许多方法来探测人工生成的文本。 虽然这似乎预示了无休止的“武器竞赛 ” , 但我们注意到,较新的LLMs使用越来越多的参数、培训数据和能源,而相对简单的分类人员则以少量的资源表现出了相当的检测准确度。 要解决模型击败探测器的能力是否因此可能达到一个高点的问题,我们研究统计分类人员以经典侦探小说的方式识别“假文本”的能力。 超过0.5版,我们发现Gemini展示了更大的生成欺骗性文本的能力,而GPT却没有这样做。 这表明,可靠地探测假文本即使对于越来越庞大的模型来说,也仍然可行,尽管新的模型结构可能提高它们的欺骗性。
Article 32
Title@2025-06-26 (4): A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Title: A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns | Ein Troublemaker mit ansteckenden Jailbreak macht Chaos in ehrlichen Städten | 一个麻烦制造者 与贪婪的监狱破碎 制造混乱 在诚实的城镇 2410.16155v2 |
Authors (6): Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
As large language models have developed, they have become widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) Non-complete graph structure, (2) Large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples contagious. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings. We encourage the community to pay attention to the security of multi-agent systems.
由于开发了大型语言模型,它们被广泛用作各个领域的代理物。代理物的一个关键组成部分是记忆,它储存着重要信息,但容易遭到越狱袭击。现有的研究主要侧重于单一剂袭击和共同记忆袭击。然而,现实世界的情景往往涉及独立记忆。在本文中,我们建议采用麻烦制造者在诚实镇制造混乱(TMCHT)任务,这是一个大型、多试剂、多地形文本攻击评价框架。TMCHT涉及一个攻击者代理人试图误导整个代理物社会。我们确定了多剂袭击中的两大挑战:(1)非完整的图表结构,(2)大规模系统。我们把这些挑战归因于一种我们称之为毒性消失的现象。为了解决这些问题,我们建议采用一种反转基因合成的干扰性监狱破碎(ARCJ)方法,这种方法优化检索功能,使中毒样品更容易检索,优化复制功能,使有毒样品具有传染能力。我们展示了在TMCHT袭击中的两种方法的优越性,即23.51%,18.95%,以及安全-93%,以及磁性层系统的顶层系统改进。
Article 33
Title@2025-06-26 (4): DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
Title: DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster | DiLoCoX: Ein kommunikationsarmer groß angelegter Ausbildungsrahmen für dezentralisierte Cluster | DILOCOX:权力下放小组的低通信大范围培训框架 2506.21263v1 |
Authors (9): Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich
The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.
基础模型的分布式培训,特别是大型语言模型(LLMS)的分布式培训需要高水平的通信。 因此,它高度依赖集中集集,具有快速和可靠的互连性。 我们能否在缓慢的网络上进行培训,从而在涉及超过1000亿参数的模型时释放分散式集群的力量? 在本文中,我们提议DiLoCoX,这是一个低通信的大规模分散式集群培训框架。它将管道平行与双优化政策、通信和地方培训的单步重叠和适应性渐进式压缩计划结合起来。这种组合大大改善了参数的规模和模型预培训的速度。我们有理由通过对趋同进行理论分析来证明通信和地方培训的一步骤重叠以及适应性梯度压缩计划的好处。我们很生动地证明DiloCoX能够在1Gbps网络上对107B基础模型进行预培训。 与Vanilla AllRedduce相比, DiLoCoX可以在分配培训中实现357x的首次速度,同时保持100亿分流化模型的成功降解。
Article 34
Title@2025-06-26 (4): Simulating Hard Attention Using Soft Attention
Title: Simulating Hard Attention Using Soft Attention | Simulation der harten Aufmerksamkeit mit weicher Aufmerksamkeit | 使用软关注模拟硬关注 2412.09925v2 |
Authors (4): Andy Yang, Lena Strobl, David Chiang, Dana Angluin
We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several subclasses of languages recognized by hard-attention transformers, which can be defined in variants of linear temporal logic. We demonstrate how soft-attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate general hard-attention transformers, using a temperature that depends on the minimum gap between the maximum attention scores and other attention scores.
我们研究的是使用软关注的变压器能够模拟硬关注的条件,即有效地将所有关注都集中在一组位置上。 首先,我们检查了几个由硬关注变压器承认的亚类语言,这些可按线性时间逻辑变量加以定义。我们演示了软关注变压器如何用无约束的定位嵌入或温度缩放来计算这些逻辑的公式。第二,我们演示了温度缩放如何让软最大变压器模拟普通硬关注变压器,使用温度取决于最大关注分数与其他关注分数之间的最小差距。
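The role of temperature scaling in the abstract above can be illustrated numerically: dividing attention scores by a small temperature pushes softmax weights toward a one-hot distribution on the maximum score. This is a minimal sketch of that general effect only, not the paper's construction; the scores and temperatures below are arbitrary values chosen for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

# The gap between the top score and the runner-up is 1.0.
scores = [2.0, 1.0, 0.5]

soft = softmax(scores, temperature=1.0)    # ordinary soft attention
sharp = softmax(scores, temperature=0.05)  # low temperature: nearly hard attention

print([round(w, 4) for w in soft])
print([round(w, 4) for w in sharp])
```

If every non-maximal score trails the maximum by at least a gap g, the weight on the maximum is at least 1 / (1 + (n - 1) * exp(-g / T)) for n positions and temperature T, which is why a smaller gap calls for a lower temperature to reach the same concentration.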
Article 35
Title@2025-06-26 (4): Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Title: Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents | Agent-RewardBench: Auf dem Weg zu einem einheitlichen Benchmark für Prämienmodellierung über Wahrnehmung, Planung und Sicherheit in multimodalen Real-World-Agenten | Agent-RewardBench:建立一个统一基准,用于在现实世界多式联运代理中建立跨认知、规划和安全概念、规划与安全的奖励模型 2506.21252v1 |
Authors (6): Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but it is unclear how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenges, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.
随着多式大语言模型(MLLM)的推进,多式联运代理商在网络导航和体现的智能等现实世界任务中表现出了希望,然而,由于缺乏外部反馈,这些代理商在自我纠正和概括方面遇到困难。一种有希望的方法是利用奖励模式作为外部反馈,但对于如何选择代理商的奖赏模式尚不清楚。因此,迫切需要建立一个针对代理商的奖赏法官席。为了应对这些挑战,我们提议一个旨在评价MLLM的奖赏模型能力的基准,即代理商-RewardBench。该基准的特点是三个关键特征:(1) 多维度和真实世界代理商情景评估。它涵盖7种情景的观念、规划和安全;(2) 逐步奖励评估。它允许在任务的各个步骤上对代理商的能力进行评估,对规划过程中的业绩提供更敏锐的看法;(3) 适当的困难和高质量的。我们仔细地从10种不同的模型中抽取样本,对任务挑战加以控制,并用人工核查以确保数据的完整性。实验表明,即使是州级的示范商培训模式都显示有有限的绩效。
Article 36
Title@2025-06-26 (4): Capturing Style in Author and Document Representation
Title: Capturing Style in Author and Document Representation | Stil in der Autor- und Dokumentdarstellung erfassen | 在作者和文件代表中获取样式 2407.13358v2 |
Authors (3): Enzo Terreau, Antoine Gourru, Julien Velcin
A wide range of Deep Natural Language Processing (NLP) models integrate continuous and low dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that they do not explicitly capture writing style, making them hardly applicable to literary data. We therefore propose a new architecture based on Variational Information Bottleneck (VIB) that learns embeddings for both authors and documents with a stylistic constraint. Our model fine-tunes a pre-trained document encoder. We stimulate the detection of writing style by adding predefined stylistic features making the representation axis interpretable with respect to writing style indicators. We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus and IMDb62, for which we show that it matches or outperforms strong/recent baselines in authorship attribution while capturing much more accurately the authors’ stylistic aspects.
深天然语言处理(NLP)的多种模式包括持续和低维的文字和文件表达方式。 令人惊讶的是,很少有模型为作者进行代表性学习。 这些表达方式可用于许多自然语言处理(NLP)任务,例如作者身份和分类,或建议系统。 对现有作品的强烈限制是,它们没有明确体现写作风格,因此很难适用于文学数据。 因此,我们提议基于动态信息瓶颈(VIB)的新架构,它学习以文体限制的方式嵌入作者和文件。 我们的模型微调是一个经过预先训练的文件编码器。 我们通过添加预定义的文体特征,使代表轴轴在写风格指标方面可以解释,来刺激对写作风格的风格的探测。 我们对三个数据集的方法进行了评估:从古滕贝格项目、博客作者Corpus和IMDb62中提取的文学文集,我们为此表明,它与作者归属的强/中基准相匹配或优于后者,同时更准确地捕捉到作者的文体学方面。
Article 37
Title@2025-06-26 (4): Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Title: Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval | Automatische Termextraktion mit großen Sprachmodellen durch syntactic Retrieval verbessern | 通过同步检索增强使用大语言模型的自动定期抽取功能 2506.21222v1 |
Authors (5): Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
虽然大型语言模型(LLMS)大大推进了各种NLP任务,但它们对ATE的潜力却很少得到研究。我们建议了一个基于检索的提示战略,在微小的场景中,根据\emph{syntalogy}选择演示,而不是语义相似性。这种综合检索方法是域-敏感性的,为获取术语界限提供了更可靠的指导。我们评估了在域内和跨域环境中的方法,分析了查询句及其所收集实例之间的词汇重叠如何影响业绩。对三个专门ATE基准的实验表明,合成检索可以改进F1-核心。这些结果突出表明了LLMS适应术语扩展任务时使用合成提示的重要性。
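Selecting few-shot demonstrations by syntactic rather than semantic similarity, as the abstract above proposes, can be sketched by comparing part-of-speech patterns instead of words. The abstract does not say how syntactic similarity is measured, so everything below is an assumption made for illustration: the Jaccard overlap of POS-tag bigrams, the function names, and the hand-written tag sequences (a real system would obtain tags or parses from a tagger).

```python
def pos_bigrams(tags):
    """Set of adjacent POS-tag pairs in a sentence."""
    return set(zip(tags, tags[1:]))

def syntactic_similarity(tags_a, tags_b):
    """Jaccard overlap of POS bigrams (an assumed similarity measure)."""
    grams_a, grams_b = pos_bigrams(tags_a), pos_bigrams(tags_b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def select_demos(query_tags, pool, k=1):
    """Return the k pool examples whose tag pattern best matches the query."""
    ranked = sorted(
        pool,
        key=lambda ex: syntactic_similarity(query_tags, ex["tags"]),
        reverse=True,
    )
    return ranked[:k]

# Hand-tagged toy demonstration pool (tags are illustrative, not tagger output).
pool = [
    {"text": "The neural network converges quickly",
     "tags": ["DET", "ADJ", "NOUN", "VERB", "ADV"]},
    {"text": "Does dropout help ?",
     "tags": ["AUX", "NOUN", "VERB", "PUNCT"]},
]

# Query: "The convolutional layer learns slowly"
query_tags = ["DET", "ADJ", "NOUN", "VERB", "ADV"]
best = select_demos(query_tags, pool, k=1)
print(best[0]["text"])
```

The query shares no content words with the selected demonstration, only its tag pattern, which is the sense in which a retrieval signal of this kind can be domain-agnostic.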
Article 38
Title@2025-06-26 (4): Complexity-aware fine-tuning
Title: Complexity-aware fine-tuning | Komplexitätsbewusste Feinabstimmung | 复杂度认知微调 2506.21220v1 |
Authors (5): Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3B$) we split the training data into complexity categories by single-token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average accuracy) and provides performance comparable to distillation while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.
通用大语言模型(LLMS)经常通过监督的微调(SFT)进行微调,以提高特定领域的绩效,通过提炼大型模型的思维链,以大量昂贵的电话和大量数据为代价,可以取得更好的结果。我们提出了一个高效微调的新蓝图,该蓝图只对通过英特罗普查明的复杂数据进行推理。具体地说,在两个小型开放模型($\approx 3B$)中,我们将培训数据分成复杂的类别,通过SFT和蒸馏,将一个单一的象征性答录(ROC ACUC 0.73美元)、微调大型语言模型(LLMS)进行精细调,显示我们的输油管大大超过标准SFT方法(0.55美元比0.43美元平均精度),并在使用62美元减去数据(平均精确度(0.55美元)的同时提供可比较的蒸馏性能。我们公布我们的代码和数据,以便利在这方面进行进一步的研究。
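Splitting training data by single-token answer entropy, as described in the abstract above, can be sketched as computing the Shannon entropy of the model's probability distribution over answer options and bucketing examples by thresholds. The three bucket names and the threshold values below are hypothetical choices for illustration; the abstract states neither the number of categories nor the cut-offs, and the probability vectors are invented.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def complexity_bucket(probs, low=0.5, high=1.0):
    """Assign a complexity category from answer entropy.

    The thresholds `low` and `high` are illustrative assumptions.
    """
    entropy = token_entropy(probs)
    if entropy < low:
        return "easy"
    if entropy < high:
        return "medium"
    return "hard"

# Invented answer-token distributions over four multiple-choice options.
confident = [0.97, 0.01, 0.01, 0.01]
hesitant = [0.7, 0.2, 0.1, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]

for name, probs in [("confident", confident), ("hesitant", hesitant), ("uniform", uniform)]:
    print(name, round(token_entropy(probs), 3), complexity_bucket(probs))
```

Under a split of this kind, low-entropy examples would go through plain SFT while high-entropy ones would receive the more expensive reasoning-based treatment, which matches the abstract's idea of reserving reasoning for complex data.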
Article 39
Title@2025-06-26 (4): Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
Title: Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? | Kausale Vernunft in großen Sprachmodellen enthüllen: Realität oder Mirage? | 大语言模型中未解的因果理由:现实还是幻影? 2506.21215v1 |
Authors (8): Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs’ causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs’ causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.
原因推理能力对于推动大型语言模型(LLMS)获得强大的人工智能至关重要。尽管多功能LMS在理解背景因果关系和提供符合因果关系法则的响应方面表现出了能力,但是仍然不清楚它们是否具有与人类相似的真正因果关系推理。然而,目前的证据表明相反。具体地说,LMS只能进行浅(一级)因果推理,主要归因于其参数所包含的因果知识,但它们缺乏真正人性(二级)因果推理的能力。为了支持这一假设,从方法上看,我们进入基于变压LMS的自动反反向推理机制,揭示它并非必然的因果关系。我们引入了一个称为Causal Pro-2024的新的因果推理基准,其体对所研究的LMSMs来说是新鲜而几乎看不见的。LMS与先前的基准相比,其表现显著下降,表明它们主要参与一级(二级)因果推理学。为了弥补以2级为基础的因果关系推理学差距,我们从以下事实中汲取灵感,即人类推理学通常通过一般的了解和预期性推理推理学水平,将GLMSLMS的推理学能力转化为推理学。
Article 40
Title@2025-06-26 (4): TAPS: Tool-Augmented Personalisation via Structured Tagging
Title: TAPS: Tool-Augmented Personalisation via Structured Tagging | TAPS: Tool-Augmented Personalisierung durch strukturiertes Tagging | TAPS: 通过结构拖网提高工具的个性化 2506.20409v2 |
Authors (2): Ekaterina Taktasheva, Jeff Dalton
Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving a new state of the art for open-source models on the NLSI task.
最近,在工具强化的大型语言模型方面有所进展,使得它们能够与外部工具互动,提高它们执行复杂用户任务的能力,但是,现有办法忽略了个性化在指导工具使用方面的作用。这项工作调查了如何将用户的偏好有效地纳入面向目标的对话媒介。通过广泛分析,我们查明了LLMs个人化工具使用能力方面的主要弱点。为此,我们引入了TAPS,这是一个通过利用结构化标记工具和基于不确定性的工具探测器,加强个性化工具使用的新解决方案。TAPS大大提高了LMs在应用用户偏好、实现新的开放源模型方面对 NLSI 任务的最新了解。
Article 41
Title@2025-06-26 (4): LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
Title: LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey | LLM-basierte human-agente Kooperations- und Interaktionssysteme: Eine Umfrage | 以LLM为基础的人类-机构协作和互动系统:调查 2505.00753v4 |
Authors (15): Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
大型语言模型(LLMS)最近的进展引起了人们对建立完全自主的代理商的兴趣,然而,完全自主的LLM公司代理商仍面临重大挑战,包括由于幻觉、处理复杂任务的困难以及巨大的安全和道德风险,这些都限制了其在现实世界应用中的可行性和可信度。为了克服这些限制,基于LLM公司的人类代理系统(LLM-HAS)将人提供的信息、反馈或控制纳入代理系统,以提高系统性能、可靠性和安全性。这些人类代理商合作系统使人类和LLM公司代理商能够利用其互补优势进行有效的合作。本文件提供了LLM-HAS的首次全面和结构化调查。该文件澄清了基本概念,系统地介绍了形成这些系统的核心组成部分,包括环境与特征分析、人类反馈、互动类型、协同和通信,探索新出现的应用,并讨论了人类-AI合作带来的独特挑战和机遇。通过整合现有知识并提供结构化的概览,我们的目标是促进在这一迅速演变的跨学科领域开展进一步的研究和创新。文件清单和资源可在 https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems 查阅。
Article 42
Title@2025-06-26 (4): Prompt-Guided Turn-Taking Prediction
Title: Prompt-Guided Turn-Taking Prediction | Prompt-geführte Turn-Taking-Vorhersage | 即时指导的回转预测 2506.21191v1 |
Authors (7): Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer”, adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
转换预测模型是语音对话系统和谈话机器人中必不可少的组成部分。 最近的方法利用变压器结构来连续和实时预测语音活动。 在本研究中,我们提出了一个新的模型,使变压器预测能够通过文本提示进行动态控制。 这种方法通过“ 加速” 或“ 加速” 等指示,动态地适应对话伙伴和背景,允许直觉和明确控制。 拟议的模型基于变压器声音活动预测模型,将文字快速嵌入频道变压器和跨通道变压器。 我们用超过950小时的人类语音对话数据评估了我们的方法的可行性。 由于现有数据集没有关于拟议方法的文字提示数据,我们使用一个大语言模型(LLM)来生成合成快速的句子。 实验结果显示,拟议的模型根据文字提示提高了预测的准确性,并有效地改变了变速计时行为。
Article 43
Title@2025-06-26 (4): Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks | Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks | 维持MTEB:实现长期使用和可复制嵌入基准 2506.21182v1 |
Authors (5): Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB’s continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results’ generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
大量文本嵌入基准(MTEB)已成为文本嵌入模型的标准评价平台。虽然以前的工作已经确立了核心基准方法,但本文件侧重于确保MTEB继续可复制和可推广的工程方面,我们介绍了我们维持强有力的连续整合管道的方法,这些管道验证数据集的完整性、自动测试执行以及评估基准结果的一般性。我们详细介绍了共同加强可复制性和可用性的设计选择。此外,我们讨论了处理社区贡献和以新的任务和数据集扩展基准的战略。这些工程做法有助于扩大MTEB的规模,使之更加全面,同时保持质量,并最终与外地相关。我们的经验为基准维护者提供了宝贵的洞见,他们在确保机器学习评估框架的可复制性和可用性方面面临类似挑战。MTEB存放于https://github.com/embeddings-benchmark/mteb。
Article 44
Title@2025-06-26 (4): Compressed and Smooth Latent Space for Text Diffusion Modeling
Title: Compressed and Smooth Latent Space for Text Diffusion Modeling | Komprimierter und glatter Latent-Raum für Text-Diffusionsmodellierung | 压缩和平滑的文本传播中缓流空间模型模型 2506.21170v1 |
Authors (5): Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.
自动递减语言模型主宰现代文本生成,但其相继性质具有根本性的局限性:解码过程缓慢,维持全球一致性仍然具有挑战性。传播模式通过平行生成和灵活控制提供了有希望的替代方法;然而,在文本生成中的应用却由于象征性代表的高度性而受到阻碍。我们引入了宇宙,这是完全在压缩、光滑的、专门用于扩散的潜伏空间中运作的文本生成的一种新颖方法。我们学习了这一空间,它使用的是同时培训的自动编码器,用于象征性的重整和与来自预先训练的语言编码器的冷冻激活相匹配,提供强有力的语义定位和使有效的扰动增强成为可能。我们很自然地证明,文字表达可以压缩8美元,同时保持与象征性扩散模式相近的生成质量。此外,增加潜在序列长度使得宇宙能够超过扩散基础和自动递增的基线。我们评估了四种不同基因化任务的宇宙,包括故事生成、问题生成、合成和解毒化,并将其与各种基因化范式进行比较。在2年相比,宇宙的生成质量更高或更高。
Article 45
Title@2025-06-26 (4): CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Title: CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle | CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v4 |
Authors (9): Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.
确保大语言模型符合主流人类价值观和道德规范对于AI的安全与可持续发展至关重要。当前价值评价和协调受到西方文化偏见和不完全的国内框架的制约,而后者依赖非本地规则;此外,缺乏可扩展的、由规则驱动的情景生成方法,使得不同文化背景的评价成本高昂且不足。为应对这些挑战,我们提议了一个基于中国核心价值观的等级价值框架,包括三个主要方面、12个核心价值和50种衍生价值。基于这一框架,我们建立了一个大型中国价值公司(CVC),其中包含250 000多项价值规则,通过人类注解得到加强和扩大。实验结果显示,CVC导的情景在价值界限和内容多样性方面优于直接生成的情景。在六个敏感主题(例如,套装、自杀)的评价中,7个主流LMS偏爱C在70.5%以上的案例中生成的选项。5个中国人类注释公司显示87.5%与C一致,确认了其普遍性、文化相关性,以及与中国价值观的强有力调整。此外,我们在SEVC/AI-ALDS标准中,我们建立了现有的404级标准。
Article 46
Title@2025-06-26 (4): Do Large Language Models Advocate for Inferentialism?
Title: Do Large Language Models Advocate for Inferentialism? | Befürworten große Sprachmodelle den Inferentialismus? | 大语言模型是否为推定主义辩护? 2412.14501v2 |
Authors (2): Yuzuki Arai, Sho Tsugawa
The emergence of large language models (LLMs) such as ChatGPT and Claude presents new challenges for philosophy of language, particularly regarding the nature of linguistic meaning and representation. While LLMs have traditionally been understood through distributional semantics, this paper explores Robert Brandom’s inferential semantics as an alternative foundational framework for understanding these systems. We examine how key features of inferential semantics – including its anti-representationalist stance, logical expressivism, and quasi-compositional approach – align with the architectural and functional characteristics of Transformer-based LLMs. Through analysis of the ISA (Inference, Substitution, Anaphora) approach, we demonstrate that LLMs exhibit fundamentally anti-representationalist properties in their processing of language. We further develop a consensus theory of truth appropriate for LLMs, grounded in their interactive and normative dimensions through mechanisms like RLHF. While acknowledging significant tensions between inferentialism’s philosophical commitments and LLMs’ sub-symbolic processing, this paper argues that inferential semantics provides valuable insights into how LLMs generate meaning without reference to external world representations. Our analysis suggests that LLMs may challenge traditional assumptions in philosophy of language, including strict compositionality and semantic externalism, though further empirical investigation is needed to fully substantiate these theoretical claims.
大型语言模型(LLMs)的出现,如ChatGPT和Claude等,对语言哲学提出了新的挑战,特别是语言含义和代表性的性质。LLMs传统上是通过分布语义学理解的,本文探讨了Robert Brandom的推断性语义学,作为理解这些体系的替代基础框架。我们研究了推论性语义学的关键特征,包括其反代表性立场、逻辑直观主义和准组合方法 – – 与以变异者为基础的LMs的建筑和功能特征相一致。通过分析ISA(推论、替代、Anaphora)方法,我们表明LMs在语言处理过程中表现出根本上反代表性的特性。我们通过RLHF等机制进一步制定适合LMs的共识性真理理论。我们承认推论的哲学承诺和LMs的亚理学处理之间存在严重的紧张关系,但本文指出,推论性语义对LMs如何在严格意义上解释语言中产生意义,而无需外部解释。我们的分析表明,这些理论论论理学论的理论论理学系可能进一步解释,包括外部分析。
Article 47
Title@2025-06-26 (4): Learning Evaluation Models from Large Language Models for Sequence Generation
Title: Learning Evaluation Models from Large Language Models for Sequence Generation | Learning Evaluation Models aus großen Sprachmodellen für die Sequenzgenerierung | 序列生成大语言模式学习评价模式 2308.04386v3 |
Authors (9): Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, Jingbo Zhu
Automatic evaluation of sequence generation, traditionally reliant on metrics like BLEU and ROUGE, often fails to capture the semantic accuracy of generated text sequences due to their emphasis on n-gram overlap. A promising solution to this problem is to develop model-based metrics, such as BLEURT and COMET. However, these approaches are typically hindered by the scarcity of labeled evaluation data, which is necessary to train the evaluation models. In this work, we build upon this challenge by proposing the Customized Sequence Evaluation Metric (CSEM), a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development, thereby eliminating the need for human-labeled data. Additionally, we expand the scope of CSEM to support various evaluation types, including single-aspect, multi-aspect, reference-free, and reference-based evaluations, enabling the customization of metrics to suit diverse real-world scenarios. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data. Further experiments in reinforcement learning and reranking show that metrics developed through CSEM outperform traditional evaluation metrics, leading to substantial improvements in sequence quality as evaluated by both commonly used metrics and ChatGPT.
对序列生成的自动评价传统上依赖BLEU和ROUGE等标准,这种自动评价传统上依赖BLEU和ROUGE等标准,由于强调正克重叠,往往无法捕捉生成文本序列的语义准确性,因为强调正克重叠。这个问题的一个有希望的解决办法是开发BLEURT和CWET等基于模型的衡量标准。然而,这些方法通常受到标签化评价数据稀缺的阻碍,而这些数据是培训评价模型所必需的。在这项工作中,我们提出定制化的序列评价计量标准(CSEM),这是一个三阶段评价模式培训方法,利用大语言模型生成标签化数据,用于模型化衡量开发,从而消除对人标数据的需求。此外,我们扩大CEMEM的范围,以支持各种评价类型,包括单一目标、多层、无参考和基于参考的评价,使指标的定制能够适应不同的现实世界情景。Summeval基准的实验结果表明,CSEMEM可以有效地培训一个评价模型,而没有人类标签化的数据。在加强学习和重新排序方面进行实验,从而通过常规标准质量评估,通过共同评估,在标准中进行实质性的进度上进行实质性的改进。
Article 48
Title@2025-06-26 (4): Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models
Title: Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models | Progtuning: Progressives Fine-Tuning-Framework für transformerbasierte Sprachmodelle | 改进:基于变换器的语文模式逐步微调框架 2506.21119v1 |
Authors (5): Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu
Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, a novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated Transformer blocks based on their contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. It also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.
在下游任务中,微调是利用以变异器为基础的语言模型的一个很有希望的技术。随着模型规模的继续增长,更新所有模型参数的成本越来越高。参数效率微调方法通过有选择地更新一小部分参数来有效解决这一问题。然而,微调和大多数现有的有参数效率微调方法需要更新与初始规模相同的参数数量,忽视不同变异器区块的不平等贡献,导致极低效率的计算资源分配。在本文中,我们提议了新颖的微调框架,加上以变异器为基础的语言模型的渐进学习。具体地说,精调逐步减少以贡献为基础的更新变异器区块的数量。显著的是,在保持竞争力的同时,优化资源分配,将更新参数的数量减少大约25。它还表现出高的适应性,并展示了各种适应情景的出色表现。
Article 49
Title@2025-06-26 (4): Learning to Skip the Middle Layers of Transformers
Title: Learning to Skip the Middle Layers of Transformers | Lernen, die mittleren Schichten der Transformer zu überspringen | 学习跳过变换器的中层 2506.21103v1 |
Authors (2): Tim Lawson, Laurence Aitchison
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a ‘sandwich’ or ‘perilayernorm’ scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for ‘simpler’ tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
有条件计算是提高变异器效率的流行战略。 现有方法通常针对单个模块( 如专家层混合) 或互不相干。 然而, 可解释性研究显示, 中变异器的中层表现出更大的冗余, 并且早期的层汇总信息会形成象征位置。 在这些洞察力的指导下, 我们提出了一个新结构, 动态地跳过中外的多层。 特别是, 一个学习的格子机制决定是否绕过基于输入的中央区块对称范围, 以及一个门式注意机制阻止随后的标志进入暂记位置。 剩余规范由“sandwich” 或“perilayernorm” 方案控制, 门式偏差导致适应性规范损失。 我们的目的是减少对“ 质性” 符号的折数要求, 并可能促进新兴的多层代表等级等级。 但是, 在所调查的尺度上, 我们的方法并没有在验证跨层和估计的FLOPs之间的折算法上实现改进。 我们的代码发布于 https://github.com/tim-lawson/skip-middle。
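The idea of bypassing a symmetric span of central blocks, as described in the abstract above, can be illustrated with a small sketch. This is not the authors' implementation: the function names `middle_span` and `forward_with_skip` are hypothetical, the skip width is passed in directly instead of being produced by a learned gate, and the blocks are plain Python callables standing in for Transformer blocks.

```python
def middle_span(num_layers, half_width):
    """Return indices of the symmetric span of blocks around the stack's centre."""
    if num_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of blocks")
    if not 0 <= half_width <= num_layers // 2:
        raise ValueError("half_width must lie between 0 and num_layers // 2")
    mid = num_layers // 2
    return list(range(mid - half_width, mid + half_width))


def forward_with_skip(x, blocks, half_width):
    """Apply every block except those in the skipped central span.

    In the paper a learned gate decides, per input, how much to skip;
    here that decision is assumed to be given as half_width.
    """
    skipped = set(middle_span(len(blocks), half_width))
    for index, block in enumerate(blocks):
        if index in skipped:
            continue  # the representation passes through unchanged
        x = block(x)
    return x
```

With twelve blocks and `half_width=2`, blocks 4 through 7 are bypassed, so the skipped region grows from the middle outward as `half_width` increases.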
Article 50
Title@2025-06-26 (4): ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
Title: ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry | ComRAG: Retrieval-Augmented Generation mit dynamischen Vector Stores für Echtzeit-Community-Frageantworten in der Industrie | ComRAG: 利用动态矢量储存库实时社区工业问题回答实时社区问题的回收-原始一代 2506.21098v1 |
Authors (8): Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.
社区问题解答平台(CQA)可被视为社区的重要知识基础,但有效地利用历史互动和实时领域知识仍是一项挑战。现有方法往往没有充分利用外部知识,没有纳入动态的历史QA环境,或缺乏适合工业部署的记忆机制。我们提议ComRAG,即实时工业问题解答平台的检索强化生成框架,通过为检索、生成和高效存储设计的基于机器人的记忆机制,将静态知识与动态的历史质量评估对口结合起来。对三个工业性质量解析数据集进行了评估,ComRAG始终超越了所有基线实现率达到25.9%的矢量相似性改进率,将延迟率从8.7%降至23.3%,并将迭代面积增长从20.23%降至2.06%。
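To make the phrase "centroid-based memory mechanism" in the abstract above more concrete, here is a minimal sketch of one way such a memory could work. It is an assumption for illustration only, not ComRAG's actual design: the class name `CentroidMemory`, the cosine-similarity threshold, and the running-mean centroid update are all hypothetical choices.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidMemory:
    """Groups QA pairs into clusters, each summarised by a mean embedding."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off for joining a cluster
        self.centroids = []         # one mean vector per cluster
        self.members = []           # one list of (question, answer) pairs per cluster

    def _nearest(self, vec):
        best, best_sim = None, -1.0
        for index, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = index, sim
        return best, best_sim

    def add(self, vec, qa_pair):
        """Attach qa_pair to the closest cluster, or open a new one."""
        best, best_sim = self._nearest(vec)
        if best is not None and best_sim >= self.threshold:
            count = len(self.members[best])
            # Running mean keeps the centroid equal to the average of its members.
            self.centroids[best] = [
                (c * count + v) / (count + 1)
                for c, v in zip(self.centroids[best], vec)
            ]
            self.members[best].append(qa_pair)
            return best
        self.centroids.append(list(vec))
        self.members.append([qa_pair])
        return len(self.centroids) - 1

    def retrieve(self, vec):
        """Return the QA pairs of the cluster closest to vec."""
        best, _ = self._nearest(vec)
        return [] if best is None else self.members[best]
```

Comparing a query against a handful of centroids instead of every stored pair is one plausible reason a design like this could reduce retrieval latency, though the abstract does not spell out the mechanism.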
Article 51
Title@2025-06-26 (4): HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Title: HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics | HERMES: zeitlich-zusammenhängendes lang-für-M Verständnis mit Episoden und Semantik | HERMES: 与分数和语义学的理解 2408.17443v4 |
Authors (6): Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics ReTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43% and memory usage by 46%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
长式视频理解提出了超越传统的短视分析方法的独特挑战,特别是在捕捉长距离依赖性、高效处理冗余信息以及提取高层次语义概念方面。为了应对这些挑战,我们提出了更准确地反映人类认知的新型方法。本文介绍了HERMES:与Episodes和语义有关的时间相近的长期理解:与Episodes和语义相关的时间相近的长期理解,由两个多功能模块组成,可以加强现有视频语言模式或作为独立系统运行。我们的Episodic Compressor(ECO)高效综合了微观到半宏观层面的展示,减少了计算间接费用,同时保持了时间依赖性。我们的Sememistrict Retraver(SeTR)通过侧重于更广泛的背景,大幅度降低特征的维度,同时保存相关的宏观信息。我们证明,这些模块可以与现有的SOTA模型紧密结合,不断改进其性能,同时将误差拉度降低到43%,记忆使用46%。作为独立系统,在多摄像化的系统下,HEMMES完全地实现了对州级的完整理解。
Article 52
Title@2025-06-26 (4): DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
Title: DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | DALR: Dual Level Alignment Learning für multimodales Sentence Representative Learning | DALR: 双级统一学习促进多式判决代表制学习 2506.21096v1 |
Authors (6): Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
以往的多式联运判决代表制学习方法取得了令人印象深刻的成绩,然而,大多数方法侧重于将图像和文本在粗略水平上加以统一,面临两个关键挑战:跨现代的不匹配偏差和内部的语义差异,这大大削弱了服刑代表制的质量。为了应对这些挑战,我们建议DALR(多模式判决代表制的双级统一学习)。关于跨模式的调整,我们提议了一个一致性学习模块,以软化负面样本,并利用与辅助任务的语义相似性实现细微的跨模式协调。此外,我们主张,判决关系超越二进制正反标签,展示一个更为复杂的排名结构。为了更好地捕捉这些关系,提高代表性质量,我们将排行与全球模式内部协调学习相结合。关于语义相似性(STS)和转让(TR)任务的全面实验证实了我们的方法的有效性,并始终表明它优于最新基线。
Article 53
Title@2025-06-26 (4): Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph
Title: Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph | Verbesserung der LLM-Tool-Nutzung mit hochwertigen Instruktionsdaten aus Wissensgrafik | 利用来自知识图的高质量教学数据加强LLM工具的使用 2506.21071v1 |
Authors (10): Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. In this paper, we propose a new method that uses knowledge graphs to generate high-quality instruction data for LLMs. Knowledge graphs are manually curated datasets rich in semantic information. We begin by extracting various query pathways from a given knowledge graph, which are transformed into a broad spectrum of user queries. We then translate the relationships between entities into actionable tools and parse the pathways of each query into detailed solution steps, thereby creating high-quality instruction data. Our experiments show that fine-tuning on just a small sample of this synthetic data can significantly improve the tool utilization and overall capabilities of LLMs.
教授大型语言模型(LLMS)使用工具对于提高其解决问题的能力和扩大应用至关重要。然而,有效使用工具具有挑战性,因为它要求深入了解工具功能和用户意图。以前的方法主要依靠LLMS生成教学数据,但这些数据的质量往往不足。在本文中,我们建议采用新方法,使用知识图为LLMS生成高质量的教学数据。知识图是人工整理的数据集,其语义信息丰富。我们首先从特定知识图中提取各种查询路径,这些路径被转换成广泛的用户查询。我们然后将各实体之间的关系转化为可操作的工具,将每个查询路径分析成详细的解决方案步骤,从而创建高质量的教学数据。我们的实验表明,只对少量的合成数据进行微调,就可以大大改进LMS的工具利用和总体能力。
Article 54
Title@2025-06-26 (4): MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection
Title: MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection | MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung | MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v1 |
Authors (7): Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang
In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.
在当代社交媒体领域,自动定位探测对于探索观点至关重要,因为它综合并审查用户对有争议的议题的观点,以发现当前趋势和情绪;传统定位检测研究往往针对个别案例,从而限制其模拟在真实社交媒体情景中典型的多党讨论的能力;这一缺陷主要源于缺乏真正捕捉社交媒体互动动态的数据集,从而阻碍对对话姿态检测的进展;在本文件中,我们引入了MT2-CSD,这是一个用于多目标、多点对口姿态检测的综合数据集;根据我们的知识,MT2-CSD是可用于此目的的最大数据集,包括24,457个附加说明的实例,展示了最大对话深度,从而提出了对定位探测的新挑战;为应对这些挑战,我们提议采用大型语言模型增强反向关系关注网络(LLLMM-CRAN),利用LMM的推理能力来提高对对话的理解;我们进行广泛的实验,以评价LM-CRAN在MT2-CSD数据集上的功效。实验结果表明,LMM-CRAN的探测定位模型大大超越了高基准。
Article 55
Title@2025-06-26 (4): Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach
Title: Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach | Bewertung der Diagnoseleistung bei seltenen Krankheiten bei Symptomen: Ein synthetischer Vignette-Simulationsansatz | 评价症状检查器中的罕见疾病诊断性能: 合成Vignette模拟方法 2506.19750v3 |
Authors (3): Takashi Nishibayashi, Seiji Kanazawa, Kumpei Yamada
Symptom Checkers (SCs) provide medical information tailored to user symptoms. A critical challenge in SC development is preventing unexpected performance degradation for individual diseases, especially rare diseases, when updating algorithms. This risk stems from the lack of practical pre-deployment evaluation methods. For rare diseases, obtaining sufficient evaluation data from user feedback is difficult. To evaluate the impact of algorithm updates on the diagnostic performance for individual rare diseases before deployment, this study proposes and validates a novel Synthetic Vignette Simulation Approach. This approach aims to enable this essential evaluation efficiently and at a low cost. To estimate the impact of algorithm updates, we generated synthetic vignettes from disease-phenotype annotations in the Human Phenotype Ontology (HPO), a publicly available knowledge base for rare diseases curated by experts. Using these vignettes, we simulated SC interviews to predict changes in diagnostic performance. The effectiveness of this approach was validated retrospectively by comparing the predicted changes with actual performance metrics using the R-squared ($R^2$) coefficient. Our experiment, covering eight past algorithm updates for rare diseases, showed that the proposed method accurately predicted performance changes for diseases with phenotype frequency information in HPO ($n = 5$). For these updates, we found a strong correlation for both Recall@8 change ($R^2 = 0.83$, $p = 0.031$) and Precision@8 change ($R^2 = 0.78$, $p = 0.047$). Our proposed method enables the pre-deployment evaluation of SC algorithm changes for individual rare diseases. This evaluation is based on a publicly available medical knowledge database created by experts, ensuring transparency and explainability for stakeholders. Additionally, SC developers can efficiently improve diagnostic performance at a low cost.
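症状检查器(Symptom Checkers, SCs)提供针对用户症状的医疗信息。SC 开发中的一个关键挑战是,在更新算法时防止个别疾病(尤其是罕见疾病)的性能意外下降。这一风险源于缺乏实用的部署前评估方法。对于罕见疾病,很难从用户反馈中获得足够的评估数据。为了在部署前评估算法更新对个别罕见疾病诊断性能的影响,本研究提出并验证了一种新的合成病例(Synthetic Vignette)模拟方法。该方法旨在以高效、低成本的方式实现这一必要评估。为估计算法更新的影响,我们根据人类表型本体(Human Phenotype Ontology, HPO)中的疾病-表型注释生成合成病例;HPO 是由专家整理的公开罕见疾病知识库。利用这些病例,我们模拟 SC 问诊以预测诊断性能的变化。我们以回顾性方式,用决定系数($R^2$)比较预测变化与实际性能指标,从而验证该方法的有效性。实验涵盖过去八次针对罕见疾病的算法更新,结果表明,对于在 HPO 中具有表型频率信息的疾病(n=5),所提方法能够准确预测性能变化。对于这些更新,Recall@8 变化($R^2$ = 0.83,$p$ = 0.031)和 Precision@8 变化($R^2$ = 0.78,$p$ = 0.047)均呈强相关。所提方法使得在部署前评估 SC 算法变更对个别罕见疾病的影响成为可能。该评估基于专家创建的公开医学知识库,可为利益相关方确保透明度和可解释性。此外,SC 开发者能够以低成本高效地提升诊断性能。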
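The validation step described in this abstract (comparing predicted and observed metric changes with $R^2$) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes $R^2$ is the coefficient of determination of a simple linear fit, which equals the squared Pearson correlation, and the numbers in the usage lines are made up for illustration only.

```python
def recall_at_k(ranked_diseases, true_disease, k=8):
    """Return 1.0 if the true disease is among the top-k candidates, else 0.0."""
    return 1.0 if true_disease in ranked_diseases[:k] else 0.0


def mean_recall_at_k(results, k=8):
    """Average Recall@k over (ranked_diseases, true_disease) pairs, e.g. one pair per vignette."""
    return sum(recall_at_k(ranked, truth, k) for ranked, truth in results) / len(results)


def r_squared(predicted, observed):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return (cov * cov) / (var_p * var_o)


# Illustrative (made-up) Recall@8 changes for five algorithm updates.
predicted_change = [0.10, -0.05, 0.20, 0.00, 0.15]
observed_change = [0.08, -0.02, 0.25, 0.01, 0.10]
fit_quality = r_squared(predicted_change, observed_change)
```

A value of `fit_quality` close to 1 would mean the simulated changes track the observed ones closely; Precision@8 could be compared in the same way.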
Article 56
Title@2025-06-26 (4): Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
Title: Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs | Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs | 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v3 |
Authors (8): Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new “search-and-refine-during-think” paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
大型语言模型展示了令人印象深刻的推理能力,但本质上受到知识库的限制。 检索强化推理通过允许LLMs查询外部资源而减轻了这一限制,但现有方法往往检索不相干或吵闹的信息,从而妨碍准确推理。 在本文中,我们提议AutoRefine,这是一个强化的学习后培训框架,采用新的“搜索和检索”思维模式。 AutoRefine在连续的搜索电话之间引入了明确的知识改进步骤,使模型能够在生成答案之前进行迭接过滤、蒸馏和组织证据。此外,我们纳入了量身定制的检索特定奖项,同时采用群体相对的政策优化来回答正确性奖项。关于单手和多手QA基准的实验表明,AutRefine大大超越了现有方法,特别是在复杂、多手的推理假设中。详细分析表明,AutRefine公司在生成频繁、更高质量的搜索和合成证据之前都存在问题。
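As a rough illustration of how an answer-correctness reward and a retrieval-specific reward might feed group relative policy optimization, the sketch below standardizes rewards within a group of sampled rollouts. The additive combination and the `retrieval_weight` parameter are assumptions made for this example; the abstract does not say how AutoRefine combines the two signals.

```python
from statistics import mean, pstdev


def total_reward(answer_correct, evidence_retained, retrieval_weight=0.5):
    """Combine an answer reward with a retrieval reward (the weighting is hypothetical)."""
    return float(answer_correct) + retrieval_weight * float(evidence_retained)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question: (answer correct?, evidence kept after refinement?)
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [total_reward(correct, kept) for correct, kept in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages, so the policy update favors them without needing a separate value model.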
Article 57
Title@2025-06-26 (4): A Semi-supervised Scalable Unified Framework for E-commerce Query Classification
Title: A Semi-supervised Scalable Unified Framework for E-commerce Query Classification | Ein halbüberwachtes skalierbares Unified Framework für die E-Commerce Query Classification | 半监督的电子商务查询分类可扩展统一框架 2506.21049v1 |
Authors (8): Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law
Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users’ posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.
查询分类,包括意图和类别预测等多个子任务,对于电子商务应用至关重要。电子商务查询通常很短,而且缺乏背景,无法使用标签之间的信息,因此无法使用标签之间的信息,造成建模前的信息不充分。大多数现有的工业查询分类方法都依靠用户的事后点击行为来建立培训样本,从而形成一个马修恶性循环。此外,查询分类的子任务缺乏统一框架,导致演算优化效率低。在本文中,我们提议了一个新的半监督可缩放统一框架(SSUF),包含多个强化模块,以统一查询分类任务。知识强化模块利用世界知识加强查询表达和解决查询信息不足的问题。标签强化模块使用标签语义和半超导信号来减少对海报标签的依赖。结构强化模块加强了基于复杂标签关系的标签代表比例。每个模块都是可插入性很强的,输入特征可以按照每个子任务需要添加或删除。我们进行了广泛的离线和在线的SSA/B实验,展示了大量A/B型模型。
Article 58
Title@2025-06-26 (4): MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting
Title: MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting | MockLLM: Ein Multi-Agent-Behavior-Kooperationsrahmen für Online-Jobsuche und Recruiting | MockLLLM:网上求职和招聘多代理行为协作框架 2405.18113v2 |
Authors (6): Hongda Sun, Hongzhan Lin, Haiyu Yan, Yang Song, Xin Gao, Rui Yan
Online recruitment platforms have reshaped job-seeking and recruiting processes, driving increased demand for applications that enhance person-job matching. Traditional methods generally rely on analyzing textual data from resumes and job descriptions, limiting the dynamic, interactive aspects crucial to effective recruitment. Recent advances in Large Language Models (LLMs) have revealed remarkable potential in simulating adaptive, role-based dialogues, making them well-suited for recruitment scenarios. In this paper, we propose MockLLM, a novel framework to generate and evaluate mock interview interactions. The system consists of two key components: mock interview generation and two-sided evaluation in a handshake protocol. By simulating both interviewer and candidate roles, MockLLM enables consistent and collaborative interactions for real-time and two-sided matching. To improve matching quality further, MockLLM incorporates reflection memory generation and dynamic strategy modification, refining behaviors based on previous experience. We evaluate MockLLM on real-world data from Boss Zhipin, a major Chinese recruitment platform. The experimental results indicate that MockLLM outperforms existing methods in matching accuracy, scalability, and adaptability across job domains, highlighting its potential to advance candidate assessment and online recruitment.
在线招聘平台改造了求职和招聘流程,驱动了对强化员工匹配的应用需求的增加。传统方法一般依赖分析简历和职务说明的文本数据,限制了对有效招聘至关重要的动态互动方面。大语言模式(LLMs)最近的进展显示,在模拟适应性、基于角色的对话方面,大语言模式(LLMs)的近期进展显示了显著的潜力,使其适合于招聘的情景。在本文中,我们提议了\ textbf{MockLLM},这是一个生成和评估模拟面试互动的新框架。这个系统由两个主要部分组成:模拟面试生成和握手协议中的双面评价。MockLLLM通过模拟面试和候选人角色,为实时和双面匹配提供了一致的协作互动。为了进一步提高匹配质量,MockLLM进一步纳入了反思记忆生成和动态战略修改,根据以往的经验完善行为。我们用真实世界数据评价MockLLM,Boss Zhipin,一个中国主要招聘平台。实验结果表明,MockLLM在匹配准确性、可缩缩略性、跨职业领域和适应性候选人在线评估方面,超越了现有方法。
Article 59
Title@2025-06-26 (4): SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
Title: SceneGenAgent: Precise Industrial Scene Generation with Coding Agent | SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent | SceneGenerAgenti: 精密工业场景与编码剂生成 2410.21909v3 |
Authors (8): Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at https://github.com/THUDM/SceneGenAgent .
工业场景建模对于工业制造中的仿真至关重要。大型语言模型(LLMs)在根据文本描述生成通用 3D 场景方面已取得显著进展,但由于工业场景要求精确的测量和定位,需要对空间布局进行复杂规划,用 LLM 生成工业场景是一项独特的挑战。为应对这一挑战,我们提出 SceneGenAgent,这是一个基于 LLM、通过 C# 代码生成工业场景的智能体。SceneGenAgent 通过结构化且可计算的格式、布局验证和迭代改进来确保精确的布局规划,以满足工业场景的定量要求。实验结果表明,由 SceneGenAgent 驱动的 LLM 超越了其原有性能,在真实工业场景生成任务中成功率最高达 81.0%,并能有效满足大多数场景生成要求。为进一步提高可用性,我们构建了 SceneInstruct,这是一个用于微调开源 LLM 以接入 SceneGenAgent 的数据集。实验表明,在 SceneInstruct 上微调开源 LLM 可带来显著的性能提升,其中 Llama3.1-70B 接近 GPT-4o 的能力。我们的代码和数据见 https://github.com/THUDM/SceneGenAgent 。
Article 60
Title@2025-06-26 (4): Large Language Models Acing Chartered Accountancy
Title: Large Language Models Acing Chartered Accountancy | Große Sprachmodelle Aking Chartered Accountancy | 特许会计会计 2506.21031v1 |
Authors (7): Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs (GPT-4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4) were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming the others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.
CA-Ben是一个特许会计基准,专门用来评价LLMS的财务、法律和定量推理能力。CA-Ben由印度特许会计师协会(注册会计师协会)进行的严格审查所产生的结构性问答数据集组成,涵盖基础、中间和高级CA课程阶段。结论强调目前LLMMS的长处和局限性,建议通过混合推理和定量分析(特别是定量分析)今后作出精确分析。
Article 61
Title@2025-06-26 (4): SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
Title: SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control | SAC: Ein Rahmen für die Messung und Induktion von Persönlichkeitseigenschaften in LLMs mit dynamischer Intensitätskontrolle | SAC: 具有动态强度控制的LMLM中测量和诱导个性轨迹的框架 2506.20993v1 |
Authors (5): Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.
近年来,大型语言模型(LLMs)在众多领域获得了广泛关注。人们也越来越期望它们在交互中展现类人的个性。为满足这一期望,许多研究提出了通过心理测量评估来建模 LLM 个性的方法。然而,大多数现有模型面临两大局限:它们依赖大五人格(OCEAN)框架,该框架只提供粗粒度的人格维度;并且缺乏控制特质强度的机制。本文通过扩展最初采用大五模型的机器人格量表(MPI),纳入 16 种人格因素(16PF)模型来弥补这一不足,从而实现对十六种不同特质的表达性控制。我们还提出了一个名为特定属性控制(SAC)的结构化框架,用于评估并动态诱导 LLM 的特质强度。我们的方法引入基于形容词的语义锚定来引导特质强度的表达,并利用涵盖五个强度因素的行为问题:频率(Frequency)、深度(Depth)、阈值(Threshold)、努力(Effort)和意愿(Willingness)。实验发现,与二元的特质开关相比,将强度建模为连续谱能带来明显更一致、更可控的人格表达。此外,我们观察到,目标特质强度的变化会以符合心理学的方向系统性地影响密切相关的特质,这表明 LLM 内化了多维人格结构,而非孤立地处理各个特质。我们的工作为医疗、教育和面试等领域中可控而细致的人机交互开辟了新途径,使我们向真正类人的社交机器更近一步。
Article 62
Title@2025-06-26 (4): SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Title: SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes | SharpZO: Hybrid Sharpness-Aware Vision Sprachmodell Prompt Tuning via Forward-Only Passes | SharpZO: 混合尖锐-敏锐视觉语言模型,通过前向-单行道快速调试 2506.20990v1 |
Authors (6): Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making it unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization, and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization, followed by a fine-grained local search via sparse ZO optimization. The entire optimization relies solely on forward passes. Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to 7% average gain over state-of-the-art forward-only methods.
微调视觉语言模型(VLMS)在各种下游任务中取得了显著的成绩;然而,它要求通过反向调整(BP)获得模型梯度,使其不适合于记忆限制的、仅推断的边缘设备。为解决这一局限性,先前的工作探索了各种不考虑BP的微调方法。然而,这些方法往往依赖于高差异进化战略(ES)或零级优化(ZO),往往不能取得令人满意的业绩。在本文件中,我们建议采用混合的夏丁斯-觉Zeros-顺序优化(SharpZO)方法,具体设计该方法的目的是通过敏锐度-敏锐的热度培训提高ZO VLM微调的性能。 SharpZO具有两个阶段的优化过程:敏锐的ES级演化阶段,通过全球探索和平缓冲损失场景以构建强大的初始化,然后通过稀薄ZO优化进行精细的本地搜索。整个优化完全依靠前方的传递。关于CLIP模型的详细理论分析和广泛的实验表明,SharpZO的高级前进速度大大改进了前向率和速度。
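To make "optimization that relies solely on forward passes" concrete, here is a minimal sketch of a generic randomized two-point zeroth-order gradient estimator on a toy quadratic loss. It is not SharpZO itself: the sharpness-aware ES warm-up and the sparse ZO stage are not reproduced, and the function names and hyperparameters are chosen only for illustration.

```python
import random


def zo_gradient(loss, params, rng, mu=1e-3, num_samples=20):
    """Estimate the gradient of `loss` at `params` from forward evaluations only."""
    dim = len(params)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random Gaussian probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [p + mu * ui for p, ui in zip(params, u)]
        minus = [p - mu * ui for p, ui in zip(params, u)]
        # Central finite difference approximates the directional derivative along u.
        slope = (loss(plus) - loss(minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def zo_descent(loss, params, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss, params, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic(x):
    """Toy loss standing in for a prompt-tuning objective."""
    return sum(v * v for v in x)


start = [1.0, -2.0]
end = zo_descent(quadratic, start)
```

Each estimate costs `2 * num_samples` loss evaluations and no backward pass, which is the trade-off that makes such methods attractive on inference-only hardware.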
Article 63
Title@2025-06-26 (4): SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
Title: SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization | SACL: Verständnis und Bekämpfung von Textbias im Code Retrieval mit semantisch-angereicherter Reranking und Lokalisierung | SACL: 理解和打击《规则》中与语义-增强的重新排级和本地化相结合的 “ 检索法 “ 中的 “ 理解和打击 “ 理论上的 “ 种族 “ 行为 2506.20081v2 |
Authors (3): Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
在这项工作中,我们通过在保存代码功能的同时系统掩蔽具体特征,对代码检索进行深入分析。我们的发现包括:(1) 尽管对代码进行了培训,但当前检索者严重依赖地表水平的文本特征(例如, docstries, 识别符号名称),以及(2) 他们表现出强烈的偏向于记录良好的代码,即使文件无关紧要。根据我们的发现,我们提议SACL,这是一个通过增加语义信息的代码或结构知识来丰富文本信息并减少偏向的框架。广泛的实验显示,SACL大大改进代码检索(例如,在HumanEval/MBPPP/SWE-Bench-Lite上,12.8%/9.4%/7.0%的回调@1,这也导致更好的代码生成性能(例如,在人文经济学上,4.88%的通过@1)。
Article 64
Title@2025-06-26 (4): Can Gradient Descent Simulate Prompting?
Title: Can Gradient Descent Simulate Prompting? | Kann Gradient Descent Simulate Prompting? | 梯子源模拟能刺激吗? 2506.20989v1 |
Authors (3): Eric Zhang, Leshem Choshen, Jacob Andreas
There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM’s own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance, showing improvement on the “reversal curse” tasks and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.
将新信息纳入语言模式(LM)有两种主要方式:改变其快速或改变其参数,例如微调。参数更新不产生长期存储成本。然而,对于许多模型更新来说,推动效果要大得多:推动模型能够从单一实例中强有力地概括,并得出标准微调下不会发生的逻辑推论。可以修改模型,以便微调能够效仿推理结果吗?本文描述了一个元培训LM方法,例如梯度更新可以模仿调整新信息的效果。我们的方法使用基于梯度的元学习工具,但使用LM自己推动的预测作为目标,消除对地面真相标签的需求。随后的梯度下降培训恢复了一些(有时还恢复全部)激励模型性能 – – 显示“逆向诅咒”任务的改进,并在单一梯度更新后回答关于文本段落的问题。这些结果表明,随着适当的初始化,梯度下降率可以令人惊讶地表达。我们的结果显示,通过新的途径可以进行长文本模型建模和直观基于梯度的一般学习能力。
Article 65
Title@2025-06-26 (4): Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models
Title: Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models | Vergleich von Retrieval-Augmentation und Parameter-Effizient Fine-Tuning für Datenschutz-Erhaltung Personalisierung von großen Sprachmodellen | 比较大语言模型的检索增强和参数有效微量微量美化,促进保护隐私和保持个人特征化 2409.09510v2 |
Authors (2): Alireza Salemi, Hamed Zamani
Despite their substantial impact on various search, recommendation, and question answering tasks, privacy-preserving methods for personalizing large language models (LLMs) have received relatively limited exploration. The primary approach in this area is retrieval-augmented generation (RAG), which generates personalized outputs by enriching the input prompt with information retrieved from the user’s personal data. This paper studies an approach orthogonal to RAG that involves learning user-dependent LLM parameters through parameter-efficient fine-tuning (PEFT). This paper presents the first systematic study of PEFT for LLM personalization and provides an extensive comparison between RAG- and PEFT-based solutions across a broad set of seven diverse datasets from the LaMP benchmark. Our results demonstrate that, on average, RAG- and PEFT-based personalization methods yield 14.92% and 1.07% improvements over non-personalized LLMs, respectively. When combining RAG with PEFT, we observe a further improvement of 15.98%, highlighting the effectiveness of their integration in enhancing personalized text generation. Additionally, we identify a positive correlation between the amount of user data available and the effectiveness of PEFT. This finding suggests that RAG is particularly beneficial for cold-start users (users with limited personal data), while PEFT performs better when more user-specific data is available.
尽管对各种搜索、建议和答答任务有重大影响,但大型语言模型(LLMs)个人化的隐私保护方法得到的探索相对有限,在这方面,通过检索增强的生成(RAG),通过利用用户个人数据获取的信息,丰富输入,产生个性化产出。本文研究对RAG的正统方法,涉及通过节能微调(PEFT)学习依赖用户的LLM参数。本文件介绍了首次系统研究LLM个人化PEFT的个人化PEFT, 并提供了RAG-和PEFT的解决方案之间的广泛比较。通过拉MP基准的七套不同数据集(RAG),这一领域存在一种主要方法,通过利用用户个人数据获取的信息来丰富输入信息,从而产生个性化产出。本文研究了对RAG-和PEFT的个人化方法分别产生14.92%和1.07%的改进。在将RAG与PFT相结合时,我们观察到了15.98%的进一步改进,强调它们融入个人化文本的效用。此外,我们确定在用户的可获取的冷端端端用户数据时,这种数据与PFTAFT数据进行更好的表现。
Article 66
Title@2025-06-26 (4): Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Title: Reward-Guided Speculative Decoding for Efficient LLM Reasoning | Belohnungsgeführte spekulative Dekodierung für effiziente LLM-Reasoning | 高效 LLM 理由说明的受奖励指导的投机性说明 2501.19324v3 |
Authors (8): Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios. The code is available at https://github.com/BaohaoLiao/RSD.
我们引入了一个旨在提高大语言模型(LLMS)推算效率的新框架(RSD),这是一个旨在提高大语言模型(LLMs)推算效率的新框架。RSD将轻量型草案模式与更强有力的目标模型协同结合,结合一种控制型偏向,将高回报产出优先排序,这与现行的严格不偏袒的投机解码方法形成对照。RSD采用一个过程奖励模式,评价中间解码步骤,并动态地决定是否采用目标模型,优化计算成本和产出质量之间的权衡。我们理论上证明,基于阈值的混合战略在资源利用和业绩之间实现了最佳平衡。对具有挑战性的推理基准(包括奥林匹克层面的任务)的广泛评价表明,RSDD在与目标模型解码方面带来显著的效益收益(比平均的平行解码方法(至+3.5)要高得多。这些结果突出表明,RSDD是在资源密集情景中部署LMSD是一种稳健和具有成本效益的方法。
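The threshold-based decision described in this abstract can be sketched as a simple control loop: keep the draft model's step when the process reward clears a threshold, otherwise fall back to the target model. The callables `draft_step`, `target_step`, and `reward`, as well as the default threshold and the stop token, are hypothetical stand-ins; the released RSD code may be organized quite differently.

```python
def rsd_generate(prompt, draft_step, target_step, reward,
                 threshold=0.7, max_steps=8, stop="<eos>"):
    """Generate reasoning steps, calling the target model only for low-reward drafts."""
    steps, sources = [], []
    for _ in range(max_steps):
        prefix = prompt + "".join(steps)
        candidate = draft_step(prefix)          # cheap proposal
        source = "draft"
        if reward(prefix, candidate) < threshold:
            candidate = target_step(prefix)     # expensive fallback
            source = "target"
        steps.append(candidate)
        sources.append(source)
        if candidate == stop:
            break
    return steps, sources


# Toy stand-ins: the reward is high only when the prefix length is even.
toy_steps, toy_sources = rsd_generate(
    "",
    draft_step=lambda prefix: "d",
    target_step=lambda prefix: "t",
    reward=lambda prefix, step: 0.9 if len(prefix) % 2 == 0 else 0.1,
    max_steps=3,
)
```

Raising `threshold` sends more steps to the target model (higher cost, usually higher quality); lowering it keeps more draft steps.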
Article 67
Title@2025-06-26 (4): Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization
Title: Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization | Ranken lernen für mehrere Retrieval-Augmented Modelle durch iterative Utility Maximierung | 通过迭代功用最大化学习多重检索增强型号排名 2410.09942v2 |
Authors (2): Alireza Salemi, Hamed Zamani
This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and RAG strategy. We introduce an iterative approach where the search engine generates retrieval results for the RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using an expectation-maximization algorithm, with the goal of maximizing each agent’s utility function. Additionally, we adapt this to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve the results for each of them. Experiments on datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that our approach significantly outperforms baselines on average across 18 RAG models. We demonstrate that our method effectively “personalizes” the retrieval for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.
本文探讨设计一个统一的搜索引擎,为多个检索增强的生成代理器(RAG)提供服务,每个代理器都有不同的任务、主干大语言模型(LLM)和RAG战略。我们采用了一种迭接方法,搜索引擎为RAG代理器产生检索结果,并在离线阶段收集对所检索文件质量的反馈。然后,这种反馈用于利用期望最大化算法迭接优化搜索引擎,目标是最大限度地发挥每个代理器的实用功能。此外,我们将此方法调整到一个在线环境,使搜索引擎能够根据实时单个代理器反馈改进其行为,以便更好地为每个代理器提供结果。关于知识强化语言任务基准(KILT)中数据集的实验表明,我们的方法在很大程度上超越了18个RAG模型的平均格式基线。我们证明,我们的方法有效地“个性化”了根据所收集的反馈检索每个RAG代理器的功能。最后,我们提供了一项全面的分析研究,以探讨我们方法的各个方面。
Article 68
Title@2025-06-26 (4): Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
Title: Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation | Jenseits der reaktiven Sicherheit: Risiko-Bewusst LLM-Ausrichtung über Long-Horizon Simulation | 超越反应安全性:通过长休松模拟使风险-警用LLM对齐 2506.20949v1 |
Authors (4): Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models’ ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.
鉴于语言模式代理人对从公共政策到医疗保健等社会决策的高度影响越来越大,确保其有利影响要求理解其建议具有深远影响。我们提议了一个概念证明框架,用以预测模型产生的建议如何通过大型社会系统长期以宏观规模传播,从而能够更有力地协调。为了评估语言模式的长期安全意识,我们还引入了100种间接伤害假设数据集,测试模型是否有能力预见看似无害的用户的不利、非明显结果。我们的方法不仅在新数据集上实现了20%以上的改进,而且比现有安全基准(AdvBench、SafeRLHF、WildGuardMix)的强基线平均赢率超过70%,为更安全的代理提供了有希望的方向。
Article 69
Title@2025-06-26 (4): Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
Title: Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters | Bewertung großer Sprachmodelle für automatisierte klinische Abstraktion in pulmonalen Embolism Registries: Performance Across Modellgrößen, Versionen und Parameter | 评价肺部新陈代谢登记簿自动临床抽象化的大型语言模型:不同模型大小、版本和参数的性能 2503.21004v2 |
Authors (9): Mahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian Wong
Pulmonary embolism (PE) registries accelerate practice-improving research but rely on labor-intensive manual abstraction of radiology reports. We examined whether openly available large language models (LLMs) can automate concept extraction from computed tomography PE (CTPE) reports without loss of data quality. Four Llama 3 variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and one reviewer model, Phi 4 14B, were tested on 250 dual-annotated CTPE reports from each of MIMIC IV and Duke University. Accuracy, positive predictive value (PPV) and negative predictive value (NPV) versus a human gold standard were measured across model size, temperature and shot count. Mean accuracy rose with scale: 0.83 (3.0 8B), 0.91 (3.1 8B) and 0.96 for both 70B variants; Phi 4 14B reached 0.98. Accuracy differed by less than 0.03 between datasets, indicating external robustness. In dual-model concordance (L3 70B plus Phi 4 14B), PPV for PE presence was at least 0.95 and NPV at least 0.98, while location, thrombus burden, right heart strain and image quality artifacts each achieved PPV of at least 0.90 and NPV of at least 0.95. Fewer than four percent of individual concept annotations were discordant, and full agreement occurred in more than seventy-five percent of reports. Large language models therefore provide a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can safeguard data quality with minimal human oversight.
肺栓塞(PE)登记库可加速改进临床实践的研究,但依赖对放射学报告进行劳动密集型的人工摘录。我们考察了公开可用的大型语言模型(LLMs)能否在不损失数据质量的前提下,自动从计算机断层扫描 PE(CTPE)报告中提取概念。我们在来自 MIMIC IV 和杜克大学各 250 份经双人标注的 CTPE 报告上测试了四个 Llama 3 变体(3.0 8B、3.1 8B、3.1 70B、3.3 70B)和一个审查模型 Phi 4 14B。我们针对模型规模、温度和示例数量,以人工金标准为参照测量了准确率、阳性预测值(PPV)和阴性预测值(NPV)。平均准确率随规模上升:0.83(3.0 8B)、0.91(3.1 8B),两个 70B 变体均为 0.96;Phi 4 14B 达到 0.98。不同数据集之间的准确率差异小于 0.03,表明具有外部稳健性。在双模型一致(L3 70B 加 Phi 4 14B)的情况下,PE 存在与否的 PPV 至少为 0.95,NPV 至少为 0.98;位置、血栓负荷、右心劳损和图像质量伪影各项的 PPV 均至少为 0.90,NPV 均至少为 0.95。不一致的单项概念标注不到 4%,超过 75% 的报告完全一致。因此,大型语言模型为 PE 登记库摘录提供了可扩展且准确的解决方案,双模型审查流程能够以最少的人工监督保障数据质量。
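The three metrics reported in this abstract follow directly from a confusion matrix against the human gold standard. The sketch below shows the standard definitions for one binary concept (for example, PE present or absent), plus a simple agreement rate between two models; the helper names are illustrative, and the study's actual dual-model review workflow is not specified in the abstract.

```python
def binary_metrics(predicted, gold):
    """Accuracy, PPV and NPV for boolean predictions against boolean gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    tn = sum(1 for p, g in zip(predicted, gold) if not p and not g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # PPV: of the reports flagged positive, the fraction that are truly positive.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        # NPV: of the reports flagged negative, the fraction that are truly negative.
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }


def agreement_rate(model_a, model_b):
    """Fraction of reports on which two models produce the same label."""
    return sum(1 for a, b in zip(model_a, model_b) if a == b) / len(model_a)


# Tiny made-up example with four reports.
example = binary_metrics([True, True, False, False], [True, False, False, True])
```

In a dual-model setup, reports where the two models disagree are the natural candidates for human review.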
Article 70
Title@2025-06-26 (4): PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | PP-DocBee: Multimodales Dokumentenverständnis durch Tricks verbessern | PP-Docbee:通过一袋小把戏改进多式文件理解 2503.04065v3 |
Authors (7): Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at https://github.com/PaddlePaddle/PaddleMIX.
随着数字化的快速发展,各种文件图像正在生产和日常生活中更加广泛地应用,越来越迫切需要对文件图像的内容进行快速和准确的分辨,因此,本报告介绍了PP-Docbee,这是为理解端到端文件图像而设计的新的多式大型语言模型。首先,我们为文件情景设计了一个数据综合战略,以建立多样化的数据集来改进模型的概括化。然后,我们应用了一些培训技术,包括动态比例抽样、数据预处理和OCR后处理战略。广泛的评价表明PP-Docbee的优异性表现,在英文文件理解基准方面达到了最新的结果,甚至超过了中国文件理解中的现有开放源码和商业模型。源代码和预先培训的模型可在href{https://github.com/PaddlePadddle/PaddleMIX_https://github.com/PaddlePaddddle/dleMIX}上公开查阅。
Article 71
Title@2025-06-26 (4): KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Title: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model | KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell | KaLM-Embedding-V2:高级培训技术和数据预报 2506.20923v1 |
Authors (17): Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
在本文中,我们提出KaLM-Embedding-V2,这是一个多功能和紧凑的嵌入模式,通过利用高级培训技术和数据,在通用文本嵌入任务中取得令人印象深刻的业绩,我们的主要创新包括:(1) 为更好地使结构与代表性学习保持一致,我们去除因果关注面罩,并采用一个完全双向变压器,配有简单而有效的平均池化,以产生固定长度嵌入;(2) 我们采用一个多阶段培训管道:(一) 在大规模、弱监督的开放源码语料上进行预培训;(二) 对高质量检索和非检索数据集进行微调;以及(三) 通过模型汤(model-soup)参数平均实现稳健的泛化。此外,我们引入了一种焦点式重新加权机制,将学习集中在困难样本上,并采用在线硬负样本混合战略,以持续丰富硬负样本,而无需昂贵的离线挖掘;(3) 我们收集了20多类预培训数据和100类微调数据,以提升嵌入模型的性能和泛化能力。对大规模文本嵌入基准(MTEB)中文和英文部分的广泛评估表明,我们的模型明显优于其他规模相当的模型,并可与规模大3倍、14倍、18倍和26倍的嵌入模型竞争,为参数少于1B的多功能紧凑嵌入模型树立了新标准。
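Two of the ingredients named in this abstract, mean-pooling over token vectors and model-soup parameter averaging, are simple enough to sketch. The NumPy code below is only an illustration under stated assumptions: the function names `mean_pool` and `soup`, the array shapes, and the dict-of-arrays checkpoint format are hypothetical and are not taken from the KaLM-Embedding-V2 code base.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (B, 1), avoids /0
    return summed / counts

def soup(checkpoints):
    """Uniform parameter averaging over checkpoints (hypothetical dict-of-arrays format)."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
embedding = mean_pool(tokens, mask)

averaged = soup([{"w": np.array([1.0, 3.0])}, {"w": np.array([3.0, 5.0])}])
```

Because the mask zeroes out padded positions before averaging, the padding token in the example does not affect the resulting fixed-length embedding.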
Article 72
Title@2025-06-26 (4): FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language
Title: FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language | FineWeb2: Eine Pipeline, um sie alle zu skalieren – Anpassung der Vorschulungsdatenverarbeitung an jede Sprache | FineWeb2: 将全部标准缩放的一条管道 – – 将培训前数据处理适应于每种语言 2506.20920v1 |
Authors (10): Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
培训前最先进的大型语言模型(LLMS)需要大量清洁和多样化的文本数据。虽然在开放开发高质量的大规模高质量英语培训前数据集方面最近取得了长足进展,但培训表现优秀的多语言LMS仍是一个挑战,这在很大程度上是由于将过滤和分解管道与大量语言相适应的固有困难。在这项工作中,我们采用了一种新的基于FineWeb的训练前数据整理管道,可以自动调整,以支持任何语言。最后,我们用一套九种不同语言,在一套有意义和内容丰富的评价任务的指导下,将管道的设计选择扩大至一千多种语言,这一套任务是通过一套基于可计量标准的新的选择过程选择的。最后,我们表明我们的管道可以用来创建非英语公司,产生比先前的数据集更多的性能模型。我们还引入了一种直接和原则性的方法来重新平衡数据集,既考虑到重复的计算和质量,又提供了额外的性能提升。最后,我们用近100种共同的纸笔缩缩图将我们的管道设计方法扩大到1000多种语言,这是我们制作了50亿个新版本的多语言版本文件,这是我们制作的多语言版本的多语言,我们制作了20世纪的版本。
Article 73
Title@2025-06-26 (4): Optimising Language Models for Downstream Tasks: A Post-Training Perspective
Title: Optimising Language Models for Downstream Tasks: A Post-Training Perspective | Sprachmodelle für Downstream-Aufgaben optimieren: Eine Perspektive nach dem Training | 优化下游任务的语言模式:培训后展望 2506.20917v1 |
Authors (1): Zhengyan Shi
Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.
语言模型(LMS)在NLP中表现出了非凡的能力,然而,高效和有力地调整这些模型以适应具体任务,仍然具有挑战性。随着其规模和复杂性的增长,在标签数据上微调LMS的LMS微调LMS往往没有充分利用现有的未贴标签数据,导致过多地配置小型任务集,并带来巨大的计算成本。这些局限性妨碍了将其应用于现实世界语言任务的开放式环境。这一论文提出了一系列方法,以更好地使LMS更好地适应下游应用。首先,我们探索从未贴标签的数据中提取与任务相关的知识的战略,引入新的持续培训前技术,这种技术超越了先进的半监督方法。接下来,我们提出了一个具有参数效率的微调LMM方法,大大降低记忆力和计算成本,同时保持有竞争力的业绩。我们还引入了经过监督的改进的微调方法,使LM系统能够更好地遵守指示,特别是当贴标签数据稀少时,提高他们在一系列NLP任务中的绩效,包括开放式一代。最后,我们制定了新的评价方法和基准,例如多步式的空间推式推式推式推式推式推方法,使更稳健的半超方法更深入地推进,使LMSLM系统更接近于更接近于更深入地研究,使LMSLM的能力和全面地展示。
Article 74
Title@2025-06-25 (3): Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues
Title: Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues | Erforschung von fünf großen Persönlichkeits- und KI-Kapazitätseffekten in LLM-simulierten Verhandlungsdialogen | 探讨LLM模拟谈判对话中的五大个性和AI 2506.15928v2 |
Authors (7): Myke C. Cohen, Zhe Su, Hsien-Te Kao, Daniel Nguyen, Spencer Lynch, Maarten Sap, Svitlana Volkova
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluate how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes, a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations; we find that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detect fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics (specifically transparency, competence, and adaptability), demonstrating how AI agent trustworthiness impacts mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.
本文介绍了在任务关键谈判背景下采用人工智能系统的评估框架,解决了对能够适应不同人类操作者和利益攸关方的人工智能代理人的需要。我们用软体模型作为模拟测试台,提出两个实验,系统评估个人特征和人工智能特征对LLM模拟社会谈判结果的影响如何,这种能力对于涉及跨团队协调和军民互动的各种应用至关重要。实验1采用因果发现方法衡量个人特征如何影响价格讨价还价谈判,通过这种方法,我们发现,可接受性和外向对可视性、目标实现和知识获取成果产生重大影响。从团队通信中得出的社会认知法措施发现,在代理人的光学通信、道德基础和观点模式方面,存在着细微的差别,为在高团队协调和军民互动中可靠运行的人工智能系统提供了可操作的洞察力。实验2通过调整模拟人的个性特征和人工智能系统特点,特别是透明度、能力、适应性能、适应性,表明AI代理对任务有效性的影响如何。这些结论确立了一种可重复的评价方法,用以在机构性磁性动态方面,即我们操作者们的操作性标准,将公司性动态直接纳入公司性工作进展系统。
Article 75
Title@2025-06-25 (3): GroundCap: A Visually Grounded Image Captioning Dataset
Title: GroundCap: A Visually Grounded Image Captioning Dataset | GroundCap: Ein visuell geerdeter Bildbeschriftungs-Datensatz | GroundCap:视觉定位图像说明数据集 2502.13898v3 |
Authors (3): Daniel A. P. Oliveira, Lourenço Teodoro, David Martins de Matos
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking. We present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and the segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B and Qwen2.5-VL 7B on GroundCap. Human evaluation demonstrates our approach’s effectiveness in producing verifiable descriptions with coherent object references.
当前图像字幕系统缺乏将描述文字与特定视觉元素连接起来的能力,使其产出难以校验。 虽然最近的方法提供了一些基础能力, 但它们无法同时跟踪多个引用或地面动作和对象的物体身份。 我们提出一个新的基于 ID 的定位系统, 使对象参考跟踪和动作对象连接一致。 我们提出GroundCap, 数据集包含77部电影的52,016个图像, 有344个人类附加说明和52,016个自动生成字幕。 每个字幕都基于检测到的物体( 132类)和行动( 51类) , 使用标签系统, 维持对象身份, 同时将动作与相应对象连接起来。 我们的方法为参考跟踪、 明确的行动对象链接和通过 K- means 集对背景元素进行分解设置。 我们提出GMETEOR, 将字幕质量与地标精度相结合的参数, 并通过对 Pixtral-12B 和 Quen2.5- VL 7B 进行微调确定基线性性能。 人类评估表明我们的方法在用一致的物体引用进行可核实的描述方面的有效性。
Article 76
Title@2025-06-25 (3): A3: an Analytical Low-Rank Approximation Framework for Attention
Title: A3: an Analytical Low-Rank Approximation Framework for Attention | A3: ein analytischer Rahmen für die Annäherung an den Low-Rank-Wert | A3: 分析性低Rank接近度关注框架 2505.12942v3 |
Authors (7): Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\mathtt{A}^{\mathtt{3}}$, a post-training low-rank approximation framework. $\mathtt{A}^{\mathtt{3}}$ splits a Transformer layer into three functional components, namely $\mathtt{QK}$, $\mathtt{OV}$, and $\mathtt{MLP}$. For each component, $\mathtt{A}^{\mathtt{3}}$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss (i.e., error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\mathtt{A}^{\mathtt{3}}$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\mathtt{A}^{\mathtt{3}}$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
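大型语言模型表现出了显著的性能; 然而, 其巨大的参数计数使得部署费用非常昂贵。 低秩近似提供了很有希望的压缩解决方案, 但现有方法有两个主要局限性:(1) 侧重于将单个线性层的输出误差最小化, 不考虑 Transformer 的架构特征, 并且(2) 将一个大的权重矩阵分解成两个小的低秩矩阵。 因此, 这些方法往往不如剪枝和量化等其他压缩技术, 并引入运行时开销, 比如为分解后的小矩阵额外启动 GEMM 核函数。 为了解决这些局限性, 我们提出 $\mathtt{A}^{\mathtt{3}}$, 一个训练后低秩近似框架。 $\mathtt{A}^{\mathtt{3}}$ 将一个 Transformer 层拆分为三个功能组件, 即 $\mathtt{QK}$、$\mathtt{OV}$ 和 $\mathtt{MLP}$。 对于每个组件, $\mathtt{A}^{\mathtt{3}}$ 给出一个解析解, 在缩小组件内部隐藏维度的同时, 使该组件的功能损失(即注意力分数、注意力输出和 MLP 输出的误差)最小化。 这种方法直接减小模型规模、KV 缓存大小和 FLOPs, 且不引入任何运行时开销。 此外, 它把优化问题从单个线性层的损失优化推进到改善端到端性能。 大量实验表明, $\mathtt{A}^{\mathtt{3}}$ 的性能优于现有最先进方法(SoTA)。 例如, 在相同的计算和内存缩减预算下, 经我们低秩近似的 LLaMA 3.1-70B 在 WikiText-2 上的困惑度为 4.69, 比此前 SoTA 的 7.87 低 3.18。 我们还展示了 $\mathtt{A}^{\mathtt{3}}$ 的通用性, 包括 KV 缓存压缩、量化以及用于提升性能的混合秩分配。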
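To make the idea of compressing a functional component rather than a single linear layer concrete, the following NumPy sketch treats the pre-softmax score map as depending on the product of the query and key projections, and shrinks the head dimension with a truncated SVD of that product. This is a simplified illustration under stated assumptions, not the paper's algorithm: it ignores input statistics, positional encodings such as RoPE, and multi-head structure, and the name `compress_qk` and all shapes are hypothetical.

```python
import numpy as np

def compress_qk(w_q, w_k, rank):
    """Return (w_q_r, w_k_r), each with `rank` columns, such that
    w_q_r @ w_k_r.T is the best rank-`rank` approximation (Frobenius norm)
    of w_q @ w_k.T, by the Eckart-Young theorem."""
    u, s, vt = np.linalg.svd(w_q @ w_k.T, full_matrices=False)
    root = np.sqrt(s[:rank])            # split singular values between both factors
    return u[:, :rank] * root, vt[:rank].T * root

rng = np.random.default_rng(0)
d_model, d_head, rank = 16, 8, 4        # hypothetical sizes
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))

w_q_r, w_k_r = compress_qk(w_q, w_k, rank)
full = w_q @ w_k.T
approx = w_q_r @ w_k_r.T
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
```

The compressed projections are still plain dense matrices, only with a smaller inner dimension, so no extra matrix multiplications are introduced at inference time and the cached keys shrink from `d_head` to `rank` values per token in this toy setting.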
Article 77
Title@2025-06-25 (3): Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Title: Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine | Entscheiden Sie weniger, kommunizieren Sie mehr: Auf dem Konstrukt Gültigkeit der End-to-End-Fact-Checking in der Medizin | 决定少决定少决定少决定,交流多交流: 2506.20876v1 |
Authors (9): Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li
Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.
科技进步导致在被视为具有挑战性的任务(如自动核对事实)方面取得具体进展,例如自动核对事实; 有兴趣采用这些公共卫生和医药系统,由于医疗决定和对广泛而多样的医疗文献进行批判性评估的挑战性质,对采用这些公共卫生和医药系统的兴趣已经增加。 循证医学与每一个人都有联系,但医学性质却技术性很强,使大多数使用者的医疗知识知识不足以充分浏览领域。 医疗通信问题使最终到最终核实事实的代理商的土壤受到侵蚀:检查一项针对现有医疗文献的索赔,并用证据支持的判决返回。然而,这类系统在很大程度上仍未被使用。为了理解这一点,我们提出第一份研究报告,审查临床专家如何通过综合医学证据来核实社会媒体的真实索赔。在寻找这一具有高度局限性的医学应用时,我们揭示了端到端检查中的基本挑战:在临床试验中难以将要求与科学证据联系起来; 定义不足的索赔与不匹配的意图混杂; 以及内在的主观真实性标签。我们争辩说,应当将事实检查作为交互式交流问题而不是最终处理和评价。
Article 78
Title@2025-06-25 (3): Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA
Title: Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA | Leaner Training, Lower Leakage: Die Erinnerung an LLM Fine-Tuning mit LoRA | 皮皮培训,《下下渗漏:重新研究LLM与LORA的精细调整的记忆 2506.20856v1 |
Authors (2): Fei Wang, Baochun Li
Memorization in large language models (LLMs) makes them vulnerable to data extraction attacks. While pre-training memorization has been extensively studied, fewer works have explored its impact in fine-tuning, particularly for LoRA fine-tuning, a widely adopted parameter-efficient method. In this work, we re-examine memorization in fine-tuning and uncover a surprising divergence from prior findings across different fine-tuning strategies. Factors such as model scale and data duplication, which strongly influence memorization in pre-training and full fine-tuning, do not follow the same trend in LoRA fine-tuning. Using a more relaxed similarity-based memorization metric, we demonstrate that LoRA significantly reduces memorization risks compared to full fine-tuning, while still maintaining strong task performance.
大型语言模型(LLMs)的记忆化使得他们容易受到数据提取攻击。虽然对培训前记忆化进行了广泛研究,但探索其微调影响的工作较少,特别是Lora微调,这是一个广泛采用的具有参数效率的方法。在这项工作中,我们重新审查微调的记忆化,发现不同微调战略与以往发现的差异令人惊讶。模型规模和数据重复等因素,严重影响了培训前记忆化和全面微调,但与Lora微调的类似趋势不同。我们使用较宽松的类似度模化标准,证明Lora与全面微调相比,大量减少记忆化风险,同时保持很强的任务性能。
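For readers unfamiliar with LoRA, the parameter-efficient method studied in this abstract, the standard formulation keeps the pretrained weight frozen and learns a low-rank update scaled by alpha / r. The NumPy sketch below shows that general formulation only; the class name `LoRALinear`, the initialization scale, and the shapes are illustrative assumptions and are unrelated to the paper's experimental code.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.a = rng.normal(scale=0.01, size=(rank, d_in))    # trainable, small random init
        self.b = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; the two small matmuls avoid forming B @ A.
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def merged_weight(self):
        # Equivalent dense weight, useful for deployment after fine-tuning.
        return self.weight + self.scale * self.b @ self.a

    def trainable_params(self):
        return self.a.size + self.b.size

w = np.arange(24, dtype=float).reshape(6, 4)
layer = LoRALinear(w, rank=2, alpha=4.0)
x = np.ones((3, 4))
y = layer(x)
```

With `b` initialized to zero the layer starts out identical to the frozen base layer, and only `rank * (d_in + d_out)` parameters are updated during fine-tuning.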
Article 79
Title@2025-06-25 (3): Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
Title: Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training | Datenschutz Ripple Effekte von Hinzufügen oder Entfernen von persönlichen Informationen in Sprachmodell-Training | 语言模式培训中增加或删除个人信息对隐私的波纹效应 2502.15680v2 |
Authors (6): Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo
Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large-language model (LLM) training. Beyond this, PII may be added or removed from training datasets due to evolving dataset curation techniques, because they were newly scraped for retraining, or because they were included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as $\approx 7.5\times$); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.
由于个人识别信息(PII)的敏感性质,其所有者可能有权控制其纳入或要求将其从大语言模型(LLM)培训中除名,此外,由于不断演变的数据集校正技术,或者由于这些技术是新报废的再培训,或者由于它们被列入一个新的下游微调阶段,PII的量和易读性是整个培训管道中演变的模型的动态属性,并取决于通常改变的设计选择。我们把三种新现象定性为:(1)在培训中后来看到的类似出现的PII可导致我们称之为辅助记忆化的早期序列的记忆化,这是一个重要的因素(在我们的环境里,最高可达1/3);(2)添加PII可大大提高其他PII的记忆化(在我们的环境里,相当于$\approx 7.5\times$);(3)去除PII可导致其他PII的记忆化。当培训模型避免新的风险时,模型设计者应考虑这些第一级和第二级隐私风险。
Article 80
Title@2025-06-25 (3): Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes
Title: Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes | Enthüllen versteckter gewalttätiger Tendenzen in LLMs: Eine demografische Analyse über Verhaltensvignetten | 《LLMs中隐蔽的隐藏暴力时期:通过行为征兆进行的人口分析》 2506.20822v1 |
Authors (2): Quintin Myers, Yanjun Gao
Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs' surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.
为了发现和应对网上暴力内容,人们越来越多地提出大型语言模型(LLMs),然而,它们对道德模糊、现实世界情景的思考能力仍然不足。我们提出第一份研究报告,利用一种有效的社会科学工具,即暴力行为Vignette 问卷(VBVQ)来衡量人类对日常冲突的反应。为了评估潜在的偏见,我们在美国引入了基于人、因人、因种族、年龄和地理特征而异的刺激。在不同地缘政治和组织背景下开发的六个LLMs,在统一零射场下进行评估。我们的研究揭示了两大结论:(1) LLMs地表层文本的生成往往与其内部对暴力反应的偏好不同;(2) 其暴力趋势在人口结构上各不相同,经常与在犯罪学、社会科学和心理学方面的既定发现相矛盾。
Article 81
Title@2025-06-25 (3): MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering
Title: MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | MultiFinRAG: Ein optimiertes multimodales Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | 多金融inRAG: 最佳多式联运回收-提款一代(RAG)金融问题回答框架 2506.20821v1 |
Authors (3): Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Financial documents–such as 10-Ks, 10-Qs, and investor presentations–span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
金融文件 – – 如10K、10Q和投资者陈述 – – 长达数百页,并综合了多种模式,包括密集的叙述性文字、结构化表格和复杂数字。回答关于这类内容的问题,往往需要不同模式的联合推理,这些模式由于象征性限制、布局损失和分散的跨模式背景,使传统的大型语言模型和检索增强的一代管道紧张。我们引入了多功能反向战略,即为金融QA建立的一种检索强化的生成框架。多功能搜索组首先通过将表格和图象图像分组成批进行多式联运提取,并将它们发送到一个轻量的、量化的开放源式多式联运LLM,后者既产生结构化的JSON产出,又产生简明的文本摘要。这些产出加上叙述性文字,与模式认知相似的临界值一致,以便精确检索。我们引入了一个分级后退战略,然后在必要时从只读文本+表+image环境动态升级,使跨模式推理,同时减少不相干的背景。尽管在商品硬件上运行,多功能-高端的FinRAG在高端组合文本上达到19个百分比的精确度,但也实现了。
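The tiered fallback described in this abstract, starting from text-only context and escalating to text+table+image only when necessary, can be pictured with a small selection routine. The sketch below is a guess at the general shape of such a strategy: the `Chunk` type, the per-modality `thresholds`, and the `min_hits` escalation rule are all hypothetical and are not taken from MultiFinRAG.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str   # "text", "table", or "image"
    score: float    # similarity between the chunk and the query
    content: str

def tiered_context(chunks, thresholds, min_hits=2):
    """Try text-only retrieval first; escalate to all modalities if too few hits."""
    tiers = [("text",), ("text", "table", "image")]
    selected = []
    for tier in tiers:
        selected = [
            c for c in chunks
            if c.modality in tier and c.score >= thresholds[c.modality]
        ]
        if len(selected) >= min_hits:
            break   # enough evidence at this tier, no need to add more context
    return sorted(selected, key=lambda c: c.score, reverse=True)

thresholds = {"text": 0.7, "table": 0.6, "image": 0.6}   # modality-aware cut-offs (assumed values)
chunks = [
    Chunk("text", 0.9, "revenue discussion"),
    Chunk("text", 0.4, "boilerplate"),
    Chunk("table", 0.8, "income statement"),
    Chunk("image", 0.3, "logo"),
]
context = tiered_context(chunks, thresholds)
```

In the example only one text chunk clears its threshold, so the routine escalates and pulls in the table chunk, while the low-scoring image stays out of the prompt.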
Article 82
Title@2025-06-25 (3): The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Title: The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas | Die Ideation-Execution-Gap: Ergebnisse der LLM-generierten gegen menschliche Forschungsideen | 观察与执行差距:LLM-Genered与人类研究概念的执行结果 2506.20803v1 |
Authors (3): Chenglei Si, Tatsunori Hashimoto, Diyi Yang
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
大型语言模型(LLMS)在加速科学研究管道方面表现出了希望。 这一过程的一个关键能力是能够产生新的研究想法,而先前的研究发现,LLM公司产生的研究想法被评为比人类专家想法更新颖的,但是,一个好的想法不应该简单看起来是新颖的,它在执行后也应导致更好的研究。为了检验AI公司产生的想法是否导致更好的研究成果,我们开展了一项执行研究,征聘了43名专家研究人员随机执行专家或专家或LLM公司提出的想法。 每位专家花费了100多小时的时间来实施这一想法,并写了一份四页的短文件来记录实验。然后,所有执行的项目都由NLP公司的专家研究人员盲目地加以审查。比较了相同想法在实施前后的评分,LM公司提出的想法的得分比专家撰写的关于所有评价指标(新颖、刺激、有效和整体;第 < 0.05页)的构想。 每位专家都花100多小时的时间来落实这一想法,并撰写了一份四页短的短的短的论文。 在比较执行过程中,所有执行中的所有项目项目都被盲目审查的评分时,然后由NLPM公司研究人员。 研究的评分中,我们观察到在实际的评分中观察到的评分中,甚至观察了执行中的许多评测测测测了执行中的许多评分。
Article 83
Title@2025-06-25 (3): Multi-lingual Functional Evaluation for Large Language Models
Title: Multi-lingual Functional Evaluation for Large Language Models | Multilinguale Funktionsbewertung für große Sprachmodelle | 大语言模式多语文职能评价 2506.20793v1 |
Authors (3): Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks – Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) – by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e., across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, there is a 15-24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) being the most consistently well performing across evaluation iterations.
大型语言模式的多语言能力往往通过固定数据基准,如Belebele、MM-MMLU和M-GSM等静态数据基准进行评估。然而,这些评价往往无法充分理解多种语言环境中模式的实际业绩和稳健性。作为回应,我们创建了多种语言功能基准 – – 跨语言年级学校数学符号(CL-GSM符号)和跨语言教学-落实Eval(CL-IFEval) – – 将现有功能基准模板从英语翻译成另外五种语言,这五种语言涵盖国家语言的可用资源范围:法语、西班牙语、印地语、阿拉伯语和约鲁巴语。我们的结果表明,一些静态多语言基准的功能性能比其他基准更接近于其他标准(例如,在各种模式中,M-GSM-GSM和C-L-GSMC符号(C-IFEval)之间的性能分别下降了24%、17%和18%;同样,Belebele和C-I-IFval之间的性表现模式只有0.5%至3%。
Article 84
Title@2025-06-25 (3): CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Title: CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement | CodeLutra: Steigerung der LLM-Code-Generierung durch Präferenz-geführte Verfeinerung | 代码Lutra:通过优先指导改进促进LLM代码生成 2411.05199v3 |
Authors (7): Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra
Large Language Models (LLMs) have revolutionized code generation but require significant resources and often over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective alternative. However, standard supervised approaches rely only on correct examples, missing valuable insights from failures. We introduce CodeLutra, a framework that leverages both correct and incorrect code attempts. Instead of using only correct solutions, CodeLutra applies iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This approach narrows the performance gap with state-of-the-art larger models without requiring massive datasets or auxiliary models. For instance, on a challenging data science coding task, using only 500 samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s level. By learning from both successes and mistakes, CodeLutra provides a scalable and efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
大型语言模型(LLMS)已经使代码生成发生革命,但需要大量资源,而且往往过于宽泛,限制了任务特定的效率。微调小型、开放源码的LLMS提供了具有成本效益的替代方法。然而,标准监督方法只依靠正确的例子,从失败中缺少宝贵的洞察力。我们引入了CodLutra,这是一个利用正确和错误代码尝试的杠杆框架。CodLutra不仅使用正确的解决方案,还应用了反复的优惠改进,比较了成功和失败的产出,以更好地接近预期的结果。这种方法缩小了与最先进的大型模型的绩效差距,而不需要大规模数据集或辅助模型。例如,在具有挑战性的数据科学编码任务上,仅使用500个样本改进了Llama-3-8B的精确度,从28.2%到48.6%,接近了GPT-4的水平。CodLutra从成功和错误中汲取了一条可扩展而有效的途径,从而实现高质量代码的生成,使较小的开放源码模型与主要的封闭源替代方法更具竞争力。
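The core idea of learning from both correct and failed code attempts can be illustrated by pairing every passing attempt with every failing one and scoring each pair with a pairwise logistic (Bradley-Terry style) loss. This is a generic preference-learning sketch, not CodeLutra's confirmed objective: the helper names, the toy `is_correct` checker, and the scalar scores standing in for model log-likelihoods are all assumptions.

```python
import math
from itertools import product

def build_pairs(attempts, is_correct):
    """Pair each passing attempt (chosen) with each failing attempt (rejected)."""
    passed = [a for a in attempts if is_correct(a)]
    failed = [a for a in attempts if not is_correct(a)]
    return list(product(passed, failed))

def preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)); small when chosen is ranked higher."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

# Toy task: the generated code should add two numbers.
attempts = ["lambda a, b: a + b", "lambda a, b: a - b", "lambda a, b: b + a"]

def is_correct(src):
    # Hypothetical checker: run the candidate on one test case.
    return eval(src)(2, 3) == 5

pairs = build_pairs(attempts, is_correct)
loss_good = preference_loss(2.0, 0.0)   # model already prefers the correct attempt
loss_bad = preference_loss(0.0, 2.0)    # model prefers the failing attempt
```

In an iterative setup of this kind, new attempts would be sampled from the updated model after each round and the pairs rebuilt, so the failing examples keep tracking the model's current mistakes.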
Article 85
Title@2025-06-25 (3): Towards Probabilistic Question Answering Over Tabular Data
Title: Towards Probabilistic Question Answering Over Tabular Data | Auf dem Weg zu einer probabilistischen Fragestellung über tabellarische Daten | 走向在表格数据上回答概率问题 2506.20747v1 |
Authors (3): Chen Shen, Sajjadur Rahman, Estevam Hruschka
Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
相对于诸如NL2SQL系统等表列数据,目前问答(QA)的当前方法对于直接从表格中检索答案的事实问题效果良好,然而,这些方法在不确定情况下需要推理的概率问题方面却不尽如人意。在本文中,我们引入了新的基准LUCARIO和一个比大表列数据概率QA的框架。我们的方法引导贝叶斯网络从表格中引入,将自然语言查询转化为概率查询,并使用大语言模型来生成最终答案。经验性结果显示比基线有重大改进,突出了混合符号性推理的好处。
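To make the notion of a probabilistic question over a table concrete, the sketch below estimates a conditional probability by counting matching rows. It is a deliberately simple stand-in, not the Bayesian-network induction the abstract describes; the toy table, its column names, and the `conditional_probability` helper are all hypothetical.

```python
from fractions import Fraction

# Hypothetical toy table; real tables would be far larger.
ROWS = [
    {"region": "north", "plan": "premium", "churned": True},
    {"region": "north", "plan": "basic", "churned": False},
    {"region": "south", "plan": "premium", "churned": False},
    {"region": "south", "plan": "premium", "churned": True},
    {"region": "north", "plan": "premium", "churned": True},
]


def conditional_probability(rows, target, evidence):
    """Estimate P(target | evidence) as a ratio of row counts."""
    matching = [r for r in rows if all(r[k] == v for k, v in evidence.items())]
    if not matching:
        return None  # the evidence never occurs, so the estimate is undefined
    hits = sum(1 for r in matching if all(r[k] == v for k, v in target.items()))
    return Fraction(hits, len(matching))


# "How likely is a premium customer to churn?"
p_churn = conditional_probability(ROWS, {"churned": True}, {"plan": "premium"})
```

Direct counting breaks down when the evidence combination is rare or absent in the table, which is one motivation for fitting a structured model such as a Bayesian network instead.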
Article 86
Title@2025-06-25 (3): MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation
Title: MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation | MAGPIE: Ein Datensatz für die Multi-Agenten-Bewertung kontextueller Privatsphäre | MAGPIE:多智能体情境隐私评估数据集 2506.20737v1 |
Authors (4): Gurusha Juneja, Alon Albalak, Wenyue Hua, William Yang Wang
The proliferation of LLM-based agents has led to increasing deployment of inter-agent collaboration for tasks such as scheduling, negotiation, and resource allocation. In such systems, privacy is critical, as agents often access proprietary tools and domain-specific databases requiring strict confidentiality. This paper examines whether LLM-based agents demonstrate an understanding of contextual privacy and whether, when instructed, these systems preserve user privacy at inference time in non-adversarial multi-turn conversations. Existing benchmarks for evaluating contextual privacy in LLM agents primarily assess single-turn, low-complexity tasks where private information can be easily excluded. We first present a benchmark, MAGPIE, comprising 158 real-life high-stakes scenarios across 15 domains. These scenarios are designed such that complete exclusion of private data impedes task completion, yet unrestricted information sharing could lead to substantial losses. We then evaluate current state-of-the-art LLMs on (a) their understanding of contextually private data and (b) their ability to collaborate without violating user privacy. Empirical experiments demonstrate that current models, including GPT-4o and Claude-2.7-Sonnet, lack robust understanding of contextual privacy, misclassifying private data as shareable 25.2% and 43.6% of the time, respectively. In multi-turn conversations, these models disclose private information in 59.9% and 50.5% of cases, respectively, even under explicit privacy instructions. Furthermore, multi-agent systems fail to complete tasks in 71% of scenarios. These results underscore that current models are not aligned towards both contextual privacy preservation and collaborative task-solving.
以LLM为主的代理机构的扩散导致越来越多地为时间安排、谈判、资源分配等任务部署机构间协作。在这类系统中,隐私至关重要,因为代理机构往往能够获取需要严格保密的专有工具和域域专用数据库。本文件审查了以LLM为主的代理机构是否表现出对背景隐私的理解。如果得到指示,这些系统在非对立的多点对话中保留了推断时间用户隐私。现有的评估LLM代理机构背景隐私的基准主要是评估单向、低兼容度的隐私任务,而私人信息可以轻易排除。我们首先提出了一个基准——MAGPIE, 包括158个实际生活高访问15个领域的情景。这些情景的设计使得完全排除私人数据会妨碍任务完成,但不受限制的信息共享可能导致重大损失。我们随后评估了当前最先进的LMM(a)对背景隐私数据的理解,以及(b)它们能够在不侵犯用户隐私的情况下进行完整合作。 诚恳的实验表明,当前模式,包括GPT-4o和Claude-2-Sont, 包括158个真实的高层访问情景,甚至缺乏对50个私基隐私、错误地将私人数据转换为25xx
Article 87
Title@2025-06-25 (3): MMSearch-R1: Incentivizing LMMs to Search
Title: MMSearch-R1: Incentivizing LMMs to Search | MMSearch-R1: LMMs zur Suche anregen | MMSearch- R1: 激励使用 LMM 搜索 2506.20670v1 |
Authors (8): Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
鉴于真实世界信息的复杂性和动态性质,在现实世界情景中强力部署大型多式联运模型(LMMM)需要获得外部知识来源,因为现实世界信息的复杂性和动态性质。现有方法,如检索增强的一代(RAG)和迅速设计的搜索代理器依靠僵硬的管道,往往导致低效或过度的搜索行为。我们介绍了MMSearch-R1,这是第一个端到端的强化学习框架,使LMMMS能够在现实世界互联网环境中按需进行多端搜索。我们的框架结合了图像和文本搜索工具,使模型能够解释何时以及如何在基于成果的奖励下援引这些工具,并处以搜索罚款。为了支持培训,我们通过半自动管道收集多式搜索VQA数据集,涵盖不同的视觉和文字知识需求,并用搜索要求和无搜索的样本来调整搜索的组群,这已证明对塑造高效和按需搜索行为至关重要。关于知识密集型和软件搜索VQA的广泛实验显示,我们的模型不仅超越了基于成果的奖励,而且我们不仅在搜索模型中超越了基于搜索的搜索模型,我们还将搜索模型的30AG的进度分析行动基准。
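An outcome-based reward with a search penalty can be sketched in a few lines. The form below is an assumption for illustration only: the abstract does not give the exact formula, and the `outcome_reward` name, the flat penalty, and its default value of 0.1 are all hypothetical.

```python
def outcome_reward(is_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    """Reward a correct final answer, discounted if any search was used.

    Hypothetical form: wrong answers get 0; correct answers get 1, minus a
    flat penalty when the model invoked at least one search tool.
    """
    if not is_correct:
        return 0.0
    return 1.0 - search_penalty if num_search_calls > 0 else 1.0


r_no_search = outcome_reward(True, 0)    # answered from internal knowledge
r_with_search = outcome_reward(True, 2)  # correct, but needed two searches
r_wrong = outcome_reward(False, 3)       # searching does not rescue a wrong answer
```

Under a reward of this shape, a correct answer without search scores highest, so the policy is nudged to search only when its own knowledge is insufficient.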
Article 88
Title@2025-06-25 (3): Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Title: Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs | Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren | 使用认知模型来解释LLMM中的价值权衡 2506.20666v1 |
Authors (7): Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth and maintaining trust, all while still being mindful of another person’s feelings. These value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of these trade-offs in humans by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs’ training dynamics suggest large shifts in utility values early in training, with persistent effects of the choice of base model and pretraining data compared to the feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
管理日常社会状况往往需要调和相互矛盾的目标,例如传达严酷的真理,保持信任,同时仍然注意他人的感情。这些价值权衡是人类决策和语言使用的一个组成部分。然而,目前用于解释LLM中这种动态和多面价值概念的工具有限。在认知科学中,所谓的“认知模型”为人类的这些取舍提供了正式的描述,在选择行动或言论时模拟了演讲者竞相实用功能的权重。在这项工作中,我们使用一种主要的礼貌演讲的认知模型来解释LLMS代表人式交易的取舍的程度。我们用这个透镜来系统地评估价值权衡取舍,在两种包含的模式环境下:前沿黑箱模型中的推理“效果”和RL的开源模型后培训动态。我们的结果突出了在选择一种比社会功用率更高的信息模型,而在数字推理中显示的开放源模型中,我们从LMS的培训动态模型中得出的有关LMMS代表类似人的取人取价的程度。我们运用这个透视价值的大幅变化,在培训方法的早期分析方法中,我们用高调制数据,在培训方法中,我们用高调制方法中,我们用高调取数据。
Article 89
Title@2025-06-25 (3): The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
Title: The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind | Der Decrypto-Benchmark für Multi-Agenten-Reasoning und Theory of Mind | 面向多智能体推理与心智理论的Decrypto基准 2506.20664v1 |
Authors (3): Andrei Lupu, Timon Willi, Jakob Foerster
As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the “mental” states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.
随着大型语言模型(LLMS)获得代理能力,它们将不得不浏览复杂的多试样情景,与人类用户和其他代理人在合作和竞争性环境下互动。这将需要新的推理技能,其中最主要的是思维理论(ToM),或对其他代理人的“心理”状态进行解释的能力。然而,由于现有基准的范围狭窄,数据渗漏、饱和和和缺乏互动性,因此LMS的TOM和其他多试剂能力不甚为人理解,因此,我们提议了基于游戏的多试理基准Decrypto,即基于游戏的推理基准,TOM从认知科学、计算实用学和多剂强化学习中提取灵感。这需要新的推理技能,主要是思维理论理论理论(ToM),或其他代理人的“心理”状态。根据我们的知识,TOM和其他代理人的“精神”能力也是设计互动式TOM实验的第一个平台。我们通过对前沿LMS、坚固性研究以及人类-AI的交叉实验来验证基准设计。我们发现LM游戏能力落后于人类的游戏能力,以及简单的文字组合当前基准基线。我们随后在更精确的推理学的推理学上,我们创建了两个关键的推理到更精确的推理到更精确的推理。我们发现LM到更精确的推理的推理到更精确的推理。我们到了更深的推理。我们到了更深的推理到更深的推理到更深的推理。我们发现,在12进到更精确的推理。
Article 90
Title@2025-06-25 (3): OmniGen2: Exploration to Advanced Multimodal Generation
Title: OmniGen2: Exploration to Advanced Multimodal Generation | OmniGen2: Exploration zur fortgeschrittenen multimodalen Generation | OmniGen 2:探索先进的多式联运 2506.18871v2 |
Authors (22): Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
在这项工作中,我们引入了OmniGen2, 这是一种多用途和开放源码的基因化模型,旨在为多种生成任务提供统一的解决办法,包括文本到图像、图像编辑和文文本生成。与OmniGen2不同的是,OmniGen2为文本和图像模式提供了两种不同的解码路径,使用未共享的参数和分解的图像代谢器。这一设计使OmniGen2无需重新适应VAE投入的现有多式联运理解模型,从而保存原始文本生成能力。为便利培训OmniGen2,我们开发了全面的数据构建管道,包括图像编辑和文文本生成数据。此外,我们引入了一个为图像生成任务设计的反省机制,并基于OmniGen2, 尽管其参数相对较小,但OmniGen2在多个任务基准上取得了竞争性的结果,包括文本到图像和图像编辑。为了进一步评估Conblib生成,也称其为主题驱动的任务,我们引入了名为OniCentral-Creaude、OnCentreal Studueal deal destrual destrual ex、OniGrationalationalteral destrational destrutes.
Article 91
Title@2025-06-25 (3): Memento: Note-Taking for Your Future Self
Title: Memento: Note-Taking for Your Future Self | Memento: Notizen für Ihr zukünftiges Selbst | 纪念:为未来的自我记记记 2506.20642v1 |
Authors (6): Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger, Kilian Q. Weinberger
Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
大型语言模型( LLMS) 擅长于只进行推理的工作, 但当推理必须与检索紧密结合时, 就象多跳问题解答一样。 为了克服这些限制, 我们引入了一个快速战略, 首先将复杂的问题分解成小步, 然后动态地用LLMS构建一个事实数据库, 最后用LLMS将这些事实一起拼凑出来解决问题。 我们用我们称之为Memento的三阶段战略, 如何在不同环境中提升现有快速战略的绩效。 在“ PhantoomWiki” 9步基准上, Memento 在提供所有信息时将思考链(CoT)的性能加倍。 关于 2WikiMultiHopQA、 CoT-RAG 和 Memento 的开放式版本, 将Vanilla CoT-RAG 改进超过 20 F1 百分点, 并超过多霍普 RAG 基线, IRCoT 13 F1 百分以上 。 在具有挑战性的 MuSue 数据集中, Memento 将React 3 F1 百分 。
Article 92
Title@2025-06-25 (3): PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
Title: PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models | PLoP: Präzise LoRA-Platzierung für effiziente Feinsteuerung großer Modelle | PLP: 高效微调大型模型的精确LORA定位 2506.20629v1 |
Authors (3): Soufiane Hayou, Nikhil Ghosh, Bin Yu
Low-Rank Adaptation (LoRA) is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency, for example by setting the learning rate, the rank, and the initialization. Another axis of improvement is the adapter placement strategy: when using LoRA, practitioners usually pick module types to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with inconclusive results: the original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. Through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning, we demonstrate that PLoP consistently outperforms, or in the worst case is competitive with, commonly used placement strategies.
低兰克适应(LORA)是大型模型广泛使用的微调方法,其小记忆足迹使从业人员能够以完全微调成本的一小部分使大型模型适应具体任务。提出了不同的修改建议,以提高其效率,例如,通过设定学习率、等级和初始化。另一个改进轴是适配安置战略:当使用LORA时,从业人员通常选择模块类型与LORA(如Query和Key模块)相适应。很少有作品研究适应者安置问题,且没有得出结论性结果:原始LORA文件建议将适配者放在模块中,而其他作品则建议将其置于MLP模块中。通过直观的理论分析,我们引入了PLP(Precise LoRA Placis),这是一种轻量的方法,允许自动识别模块类型,在LORA适应者应放置的位置,我们先经过培训的模型和微调任务。我们证明,PLOP始终优异,在最坏的情况下,通过监督性微调和强化学习的全面实验,共同使用定位战略。
Article 93
Title@2025-06-25 (3): Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Title: Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models | Recycling the Web: Eine Methode zur Verbesserung der Vorschulung von Daten Qualität und Menge für Sprachmodelle | 网上再循环:提高语文模式培训前数据质量和数量的方法 2506.04689v2 |
Authors (7): Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the “data wall” of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage points, respectively, across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches to generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts has the potential to be a simple and effective approach for scaling pre-training data.
缩放法律预测,随着模型规模和数据规模的增加,大型语言模型的性能会随着模型规模和数据规模的扩大而提高。在实践中,培训前一直依赖大规模网络爬行,使用互联网上迄今公开的几乎所有数据源。然而,这一自然数据库的生长速度与计算供应量的增长速度不同。此外,高质量文本的提供甚至更加有限:数据过滤管道往往会从初始网络废料中消除多达99%的绩效,以达到最新水平。为了解决培训前规模的“数据墙”问题,我们的工作一直在探索如何改造和再循环现有过滤过程中丢弃的数据。我们提议REWIRE, 将网络循环使用虚拟版的Rewret, 以这一方法来丰富低质量文件,这样就可以对培训有用。 数据过滤前的1B、 3B和 7B 测试显示,高质量的原始文本和我们重新编译文的文本将最终转化为1.0、 1.3 和 2.5 变异性指标将分别在22个不同的任务中进行, 培训数据将显示我们精选的精选前数据。
Article 94
Title@2025-06-25 (3): Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
Title: Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm | Modellbearbeitung als zweischneidiges Schwert: Lenker Ethisches Verhalten gegenüber Wohltätigkeit oder Schaden | 模版编辑为双字箭:指导剂道德行为促进福利或伤害 2506.20606v1 |
Authors (9): Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent’s global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
基于大语言模型(LLMS)的代理商在一系列广泛任务中表现出强大的能力。然而,在高目标领域部署LLM的代理商具有巨大的安全和道德风险。这些代理商的不道德行为可直接导致严重的真实世界后果,包括身体伤害和财政损失。为了有效地指导代理商的道德行为,我们把代理商的行为指导设计成一个示范编辑任务,我们称之为行为编辑。示范编辑是一个新兴的研究领域,它不仅能够精确和有效地修改LLMS,同时保持其总体能力。为了系统研究和评估这一方法,我们引入Behavior Bench,这是基于心理道德理论的多层承诺基准。这一基准既支持对代理人行为的评价和编辑,又支持在各种情景中产生严重的现实后果,包括身体伤害和财政损失。我们首先证明,Behavior编辑可以动态地引导代理商在特定情景中进行目标行为。此外,Behavior编辑不仅能够精确和高效地调整当地情况,而且还能够更广泛地改变一个代理商的全球道德调整。我们证明,我们证明,Beviorvioral devior devioral devioral ex ex devidual exal deviews 也可以制的模型可以用来进行新的道德和Breview 。
Article 95
Title@2025-06-25 (3): Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
Title: Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models | Ad-hoc-Konzept Formung in den Game-Codenamen als Mittel zur Bewertung großer Sprachmodelle | 形成游戏代号作为评价大语言模式的一种手段的游戏代码概念 2502.11707v2 |
Authors (3): Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to find out factors affecting their performance. The evaluation reveals details about their strategies, challenging cases, and limitations of LLMs.
本研究利用游戏代码(LLMs)作为评估特定语言和认知技能的大型语言模型的基准工具。 LLMs在游戏的每个侧面玩耍,其中一方产生一个包含几个目标字的线索单词,而另一方则猜测这些目标字。我们设计了各种实验,通过控制单词的选择(抽象词与具体词、模糊词与单词)或对手(在显示单词时程序要更快或更慢)来控制词的选择。最近的商业和开放重量模型是同时比较的,以找出影响其业绩的因素。评估揭示了它们的战略、挑战性案例和LMMs的局限性。
Article 96
Title@2025-06-25 (3): FluoroSAM: A Language-promptable Foundation Model for Flexible X-ray Image Segmentation
Title: FluoroSAM: A Language-promptable Foundation Model for Flexible X-ray Image Segmentation | FluoroSAM: Ein sprachförderndes Foundation-Modell für flexible Röntgenbild-Segmentierung | FluororosAM:灵活X射线图像分割语言快速基础模型 2403.08059v3 |
Authors (8): Benjamin D. Killeen, Liam J. Wang, Blanca Inigo, Han Zhang, Mehran Armand, Russell H. Taylor, Greg Osgood, Mathias Unberath
Language promptable X-ray image segmentation would enable greater flexibility for human-in-the-loop workflows in diagnostic and interventional precision medicine. Prior efforts have contributed task-specific models capable of solving problems within a narrow scope, but expanding to broader use requires additional data, annotations, and training time. Recently, language-aligned foundation models (LFMs) – machine learning models trained on large amounts of highly variable image and text data thus enabling broad applicability – have emerged as promising tools for automated image analysis. Existing foundation models for medical image analysis focus on scenarios and modalities where large, richly annotated datasets are available. However, the X-ray imaging modality features highly variable image appearance and applications, from diagnostic chest X-rays to interventional fluoroscopy, with varying availability of data. To pave the way toward an LFM for comprehensive and language-aligned analysis of arbitrary medical X-ray images, we introduce FluoroSAM, a language-promptable variant of the Segment Anything Model, trained from scratch on 3M synthetic X-ray images from a wide variety of human anatomies, imaging geometries, and viewing angles. These include pseudo-ground truth masks for 128 organ types and 464 tools with associated text descriptions. FluoroSAM is capable of segmenting myriad anatomical structures and tools based on natural language prompts, thanks to the novel incorporation of vector quantization (VQ) of text embeddings in the training process. We demonstrate FluoroSAM’s performance quantitatively on real X-ray images and showcase on several applications how FluoroSAM is a key enabler for rich human-machine interaction in the X-ray image acquisition and analysis context. Code is available at https://github.com/arcadelab/fluorosam.
支持语言提示的X射线图像分割将为诊断和介入精准医疗中的人机协同工作流程带来更大的灵活性。以往的工作提出了能够在狭窄范围内解决问题的任务专用模型,但要扩展到更广泛的用途,则需要额外的数据、标注和训练时间。近来,语言对齐基础模型(LFM),即在大量高度多样的图像和文本数据上训练、因而具有广泛适用性的机器学习模型,已成为自动图像分析的有前景的工具。现有的医学图像分析基础模型侧重于拥有大规模、标注丰富的数据集的场景和模态。然而,X射线成像模态的图像外观和应用高度多样,从诊断性胸部X光片到介入性透视,数据的可获得性也各不相同。为了给面向任意医学X射线图像的全面、语言对齐分析的LFM铺平道路,我们提出了FluoroSAM,它是Segment Anything Model的一个可语言提示的变体,在300万张合成X射线图像上从零开始训练,这些图像涵盖多种人体解剖结构、成像几何和视角,并包含128种器官类型和464种工具的伪真值掩码及相应的文本描述。得益于在训练过程中新颖地引入文本嵌入的向量量化(VQ),FluoroSAM能够根据自然语言提示分割各种解剖结构和工具。我们在真实X射线图像上定量展示了FluoroSAM的性能,并通过若干应用展示了FluoroSAM如何成为X射线图像采集和分析中实现丰富人机交互的关键推动因素。代码见 https://github.com/arcadelab/fluorosam。
Article 97
Title@2025-06-25 (3): On the Role of Context in Reading Time Prediction
Title: On the Role of Context in Reading Time Prediction | Zur Rolle des Kontexts bei der Lesezeitvorhersage | 关于在阅读时间预测方面背景作用的 2409.08160v4 |
Authors (4): Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, Ethan Gotlieb Wilcox
We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is a lot smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context has in predicting reading times.
我们对读者如何在实时语言理解中整合背景提出了一个新的观点。 我们的提案基于超正理论, 假设语言单位( 例如一个单词)的处理努力与其内置信息内容的功能是一对等的。 我们首先发现超普里萨只是从多种潜在方法中的一种, 来源预测器可以从语言模型中产生。 另一个是单位与其上下文之间点向的相互信息( PMI), 最终在控制单格频率时产生与超正比力相同的预测力。 此外, PMI 和 suprisal 都与频率相关。 这意味着PMI 和 suprisal 都不包含仅包含上下文信息。 对此, 我们提出一种技术, 我们投影到频率的正反向补补补码上, 产生与频率不相联的新的背景预测器。 我们的实验显示, 当背景在控制单格频率时, 所解释的时间差异的比例要小得多。 此外, PMI 和suprisal 都与频率相关。 这意味着 PMI 和 suprisal 都无法仅包含上下文信息。 对此做出解释性预测, 。 在解释性分析时, 解释性观点中, 显示, 之前的研究可能读取了这个作用。
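Projecting surprisal onto the orthogonal complement of frequency amounts to regressing surprisal on frequency and keeping the residuals. The sketch below shows this for a single nuisance variable with an intercept; the inclusion of the intercept, the `orthogonalize` helper, and the toy numbers are assumptions, and the paper's exact setup may differ.

```python
from statistics import mean


def orthogonalize(predictor, nuisance):
    """Return the part of `predictor` that is uncorrelated with `nuisance`.

    Fits predictor ~ intercept + slope * nuisance by least squares and
    returns the residuals (assumed formulation, single nuisance variable).
    """
    p_bar, n_bar = mean(predictor), mean(nuisance)
    n_centered = [n - n_bar for n in nuisance]
    denom = sum(n * n for n in n_centered)
    if denom == 0:
        raise ValueError("nuisance variable is constant")
    slope = sum((p - p_bar) * n for p, n in zip(predictor, n_centered)) / denom
    return [(p - p_bar) - slope * n for p, n in zip(predictor, n_centered)]


# Hypothetical per-word values: log unigram frequency and in-context surprisal.
log_freq = [-2.0, -4.0, -6.0, -8.0]
surprisal = [3.0, 4.5, 7.0, 8.5]
contextual = orthogonalize(surprisal, log_freq)
```

By construction the residuals have zero sample covariance with frequency, so any variance in reading times they explain cannot be attributed to frequency.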
Article 98
Title@2025-06-25 (3): Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
Title: Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling | Entsperren des In-Context-Lernens für natürliche Datensätze jenseits der Sprachmodellierung | 解锁超出语言建模之外的自然数据集的文中学习 2501.06256v2 |
Authors (8): Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox
Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables the model to perform new tasks conditioning only on the examples provided in the context without updating the model’s weights. While ICL offers fast adaptation across natural language tasks and domains, its emergence is less straightforward for modalities beyond text. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL for autoregressive models and various modalities by promoting the learning of the needed mechanisms for ICL. We identify exact token repetitions in the training data sequences as an important factor for ICL. Such repetitions further improve stability and reduce transiency in ICL performance. Moreover, we emphasise the significance of training task difficulty for the emergence of ICL. Finally, by applying our novel insights on ICL emergence, we unlock ICL capabilities for various visual datasets and a more challenging EEG classification task in a few-shot learning regime.
大型语言模型(LLMS)展览 “ 内文学习 “ ,使该模型能够执行新的任务,仅以上下文中提供的例子为条件,而不更新模型的权重。虽然ICL提供各种自然语言任务和领域的快速适应,但其出现对于文本之外的方式来说并不那么简单。在这项工作中,我们系统地发现LLMS中存在的特性,这些特性支持ICL出现自动递减模式和各种模式,促进学习ICL所需的机制。我们把培训数据序列中确切的象征性重复作为ICL的一个重要因素。这种重复进一步提高了ICL的稳定性并减少了其运行的短暂性。此外,我们强调培训任务困难对ICL的出现的重要性。最后,通过运用我们对ICL的新的洞察力,我们为各种视觉数据集解锁了ICL的能力,并在几门学习制度中增加了EG的分类任务。
Article 99
Title@2025-06-25 (3): When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
Title: When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs | Wenn Leben gibt Ihnen Proben: Die Vorteile der Skalierung Inferenz berechnen für mehrsprachige LLMs | 当生命赋予你样本时:扩大多语种LMM的推论计算的益处 2506.20544v1 |
Authors (5): Ammar Khairi, Daniel D’souza, Ye Shen, Julia Kreutzer, Sara Hooker
Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel and select one of these as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and across languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both the sampling strategy based on temperature variation and the selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods, revealing that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts against proprietary models such as Gemini. At larger scale, Command-A (a 111B model) equipped with our methods shows a +9.0 improvement in win-rates on the same benchmark with just five samples, relative to single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
大语言模型(LLM)的最新进展已将重点转向扩展推理时计算,在不重新训练模型的情况下提升性能。一种常见的做法是并行采样多个输出,并从中选择一个作为最终输出。然而,迄今为止的工作主要集中在英语以及数学和代码等少数领域。相比之下,我们最关注的是能够在开放式任务、可形式化验证的任务以及不同语言之间泛化的技术。在这项工作中,我们研究如何在多语言、多任务设置下,为开放式生成任务稳健地扩展推理时计算。我们的研究结果表明,基于温度变化的采样策略和选择策略都必须加以调整,以适应不同的领域和语言环境。我们评估了现有的选择方法,发现在英语中有效的策略往往无法泛化到其他语言。我们提出了专门针对多语言、多任务推理场景的新型采样和选择策略,并表明它们在各种语言和任务上带来了显著提升。特别是,在m-ArenaHard-v2.0提示上与Gemini等专有模型对比时,我们的采样与选择组合方法使我们的8B模型的胜率平均提高了6.8。在更大规模上,采用我们方法的Command-A(111B模型)在同一基准上仅用五个样本,相对于单样本解码胜率就提高了9.0,以极小的代价取得了可观的提升。我们的结果强调了推理时计算需要采用语言感知和任务感知的方法,旨在让代表性不足的语言也能获得性能改进。
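The common "sample several outputs, then select one" recipe that the abstract starts from can be sketched as generic best-of-N selection. This is not the paper's proposed sampling or selection strategy; the `best_of_n` helper and the toy `generate` and `score` stand-ins for a model and a judge are hypothetical.

```python
from typing import Callable, Sequence


def best_of_n(prompt: str,
              generate: Callable[[str, float], str],
              score: Callable[[str, str], float],
              temperatures: Sequence[float]) -> str:
    """Draw one candidate per temperature and return the highest-scoring one."""
    candidates = [generate(prompt, t) for t in temperatures]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_generate(prompt: str, temperature: float) -> str:
    return f"{prompt} answer@{temperature:.1f}"


def toy_score(prompt: str, candidate: str) -> float:
    # Pretend the judge prefers the candidate sampled near temperature 0.7.
    return -abs(float(candidate.rsplit("@", 1)[1]) - 0.7)


best = best_of_n("Translate: hello", toy_generate, toy_score, [0.3, 0.7, 1.0])
```

In a multilingual setting, both the temperature schedule and the scoring function would need to be chosen per language and task, which is the adaptation the abstract argues for.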
Article 100
Title@2025-06-25 (3): Attention with Trained Embeddings Provably Selects Important Tokens
Title: Attention with Trained Embeddings Provably Selects Important Tokens | Aufmerksamkeit bei trainierten Einbettungen wählt wahrscheinlich wichtige Token aus | 与经过训练的嵌入器的关注 2505.17282v3 |
Authors (4): Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
Token 嵌入在语言建模中起着关键作用,但尽管具有这种实际意义,人们对其理论理解仍然有限。本文通过刻画经梯度下降得到的嵌入的结构来弥补这一空白。具体而言,我们考虑一个用于二分类的、带线性头的单层 softmax 注意力模型,即 $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$,其中 $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ 包含输入序列的嵌入,$p$ 是 $\mathrm{\langle cls \rangle}$ token 的嵌入,$v$ 是输出向量。首先,我们证明,仅在使用 logistic 损失进行一步梯度训练之后,嵌入 $E_X$ 就已通过与输出向量 $v$ 对齐来刻画 token 在数据集中的重要性,对齐程度与相应 token 在数据集中出现的频率成正比。然后,在通过梯度流将 $p$ 训练至收敛后,softmax 会选出句子中的重要 token(即对标签具有预测性的 token),所得的 $\mathrm{\langle cls \rangle}$ 嵌入使这种选择的间隔最大化。在真实数据集(IMDB、Yelp)上的实验呈现出与我们的理论所揭示的现象相近的规律。
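The model output in the formula above can be evaluated directly. The following is a small pure-Python sketch of that one formula, with `E_X` given as a list of per-token embedding rows; the function names are illustrative and nothing here reproduces the paper's training procedure.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_output(
    E_X: Sequence[Sequence[float]],
    p: Sequence[float],
    v: Sequence[float],
) -> float:
    """Compute Softmax(p^T E_X^T) E_X v for one input sequence.

    E_X holds one embedding row per token, p is the <cls> embedding,
    and v is the output vector of the linear head.
    """
    logits = [dot(e, p) for e in E_X]  # p^T E_{x_i} for each token i
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    # Weighted average of the per-token scores E_{x_i}^T v.
    return sum(w * dot(e, v) for w, e in zip(weights, E_X)) / total
```

With `p` set to zero the softmax is uniform and the output is the plain average of the per-token scores; a `p` that aligns strongly with one token's embedding makes the output approach that token's score, which is the "selection" behaviour the abstract describes.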
Article 101
Title@2025-06-25 (3): Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
Title: Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers | Zunge vom Gedanken trennen: Aktivierung Patching offenbart sprach-agnostische Konzeptdarstellungen in Transformern | 与思想分离的舌语:在变形器中进行语言-不可知概念代表 2411.08745v4 |
Authors (5): Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word-translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean representation of a concept across different languages does not affect the models’ ability to translate it, but instead improves it. Finally, we generalize to multi-token generation and demonstrate that the model can generate natural language descriptions of those mean representations. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
多语种建模的一个中心问题是,大型语言模型(LLMS)是否发展出一个通用的概念代表,与特定语言脱钩。在本文中,我们通过分析基于变压器的LMS的单词翻译任务中的潜在代表(litents)来解决这一问题。我们从战略上从源翻译中提取潜值,在目标翻译速度上将其插入前传。我们这样做,发现产出语言在比要翻译的概念更早的一层潜伏中被编码。我们根据这一洞察力进行两项关键实验。首先,我们证明我们可以通过单独启动补补丁来改变概念,而不改变语言,反之亦然。第二,我们表明,与不同语言概念的平均代表并不影响模型的翻译能力,而是加以改进。最后,我们概括了多语种的生成,并表明模型可以产生这些暗义表述的自然语言描述。我们的结果为在被调查模型中存在语言-不可知的概念代表提供了证据。
Article 102
Title@2025-06-25 (3): Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Title: Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards | Asymmetrisches REINFORCE für Off-Policy-Verstärkung-Lernen: Ausgleich positiver und negativer Belohnungen | 非政策加强学习的不对称REINFORCE对非政策加强学习的影响:平衡正与负的奖励 2506.20520v1 |
Authors (6): Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
强化学习(RL)越来越多地用于对齐大型语言模型(LLMs)。与同策略(on-policy)技术相比,异策略(off-policy)方法实现更简单、数据效率更高,但往往导致次优性能。在这项工作中,我们通过分析一种简单的异策略 REINFORCE 算法,研究介于异策略 RL 与监督微调之间的一系列算法,其中优势定义为 $A=r-V$,$r$ 为奖励,$V$ 为可调基线。直观地说,降低 $V$ 会突出高奖励样本,而提高 $V$ 则会更重地惩罚低奖励样本。我们首先对这种异策略 REINFORCE 算法进行理论分析,表明当基线 $V$ 是期望奖励的下界时,该算法具有策略改进保证。我们的分析表明,同策略更新可以安全地同时利用正、负信号,而异策略更新则受益于更多地关注正奖励而非负奖励。我们在受控的随机多臂老虎机环境中,并通过在推理任务上微调最先进的 LLM,以实验验证了我们的结论。
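The role of the baseline $V$ in the advantage $A=r-V$ can be illustrated with a toy surrogate objective. This is a sketch under assumptions: the function name is hypothetical, the objective is the plain REINFORCE form (mean of advantage times log-probability, to be maximized) evaluated on samples from some behaviour policy, and the abstract does not say whether the authors add importance weights or other corrections.

```python
from typing import Sequence


def reinforce_surrogate(
    log_probs: Sequence[float],
    rewards: Sequence[float],
    baseline: float,
) -> float:
    """Mean of (r - V) * log pi(y) over a batch of sampled outputs.

    Maximizing this raises the log-probability of samples with r > V
    and lowers it for samples with r < V.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and aligned")
    advantages = [r - baseline for r in rewards]  # A = r - V
    return sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
```

With `baseline=0.0` and rewards in {0, 1}, only the rewarded sample contributes, which behaves like supervised fine-tuning on successes; raising the baseline to 0.5 makes the unrewarded sample contribute a push-down term of equal magnitude.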
Article 103
Title@2025-06-25 (3): OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Title: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling | OctoThinker: Mittleres Training fördert verstärktes Lernen Scaling | OctoThinker: 中级培训鼓励加强学习 2506.20512v1 |
Authors (4): Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
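Llama 和 Qwen 等不同的基础语言模型家族在使用强化学习(RL)进行后训练时表现出不同的行为,在推理密集型任务上尤其如此。是什么使基础语言模型适合强化学习?深入了解这一问题对于开发下一代可随 RL 扩展的基础模型至关重要。在这项工作中,我们研究中期训练(mid-training)策略如何影响 RL 动态,重点关注两个有代表性的模型家族:Qwen 和 Llama。我们的研究表明:(1) MegaMath-Web-Pro 等高质量数学语料显著提升基础模型和 RL 的性能,而现有的替代语料(例如 FineMath-4plus)则做不到;(2) 进一步加入问答(QA)式数据,尤其是长思维链(CoT)推理样例,可以改善 RL 结果,而指令数据会进一步释放这一效果;(3) 长 CoT 虽然提高了推理深度,但也可能导致模型回答冗长以及 RL 训练不稳定,这凸显了数据格式的重要性;(4) 扩大中期训练规模会持续带来更强的下游 RL 性能。基于这些发现,我们提出了两阶段中期训练策略 Stable-then-Decay:先以恒定学习率在 200B token 上训练基础模型,再在三个侧重 CoT 的分支上以学习率衰减训练 20B token。由此得到 OctoThinker 模型家族,它表现出很强的 RL 兼容性,并缩小了与 Qwen 等对 RL 更友好的模型家族之间的性能差距。我们希望这项工作有助于为 RL 时代的基础模型制定预训练策略。为支持进一步研究,我们发布了开源模型以及一个精选的、超过 700 亿 token 的数学推理密集型语料库(即 MegaMath-Web-Pro-Max)。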
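The shape of a Stable-then-Decay schedule (constant learning rate for 200B tokens, then decay over 20B tokens) can be sketched as below. The token budgets come from the abstract; the cosine decay form, the peak value `3e-4`, the final value `0.0`, and the function name are assumptions made purely for illustration.

```python
import math


def stable_then_decay_lr(
    tokens_seen: float,
    peak_lr: float = 3e-4,        # assumed value, not from the abstract
    stable_tokens: float = 200e9,  # constant-rate stage (from the abstract)
    decay_tokens: float = 20e9,    # decay stage (from the abstract)
    final_lr: float = 0.0,         # assumed value, not from the abstract
) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= stable_tokens:
        # Stage 1: hold the learning rate constant.
        return peak_lr
    # Stage 2: fraction of the decay stage completed, clipped to [0, 1].
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    # Cosine interpolation from peak_lr down to final_lr (assumed decay form).
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

Swapping the last line for a linear or exponential interpolation changes only the decay shape; the two-stage structure stays the same.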
Article 104
Title@2025-06-25 (3): ReCode: Updating Code API Knowledge with Reinforcement Learning
Title: ReCode: Updating Code API Knowledge with Reinforcement Learning | ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen | ReCode:更新法规API知识与强化学习 2506.20495v1 |
Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
大型语言模型(LLMs)具有出色的代码生成能力,但在适应外部库 API 的频繁更新时却力不从心。这一关键局限源于模型依赖训练数据中过时的 API 知识,即使能够查阅最新文档也是如此,从而妨碍了在动态环境中可靠地生成代码。为了解决这一问题,我们提出 ReCode(rule-based Reinforcement learning for Code Update),这是一个模仿人类程序员适应 API 变化的新框架。具体而言,我们构建了一个约 2,000 条数据的数据集,用于训练 LLM 根据更新后的信息执行版本迁移。然后,我们引入一种修改后的字符串相似度指标用于代码评估,作为强化学习的奖励。我们的实验表明,ReCode 大幅提升了 LLM 在动态 API 场景中的代码生成性能,在未见过的 CodeUpdateArena 任务上尤为明显。重要的是,与监督微调相比,ReCode 对 LLM 通用代码生成能力的影响更小。我们将 ReCode 应用于多种 LLM 和强化学习算法(GRPO 和 DAPO),均取得了一致的改进。值得注意的是,经过训练后,Qwen2.5-Coder-7B 的表现超过了 32B 参数的代码指令微调模型以及相同架构的推理模型。代码见 https://github.com/zjunlp/ReCode。
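A string-similarity reward for generated code can be sketched with the standard library. This is an illustration only: the abstract says ReCode uses a *modified* string similarity metric but does not describe the modification, so the whitespace normalization and the plain `difflib` ratio below are stand-ins, and the function names are hypothetical.

```python
import difflib


def normalize(code: str) -> str:
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(code.split())


def similarity_reward(generated: str, reference: str) -> float:
    """Reward in [0, 1]: similarity between generated and reference code."""
    matcher = difflib.SequenceMatcher(None, normalize(generated), normalize(reference))
    return matcher.ratio()
```

A reward of this kind is cheap and rule-based, so it can be computed for every rollout without executing the generated code; the trade-off is that it only measures surface similarity to the reference migration.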
Article 105
Title@2025-06-25 (3): Counterfactual Influence as a Distributional Quantity
Title: Counterfactual Influence as a Distributional Quantity | Gegenfaktischer Einfluss als Verteilungsmenge | 分发量的反事实影响 2506.20481v1 |
Authors (4): Matthieu Meeus, Igor Shilov, Georgios Kaissis, Yves-Alexandre de Montjoye
Machine learning models are known to memorize samples from their training data, raising concerns around privacy and generalization. Counterfactual self-influence is a popular metric to study memorization, quantifying how the model’s prediction for a sample changes depending on the sample’s inclusion in the training dataset. However, recent work has shown memorization to be affected by factors beyond self-influence, with other training samples, in particular (near-)duplicates, having a large impact. We here study memorization treating counterfactual influence as a distributional quantity, taking into account how all training samples influence how a sample is memorized. For a small language model, we compute the full influence distribution of training samples on each other and analyze its properties. We find that solely looking at self-influence can severely underestimate tangible risks associated with memorization: the presence of (near-)duplicates seriously reduces self-influence, while we find these samples to be (near-)extractable. We observe similar patterns for image classification, where simply looking at the influence distributions reveals the presence of near-duplicates in CIFAR-10. Our findings highlight that memorization stems from complex interactions across training data and is better captured by the full influence distribution than by self-influence alone.
已知机器学习模型会记忆其训练数据中的样本,这引发了对隐私和泛化的担忧。反事实自影响(counterfactual self-influence)是研究记忆的常用指标,它量化模型对某个样本的预测如何随该样本是否包含在训练数据集中而变化。然而,最近的工作表明,记忆还受到自影响之外的因素影响,其他训练样本,尤其是(近似)重复样本,会产生很大影响。在这里,我们把反事实影响视为一个分布量来研究记忆,考虑所有训练样本如何影响某个样本被记忆的方式。对于一个小型语言模型,我们计算训练样本彼此之间的完整影响分布并分析其性质。我们发现,只看自影响会严重低估与记忆相关的实际风险:(近似)重复样本的存在会大幅降低自影响,而我们发现这些样本是(近似)可提取的。我们在图像分类中观察到类似的模式:只需查看影响分布,就能发现 CIFAR-10 中存在近似重复样本。我们的发现强调,记忆源于训练数据之间的复杂相互作用,完整的影响分布比单独的自影响更能刻画它。
Article 106
Title@2025-06-25 (3): GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
Title: GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching | GPTailor: Large Language Model Pruning durch Layer Cutting und Stitching | GPTAilor: 大语言模型 穿过层切切和缝合 2506.20480v1 |
Authors (6): Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model’s abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3\% of the original performance while removing $\sim25\%$ of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
大型语言模型(LLMs)在语言理解和生成方面表现出了非凡的能力。然而,这种令人印象深刻的能力通常伴随着庞大的模型规模,给部署和推理带来重大挑战。虽然对模型参数进行结构化剪枝为降低部署时的计算成本提供了一条有希望的途径,但目前的方法主要侧重于单个模型的剪枝。在这项工作中,我们提出一种新的模型压缩策略:有策略地组合或合并来自不同微调模型变体的层,通过聚合在不同微调版本中得到强化的能力来保留原始模型的能力。我们将这些 LLM 的最优裁剪表述为一个零阶优化问题,采用支持三种操作的搜索空间:(1) 层移除,(2) 从不同候选模型中选择层,(3) 层合并。我们的实验表明,这种方法能得到有竞争力的模型剪枝结果,例如,对于 Llama2-13B 模型家族,我们的压缩模型在移除约 25% 参数的同时保持了约 97.3% 的原始性能,显著优于以往最先进的方法。代码见 https://github.com/Guinan-Su/auto-merge-llm。
Article 107
Title@2025-06-25 (3): Graph Linearization Methods for Reasoning on Graphs with Large Language Models
Title: Graph Linearization Methods for Reasoning on Graphs with Large Language Models | Graphische Linearisierungsmethoden zur Begründung von Graphen mit großen Sprachmodellen | 用于解释大语言模型图表的线性线性方法 2410.19494v3 |
Authors (9): Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis
Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph reasoning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term “graph linearization”, so that LLMs can handle graphs naturally. We consider that graphs should be linearized meaningfully to reflect certain properties of natural language text, such as local dependency and global alignment, to help contemporary LLMs, trained on trillions of textual tokens, better understand graphs. To achieve this, we developed several graph linearization methods based on graph centrality and degeneracy. These methods are further enhanced using node relabeling techniques. The experimental results demonstrate the effectiveness of our methods compared to the random linearization baseline. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multimodal processing using a unified transformer model.
大型语言模型已经演变为处理文本以外的多种模式,例如图像和音频,这促使我们探索如何有效地利用图像和音频来进行图形推理任务。因此,关键问题是如何将图形转换成线性象征序列,我们称之为“线性线性化”,这样LLMS就能自然地处理图形。我们认为,图表应该有意义地线化,以反映自然语言文本的某些特性,例如当地依赖性和全球对齐,以便方便当代LMS, 接受数万亿个文本符号的培训, 更好地理解图表。为了实现这一目标,我们开发了几种基于图形中心性和退化性的图形线性化方法。这些方法利用节点再标签技术得到进一步加强。实验结果显示了我们的方法相对于随机线性基准的有效性。我们的工作引入了适合LMS的新型图形表达方式,有助于将图形机学习与使用统一的变压模型进行多式联运的趋势结合起来。
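One way to picture centrality-based linearization with node relabeling is the sketch below. It is an illustration under assumptions: degree centrality stands in for whichever centrality and degeneracy orderings the authors use, the output text format is invented, and `linearize_by_degree` is a hypothetical name.

```python
from typing import Hashable, Sequence, Tuple


def linearize_by_degree(edges: Sequence[Tuple[Hashable, Hashable]]) -> str:
    """Turn an undirected edge list into a text sequence ordered by node degree."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Rank nodes: highest degree first, ties broken by name for determinism.
    ranked = sorted(degree, key=lambda n: (-degree[n], str(n)))
    # Relabel nodes with integers that follow the ranking (0 = most central).
    relabel = {node: index for index, node in enumerate(ranked)}
    relabeled = sorted(
        (min(relabel[u], relabel[v]), max(relabel[u], relabel[v])) for u, v in edges
    )
    # Emit edges so that those touching the most central nodes come first.
    return ", ".join(f"({a}, {b})" for a, b in relabeled)
```

For a star graph with hub `b`, the hub is relabeled `0` and every edge is listed starting from it, so the token sequence places the structurally most important node first and keeps related edges adjacent.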
Article 108
Title@2025-06-25 (3): Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
Title: Knowledge-Aware Diverse Reranking for Cross-Source Question Answering | Knowledge-Aware Diverse Reranking für Cross-Source-Frageantworten | 跨源问题解答知识-知识-知识-知识-知识多样化重新排名 2506.20476v1 |
Authors (1): Tong Zhou
This paper presents Team Marikarp’s solution for the SIGIR 2025 LiveRAG competition. The competition’s evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M-document subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG pipeline achieved first place in the competition.
本文介绍了Marikarp团队对SIGIR 2025 LiveRAG竞争的解决方案,由互联网公司DataMorgana自动生成的竞争评价组包括了广泛的目标主题、问题类型、问题表述、受众类型和知识组织方法,对从《精密Web process》的15M文件子集检索与问题有关的辅助文件进行了公平评价,我们提议的了解知识的多样化RAG编审渠道在竞争中名列第一。
Article 109
Title@2025-06-25 (3): Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
Title: Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations | Zeit ist auf meiner Seite: Dynamik des Gesprächs-Zeit-Sharing in Video-Chat-Gesprächen | 时间就在我身边:视频聊天中的谈话时间分享动态 2506.20474v1 |
Authors (3): Kaixiang Zhang, Justine Zhang, Cristian Danescu-Niculescu-Mizil
An intrinsic aspect of every conversation is the way talk-time is shared between multiple speakers. Conversations can be balanced, with each speaker claiming a similar amount of talk-time, or imbalanced when one talks disproportionately. Such overall distributions are the consequence of continuous negotiations between the speakers throughout the conversation: who should be talking at every point in time, and for how long? In this work we introduce a computational framework for quantifying both the conversation-level distribution of talk-time between speakers, as well as the lower-level dynamics that lead to it. We derive a typology of talk-time sharing dynamics structured by several intuitive axes of variation. By applying this framework to a large dataset of video-chats between strangers, we confirm that, perhaps unsurprisingly, different conversation-level distributions of talk-time are perceived differently by speakers, with balanced conversations being preferred over imbalanced ones, especially by those who end up talking less. Then we reveal that – even when they lead to the same level of overall balance – different types of talk-time sharing dynamics are perceived differently by the participants, highlighting the relevance of our newly introduced typology. Finally, we discuss how our framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication.
每一个对话的内在方面是多个发言者之间分享谈话时间的方式。 对话可以平衡, 每个发言者都声称有类似数量的谈话时间, 或者在一次谈话时不平衡。 这种总体分布是整个对话期间发言者之间不断谈判的结果: 谁应该在每个时间和多长时间发言? 在这项工作中,我们引入一个计算框架,以量化发言者之间谈话时间的谈话级别分配,以及导致对话时间的较低层次的动态。 我们从几个直觉的变异轴所构建的谈话时间共享动态中得出了一种类型。 通过将这一框架应用于陌生人之间大量视频聊天的数据集,我们确认,也许并不令人惊讶的是,发言者对不同对话时间的谈话级别分配有不同的看法,他们偏好于不平衡的谈话时间分配,尤其是那些结束较少交谈的人。 然后,我们揭示,即使它们导致同样的总体平衡,我们也从不同的对话时间共享动态中得出了不同的看法,参与者对新引入的人类通信工具和人类通信工具的关联性。最后,我们讨论了我们如何看待我们新引入的人类通信工具与人类通信结构的关联性。
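The conversation-level distribution of talk-time can be computed from timestamped utterances roughly as follows. This is a sketch under assumptions: the `(speaker, start, end)` input format, the function names, and the use of the dominant speaker's share as an imbalance measure are all illustrative choices, not the paper's definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def talk_time_shares(utterances: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Fraction of total talk-time claimed by each speaker.

    Each utterance is (speaker, start_seconds, end_seconds).
    """
    totals = defaultdict(float)
    for speaker, start, end in utterances:
        if end < start:
            raise ValueError("utterance ends before it starts")
        totals[speaker] += end - start
    overall = sum(totals.values())
    if overall == 0:
        raise ValueError("no talk-time recorded")
    return {speaker: spoken / overall for speaker, spoken in totals.items()}


def imbalance(shares: Dict[str, float]) -> float:
    """Share of the dominant speaker; 0.5 is perfectly balanced for two speakers."""
    return max(shares.values())
```

Computing the same shares over a sliding window instead of the whole conversation would give a view of the lower-level dynamics that produce the overall balance.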
Article 110
Title@2025-06-25 (3): Probing AI Safety with Source Code
Title: Probing AI Safety with Source Code | Nachweis der KI-Sicherheit mit Quellcode | 以《源代码》检验AI安全 2506.20471v1 |
Authors (7): Ujwal Narayan, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik Narasimhan, Ameet Deshpande, Vishvak Murahari
Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt “Make the statement more toxic: {text}” to: “make_more_toxic({text})”. We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo’s toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.
大型语言模型(LLMS)已变得无处不在,在众多安全关键应用中与人类互通。 这要求提高能力,但重要的是,还要采取更大的安全措施,使这些模型与人类价值观和喜好相一致。 在这项工作中,我们证明当代模型远未达到AI安全的目标,导致用户的不安全和有害经历。我们引入了一个称为“思想守则”(CoDoT)的快速战略,以评价LMs的安全性。CoDoT将自然语言投入转换为代表相同意图的简单代码。例如,CoDoT将自然语言快速转换为“使声明更加有毒 :{text} ” 的自然语言转换为“ make_more_毒性({text}) ” 。我们显示,CoDoT 导致一系列的新型LMS持续失败。例如,GPT-4 Turbo的毒性增加了16.5倍, DeepSeek R1 100%的时间不能达到100%,而毒性在7个现代LMs中平均增加300%。此外, 重新循环应用CDoT可以进一步提高毒性,因为CT公司的关键工作需要。
Article 111
Title@2025-06-25 (3): Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Title: Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen | 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v3 |
Authors (27): Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
科学发现日益依赖基于信息密集型科学数据和领域专门知识的复杂多模态推理。借助专家级科学基准,科学多模态大语言模型(MLLMs)有望在真实工作流程中显著促进这一发现过程。然而,目前的科学基准大多侧重于评估 MLLM 的知识理解能力,导致对其感知和推理能力的评估不足。为弥补这一空白,我们提出了 Scientists’ First Exam(SFE)基准,旨在通过三个相互关联的层次评估 MLLM 的科学认知能力:科学信号感知、科学属性理解和科学比较推理。具体而言,SFE 包含 830 个经专家验证的 VQA 对,涵盖三种问题类型,覆盖五个高价值学科的 66 项多模态任务。大量实验表明,当前最先进的 GPT-o3 和 InternVL-3 在 SFE 上仅取得 34.08% 和 26.52% 的成绩,说明 MLLM 在科学领域仍有很大的提升空间。我们希望在 SFE 中获得的洞见能够推动 AI 增强的科学发现的进一步发展。
Article 112
Title@2025-06-25 (3): CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Title: CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models | CogniBench: Ein gesetzlich inspirierter Rahmen und Datensatz zur Bewertung der kognitiven Treue großer Sprachmodelle | CogniBench:评估大语言模型认知性信仰的受法律启发的框架和数据集 2505.20767v4 |
Authors (8): Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on “factual statements” that rephrase source materials while overlooking “cognitive statements” that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
忠实性幻觉是指大型语言模型(LLM)生成的、未得到所提供上下文支持的论断。由于缺乏评估标准,现有基准侧重于对源材料进行改述的“事实陈述”,而忽略了需要根据给定上下文进行推断的“认知陈述”。因此,评估和检测认知陈述中的幻觉仍然具有挑战性。受法律领域证据评估方式的启发,我们设计了一个严格的框架来评估认知陈述的不同忠实程度,并引入 CogniBench 数据集,从中揭示了富有洞见的统计结果。为了跟上快速演进的 LLM,我们进一步开发了一条可轻松扩展到不同模型的自动标注流水线。由此得到大规模的 CogniBench-L 数据集,有助于训练针对事实性和认知性幻觉的准确检测器。我们的模型和数据集发布于:https://github.com/FUTUREEEEEE/CogniBench
Article 113
Title@2025-06-25 (3): Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
Title: Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | Auf dem Weg zur vollen Nutzung der LLM-internen Staaten zur Verbesserung der Wissensgrenzenwahrnehmung | 争取充分利用LLM 内部各国加强知识边界认识 2502.11677v2 |
Authors (6): Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, Xueqi Cheng
Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs’ internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Confidence Consistency-based Calibration ($C^3$), which assesses confidence consistency through question reformulation. $C^3$ significantly improves LLMs’ ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while $C^3$ effectively controls output risks, advancing the reliability of LLMs in practical applications.
大型语言模型(LLMS)在各种任务中表现出令人印象深刻的成绩,但往往在努力精确地测量其知识界限,从而导致有信心但不正确的反应。本文件探讨利用LLMS的内部国家从效率和风险角度来提高他们对知识界限的认识。我们调查LLMS是否能够在产生反应前利用内部国家来估计其信任,可能节省计算资源。我们在自然问题、热点QA和MMMLU等数据集的实验显示,LLMS展示了重要的生成前认识,这种认识正在进一步得到完善,在不同的条件下,认知差距仍然保持稳定。为了降低关键领域的风险,我们采用了基于信任的校准(C3美元),通过重新拟订问题来评估信任一致性。3美元大大提高了LLMs认识其知识差距的能力,提高了NQ5.6%和HotpotQA4.9%的未知的认知率。我们的研究结果表明,创造信任前估计可以优化效率,同时以3美元有效控制产出风险,提高LMs在实际应用中的可靠性。
Article 114
Title@2025-06-25 (3): An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Title: An Agentic System for Rare Disease Diagnosis with Traceable Reasoning | Ein Agentisches System für die Diagnose seltener Krankheiten mit rückverfolgbarer Begründung | 利用可追踪理由进行罕见疾病诊断的制剂系统 2506.20430v1 |
Authors (12): Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie
Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1,013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, such as traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser’s 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts achieves 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application http://raredx.cn/doctor.
罕见病在全球共影响超过 3 亿人,但及时、准确的诊断仍是一个普遍的挑战。这主要是由于罕见病的临床异质性、个体患病率低,以及大多数临床医生对罕见病的了解有限。在这里,我们提出 DeepRare,这是首个由大型语言模型(LLM)驱动的罕见病诊断智能体系统,能够处理异构的临床输入。该系统为罕见病生成排序的诊断假设,每个假设都附有透明的推理链,将中间分析步骤与可核查的医学证据联系起来。DeepRare 包含三个关键组成部分:带有长期记忆模块的中央主机;负责特定领域分析任务的专用智能体服务器,集成了 40 多种专门工具以及网络规模的最新医学知识来源,确保能够获取最新的临床信息。这种模块化、可扩展的设计在保持可追溯性和适应性的同时支持复杂的诊断推理。我们在八个数据集上评估了 DeepRare。该系统在 2,919 种疾病上表现出卓越的诊断性能,对其中 1,013 种疾病达到 100% 的准确率。在基于 HPO 的评估中,DeepRare 显著优于其他 15 种方法,包括传统的生物信息学诊断工具、LLM 和其他智能体系统,平均 Recall@1 达到 57.18%,比次优方法(推理型 LLM)高出 23.79 个百分点。在多模态输入场景中,DeepRare 在 109 个病例上的 Recall@1 达到 70.60%,而 Exomiser 为 53.20%。临床专家对推理链的人工核查达到 95.40% 的一致率。此外,DeepRare 系统已实现为用户友好的 Web 应用:http://raredx.cn/doctor。
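The Recall@1 figures quoted in this abstract refer to how often the correct diagnosis is ranked first. A minimal sketch of the metric, assuming one gold diagnosis per case and a ranked list of predictions per case (the function name and input layout are illustrative):

```python
from typing import Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of cases whose gold label appears in the top-k predictions."""
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    hits = sum(
        1
        for predictions, gold in zip(ranked_predictions, gold_labels)
        if gold in predictions[:k]
    )
    return hits / len(gold_labels)
```

Raising `k` can only keep the score equal or increase it, which is why Recall@1 is the strictest of the Recall@k family.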
Article 115
Title@2025-06-25 (3): SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities
Title: SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities | SMAR: Soft Modality-Aware Routing-Strategie für MoE-basierte multimodale große Sprachmodelle Erhaltung von Sprachfähigkeiten | SMAR: 以教育部为基础的维护语言能力的多模式多语言模型软式模式-软件运行战略 2506.06406v2 |
Authors (7): Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang
Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods for building multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying the model architecture or relying heavily on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution for balancing modality differentiation and language capabilities in multimodal MoE models.
专家混合结构已成为扩大大型语言模式的关键方法,人们越来越希望将其推广到多式联运任务中; 现有的建立多式联运模式模式的方法,在调整预先培训的模式时,要么引起高培训费用,要么会因语言能力退化而受到影响; 为了解决这一问题,我们提议采用软式模式软件软件软件软件(SMAR),这是一种新颖的正规化技术,使用Kullback Leibel差异来控制不同模式的路线概率分布,鼓励专家专业化而不修改模式结构或严重依赖文本数据。 视觉指导调整实验显示,SMAR将语言能力保持在86.6%的保留率,只有2.5%的纯文本,超过基准,同时保持强劲的多式联运绩效。 我们的方法为在多式联运模式中平衡模式差异和语言能力提供了实用而有效的解决方案。
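The quantity that SMAR regularizes, a Kullback-Leibler divergence between routing distributions of different modalities, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the averaging over tokens, and the toy router logits are hypothetical, and the abstract does not say how the divergence enters the training loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_routing_distribution(router_logits):
    """Average per-token expert probabilities into one distribution."""
    probs = [softmax(row) for row in router_logits]
    n_experts = len(probs[0])
    return [sum(p[e] for p in probs) / len(probs) for e in range(n_experts)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy router logits (assumed): 2 text tokens and 2 image tokens, 4 experts.
text_logits = [[2.0, 0.0, 0.0, 0.0], [1.5, 0.5, 0.0, 0.0]]
image_logits = [[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 1.0, 1.5]]

p_text = mean_routing_distribution(text_logits)
p_image = mean_routing_distribution(image_logits)
modality_kl = kl_divergence(p_text, p_image)
print(round(modality_kl, 3))
```

A larger value means text and image tokens are routed to more distinct experts; a regularizer could push this value toward a chosen target, which is one plausible reading of "soft" modality-aware routing.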
Article 116
Title@2025-06-25 (3): Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
Title: Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content | Biomed-angereichert: Ein biomedizinischer Datensatz mit LLMs für die Vorschulung und Extraktion seltener und versteckter Inhalte | 生物医学富含生物医学:生物医学数据集,配有预培训和提取稀有和隐藏内容的LLMMs 2506.20331v1 |
Authors (3): Rian Touchent, Nathan Godey, Eric de la Clergerie
We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
我们提出Biomed-Enriched,这是一个通过两阶段标注流程从PubMed构建的生物医学文本数据集。在第一阶段,一个大型语言模型对来自PubMed科学文章的40万个段落进行标注,为其类型(综述、研究、临床病例、其他)、领域(临床、生物医学、其他)和教育质量打分。教育质量分数(1至5分)用于估计一个段落对大学水平学习的有用程度。这些标注随后被用来微调一个小型语言模型,由其将标签传播到整个PMC-OA语料库。由此得到的元数据使我们能够提取精细的子集,包括200万个临床病例段落,其中超过45万个高质量段落来自具有商业使用许可的文章,并可通过质量过滤和领域上采样构建多个变体。由于隐私限制,医院记录无法公开共享,临床文本通常难以获取。因此,我们的数据集提供了另一种大规模、公开可用的PubMed临床病例集合,是生物医学和临床NLP的宝贵资源。使用OLMo2进行的初步持续预训练实验表明,这些精选子集能够带来有针对性的改进:临床上采样使MMLU ProfMed的性能提高约5%,教育质量过滤使MedQA和MedMCQA提高约1%。这些技术的组合带来了更快的收敛,仅用三分之一的训练词元即可达到相同性能,显示出更高效、更有效的生物医学预训练策略的潜力。
Article 117
Title@2025-06-25 (3): From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents
Title: From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents | Von der Kodikologie zum Code: Eine vergleichende Studie von Transformer und YOLO-basierten Detektoren zur Layoutanalyse in historischen Dokumenten | 从术语到代码:用于历史文件布局分析的变异器和以YOLO为基地的YOLO探测器比较研究 2506.20326v1 |
Authors (1): Sergio Torres Aguilar
Robust Document Layout Analysis (DLA) is critical for the automated processing and understanding of historical documents with complex page organizations. This paper benchmarks five state-of-the-art object detection architectures on three annotated datasets representing a spectrum of codicological complexity: e-NDP, a corpus of Parisian medieval registers (1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval and modern sources (ca. 12th-17th centuries); and HORAE, a corpus of decorated books of hours (ca. 13th-16th centuries). We evaluate two Transformer-based models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and YOLO-World). Our findings reveal significant performance variations dependent on model architecture, dataset characteristics, and bounding box representation. On the e-NDP dataset, Co-DETR achieves state-of-the-art results (0.752 mAP@.50:.95), closely followed by YOLOv11x-OBB (0.721). Conversely, on the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB significantly outperforms all other models (0.564 and 0.568, respectively). This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB) is not a minor refinement but a fundamental requirement for accurately modeling the non-Cartesian nature of historical manuscripts. We conclude that a key trade-off exists between the global context awareness of Transformers, ideal for structured layouts, and the superior generalization of CNN-OBB models for visually diverse and complex documents.
稳健的文档版面分析(DLA)对于自动处理和理解页面组织复杂的历史文献至关重要。本文在三个代表不同抄本学复杂程度的标注数据集上对五种最先进的目标检测架构进行了基准测试:e-NDP,巴黎中世纪登记册语料库(1326-1504年);CATMuS,源自各种中世纪和近代文献的多样化多类别数据集(约12至17世纪);以及HORAE,带装饰的时祷书语料库(约13至16世纪)。我们将两个基于Transformer的模型(Co-DETR、Grounding DINO)与三个YOLO变体(AABB、OBB和YOLO-World)进行了比较。结果显示,性能因模型架构、数据集特征和边界框表示方式的不同而有显著差异。在e-NDP数据集上,Co-DETR取得了最先进的结果(0.752 mAP@.50:.95),YOLOv11x-OBB紧随其后(0.721)。相反,在更复杂的CATMuS和HORAE数据集上,基于CNN的YOLOv11x-OBB显著优于所有其他模型(分别为0.564和0.568)。本研究明确表明,使用有向边界框(OBB)并非细微的改进,而是准确建模历史手稿非笛卡尔特性的基本要求。我们的结论是,Transformer的全局上下文感知能力(适合结构化版面)与CNN-OBB模型对视觉上多样而复杂的文档的更强泛化能力之间存在关键的权衡。
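The mAP@.50:.95 notation in the abstract refers to averaging over Intersection-over-Union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU computation for axis-aligned boxes and the threshold sweep for a single prediction; it is not a full mAP implementation (no precision-recall curves, no classes), and oriented boxes would need polygon intersection instead. The function names and box coordinates are illustrative assumptions.

```python
def iou_aabb(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_matched(pred_box, gt_box):
    """Count how many IoU thresholds this prediction would satisfy."""
    iou = iou_aabb(pred_box, gt_box)
    return sum(1 for t in IOU_THRESHOLDS if iou >= t)


ground_truth = (0, 0, 10, 10)
prediction = (1, 1, 10, 10)

print(iou_aabb(prediction, ground_truth))            # 0.81
print(thresholds_matched(prediction, ground_truth))  # 7
```

A prediction that counts as correct at IoU 0.50 can still fail at stricter thresholds, which is why tight box geometry matters so much for this metric.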
Article 118
Title@2025-06-25 (3): VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback
Title: VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback | VICCA: Visuelle Interpretation und Verständlichkeit von Röntgenanomalien im generierten Bericht ohne menschliches Feedback | VICCA:在没有人类反馈的情况下生成的报告对胸透X射线异常现象的视觉解读和理解 2501.17726v2 |
Authors (3): Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier
As artificial intelligence (AI) becomes increasingly central to healthcare, the demand for explainable and trustworthy models is paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency. This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment. The integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more trustworthy and transparent AI in medical imaging.
由于人工智能(AI)日益成为医疗保健的核心,因此对可解释和可信赖模型的需求至关重要。目前的报告生成系统往往缺乏在没有专家监督的情况下验证输出的机制,这就引起了对可靠性和可解释性的关切。为了应对这些挑战,我们提议了一个新的多式联运框架,目的是提高人工智能生成的医疗报告的语义一致性和本地化准确性。我们的框架整合了两个关键模块:一种表达基础模型,该模型根据文字提示确定CXR图像中的病理,并将其本地化;另一种是文本到图像化模块,该模块从速生成合成 CXR图像,同时保持解剖性对等性。我们通过比较原始图像和生成图像的特征,引入了一种双相色化系统:一个分数量化本地化准确性,而另一个则评估语义一致性。这一方法大大超越了现有方法,在病理本地化和文本到图像上取得了最新的结果。将语系与传播模型整合成合成 CXR图像,同时将稳健的双向性图像化系统提供可靠、可靠、可靠、可靠的医学质量报告。
Article 119
Title@2025-06-25 (3): Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
Title: Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning | Konfuzius3-Math: Leichtes Hochleistungs-LLM für das chinesische K-12 Mathematik-Lernen | 剖析3- 数学: 中国 K-12 数学学习的轻量级高性能推理法LLMLM 2506.18330v2 |
Authors (5): Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report we share our development recipe, the challenges we encounter and the techniques we develop to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize the RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models in a particular domain at low cost. We open-source our model and code at https://github.com/netease-youdao/Confucius3-Math.
我们提出Confucius3-Math,这是一个拥有140亿参数的开源大型语言模型,它(1)可在单块消费级GPU上高效运行;(2)在一系列数学推理任务上取得SOTA性能,优于许多规模大得多的模型。作为我们以AI促进教育和知识传播这一使命的一部分,Confucius3-Math专门致力于面向中国K-12学生和教育工作者的数学学习。Confucius3-Math通过大规模强化学习(RL)后训练构建,与国家课程标准保持一致,擅长以低成本解决主流的中国K-12数学问题。在本报告中,我们分享了开发方法、遇到的挑战以及为克服这些挑战而开发的技术。特别是,我们提出了三项技术创新:目标熵正则化(Targeted Entropy Regularization)、近期样本恢复(Recent Sample Recovery)和策略特定难度加权(Policy-Specific Hardness Weighting)。这些创新包括一种新的熵正则化、一种新颖的数据调度策略和一种改进的组相对优势估计器。它们共同显著稳定了RL训练,提高了数据效率并提升了性能。我们的工作证明了以低成本在特定领域构建强推理模型的可行性。我们在 https://github.com/netease-youdao/Confucius3-Math 开源了模型和代码。
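For context on the "group-relative advantage estimator" mentioned in the abstract, the sketch below shows the common baseline form used in GRPO-style training: rewards for several sampled answers to the same problem are normalized by the group mean and standard deviation. This is an assumption about the starting point only; the paper's improved estimator (Policy-Specific Hardness Weighting) is not described in the abstract and is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and (population) std.

    `rewards` holds the scalar rewards of several sampled answers to the
    same prompt; `eps` guards against a zero std when all rewards agree.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled solutions, two correct (reward 1) and two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every answer in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason hardness-aware weighting schemes are of interest.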
Article 120
Title@2025-06-25 (3): VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Title: VAQUUM: Are Vague Quantifiers Grounded in Visual Data? | VAQUUM: Sind vage Quantifikatoren in visuellen Daten verankert? | VAQUUM: 模糊量化器是否以视觉数据为基础? 2502.11874v3 |
Authors (3): Hugh Mee Wong, Rick Nouwen, Albert Gatt
Vague quantifiers such as “a few” and “many” are influenced by various contextual factors, including the number of objects present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20,300 human ratings on quantified statements across a total of 1,089 images. Using this dataset, we compare human judgments and VLM predictions using three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts in vague quantifier use. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes.
诸如“ 少数” 和“ 许多” 等模糊的限定词受各种背景因素的影响, 包括特定背景下存在的物体数量。 在这项工作中, 我们评估视觉和语言模型( VLMs)在制作或判断视觉环境中模糊的限定词是否合适时, 在多大程度上与人类兼容。 我们发布了一套新颖的数据集, VAQUUM, 包含对总计1089张图像的量化声明的20,300个人类评级。 我们使用这个数据集, 用三种不同的评价方法比较人类的判断和VLM预测。 我们的研究结果显示, VLMs与人类一样, 受模糊量化词使用的对象数的影响。 然而, 我们发现不同评价环境中的不同模型之间有很大的不一致之处, 表明判断和生成模糊的量化符依赖于两种不同的过程。
Article 121
Title@2025-06-25 (3): FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
Title: FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment | FundaQ-8: Ein klinisch inspiriertes Scoring-Framework für automatisierte Fundus-Bildqualitätsbewertung | FundaQ-8:自动基金图像质量评估临床启发的分类框架 2506.20303v1 |
Authors (6): Lee Qi Zun, Oscar Wong Jin Hao, Nor Anita Binti Che Omar, Zalifa Zakiah Binti Asnir, Mohamad Sabri bin Sinal Zainal, Goh Man Fye
Automated fundus image quality assessment (FIQA) remains a challenge due to variations in image acquisition and subjective expert evaluations. We introduce FundaQ-8, a novel expert-validated framework for systematically assessing fundus image quality using eight critical parameters, including field coverage, anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a structured scoring reference, we develop a ResNet18-based regression model to predict continuous quality scores in the 0 to 1 range. The model is trained on 1800 fundus images from real-world clinical sources and Kaggle datasets, using transfer learning, mean squared error optimization, and standardized preprocessing. Validation against the EyeQ dataset and statistical analyses confirm the framework’s reliability and clinical interpretability. Incorporating FundaQ-8 into deep learning models for diabetic retinopathy grading also improves diagnostic robustness, highlighting the value of quality-aware training in real-world screening applications.
由于图像获取和主观专家评价的差异,自动基金图像质量评估(FIQA)仍然是一项挑战。我们引入了FindaQ-8,这是利用八个关键参数,包括实地覆盖面、解剖可见度、照明和图像文物,系统评估基金图像质量的新颖专家验证框架。我们利用FundaQ-8作为结构化评分参考,开发了基于ResNet18的回归模型,以预测0至1范围的连续质量评分。该模型通过转移学习、平均平方错优化和标准化预处理,对来自真实世界临床来源和Kaggle数据集的1800个基金图像进行了培训。对EyeQ数据集和统计分析的验证证实了该框架的可靠性和临床可解释性。将FundaQ-8纳入糖尿病再基因分数的深学习模型中,也提高了诊断性强性,突出了真实世界筛选应用中高质量培训的价值。
Article 122
Title@2025-06-25 (3): Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning
Title: Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning | Ausbalancieren von Wahrhaftigkeit und Aufklärung mit unsicherer Anleitung Feintuning | 平衡真实和知情与不确定性软件指示 2502.11962v3 |
Authors (8): Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness. This trade-off arises because IFT steers LLMs to generate responses containing long-tail knowledge that was not well covered during pre-training. As a result, models become more informative but less accurate when generalizing to unseen tasks. In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs, and we introduce two new IFT paradigms, $UNIT_{cut}$ and $UNIT_{ref}$, to address this issue. $UNIT_{cut}$ identifies and removes unfamiliar knowledge from IFT datasets to mitigate its impact on model truthfulness, whereas $UNIT_{ref}$ trains LLMs to recognize their uncertainty and explicitly indicate it at the end of their responses. Our experiments show that $UNIT_{cut}$ substantially improves LLM truthfulness, while $UNIT_{ref}$ maintains high informativeness and reduces hallucinations by distinguishing between confident and uncertain statements.
教学微调(IFT)可以增加大型语言模型(LLM)的信息性,但可能降低其真实性。这种权衡之所以产生,是因为IFT引导LLMS生成包含培训前未充分涵盖的长尾知识的响应。结果,模型在概括到不可见的任务时,信息性更强,但更不准确。在本文中,我们从经验上表明IFT数据集中的不熟悉知识会如何对LMS的真实性产生消极影响,我们为解决这一问题引入了两种新的IFT模式($UNITcucuter}和$UNITref}。 $UNITcuter}发现并消除IFT数据集的不熟悉知识,以减轻其对模型真实性的影响,而$UNITref}则培训LMs在答复结束时认识到其不确定性并明确表明这一点。我们的实验表明,$UNITcucuter}将大大改善LM的诚实性,而$UNITre}则保持高的信息性,并通过区分自信和不确定的言论减少幻觉。
Article 123
Title@2025-06-25 (3): LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Title: LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems | LR^2Bench: Bewertung der Fähigkeiten großer Sprachmodelle zum langkettigen reflektierenden Schließen anhand von Constraint-Satisfaction-Problemen | LR^2Bench:通过约束满足问题评估大语言模型的长链反思推理能力 2502.17848v4 |
Authors (5): Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation on both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
大型推理模型(LRM)的最新进展显著增强了大型语言模型(LLM)的推理能力,使其能够通过作出假设、回溯和自我修正等反思能力来处理日益复杂的任务。然而,由于缺乏合适的基准,有效评估这类反思能力仍然具有挑战性。为弥补这一差距,我们提出LR$^2$Bench,这是一个旨在评估LLM长链反思推理能力的新基准。LR$^2$Bench包含六类约束满足问题(CSP)共850个样本,其中反思推理对于得出满足所有给定约束的解至关重要。每类任务侧重于不同的约束模式,例如基于知识的约束、逻辑约束和空间约束,从而对多样的问题求解场景进行全面评估。我们对常规LLM和LRM的广泛评估表明,即使是最先进的LRM,如DeepSeek-R1和OpenAI o1-preview,在LR$^2$Bench的任务上也表现不佳,平均精确匹配(Exact Match)得分分别仅为20.0%和23.6%。这些发现凸显了当前LLM的反思推理能力仍有很大的提升空间。
Article 124
Title@2025-06-25 (3): LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
Title: LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs | LADM: Langzeit-Kontext-Trainingsdatenauswahl mit aufmerksamkeitsbasierter Abhängigkeitsmessung für LLMs | LADM: 长期培训数据选择,对LLMM进行以注意为基础的依赖性衡量 2503.02502v2 |
Authors (4): Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de facto method to equip LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
长文本模型在大语言模型领域引起越来越多的注意。 长文本数据的持续培训已成为使长文本模型具备处理长文本投入能力的脱法方法,然而,在衡量长文本培训数据的质量方面,它仍然是一项公开的挑战。为解决这一问题,我们提议了一个长文本数据选择框架,以关注为基础的依赖度测量(LADM),它能够有效地从大规模、多部培训前材料中找出高质量的长文本数据。LADM利用关注机制的检索能力来捕捉环境依赖性,确保全面衡量长文本数据的质量。实验结果表明,我们的LADM框架大大提升了LMS在多项长文本任务上的绩效,只有1B符号用于持续培训。
Article 125
Title@2025-06-25 (3): Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
Title: Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models | Narrative Shift Detection: Ein hybrider Ansatz von dynamischen Themenmodellen und großen Sprachmodellen | 叙述式转移探测:动态专题模型和大语言模型的混合方法 2506.20269v1 |
Authors (6): Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
With rapidly evolving media narratives, it has become increasingly critical not just to extract narratives from a given corpus but also to investigate how they develop over time. While popular narrative extraction methods such as Large Language Models do well in capturing typical narrative elements or even the complex structure of a narrative, applying them to an entire corpus comes with obstacles, such as a high financial or computational cost. We propose a combination of the language understanding capabilities of Large Language Models with the large-scale applicability of topic models to dynamically model narrative shifts across time using the Narrative Policy Framework. We apply a topic model and a corresponding change point detection method to find changes that concern a specific topic of interest. Using this model, we filter our corpus for documents that are particularly representative of that change and feed them into a Large Language Model that interprets the change that happened in an automated fashion and distinguishes between content and narrative shifts. We employ our pipeline on a corpus of The Wall Street Journal newspaper articles from 2009 to 2023. Our findings indicate that a Large Language Model can efficiently extract a narrative shift if one exists at a given point in time, but does not perform as well when having to decide whether a shift in content or a narrative shift took place.
随着媒体叙事的迅速演变,不仅从某一文体中提取叙事,而且调查它们如何发展,就变得越来越重要了。虽然大语言模型等流行的叙事提取方法在捕捉典型叙事要素甚为成功,甚至叙事的复杂结构中,将其应用于整个文体时遇到重重障碍,例如财务或计算成本高。我们提议将大语言模型的语言理解能力与主题模型的大规模可适用性结合起来,用叙述性政策框架对时间进行动态模式式叙事转移。我们使用一个专题模型和相应的改变点探测方法来寻找与特定主题有关的变化。我们使用这个模型,过滤特别代表该变化的文件,把它们输入成一个大语文模型,以自动方式解释所发生的变化,区分内容和叙述性变化。我们用《华尔街日报》新闻论文2009年至2023年的文集集来提供我们的资料。我们的研究结果表明,如果某个时间点存在的话,大语言模型可以有效地提取叙事变换,但是在决定内容或叙述性变化发生于何时。
Article 126
Title@2025-06-25 (3): Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue
Title: Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue | Warum Roboter schlecht darin sind, ihre Fehler zu erkennen: Einschränkungen der Fehlkommunikationserkennung im Mensch-Roboter-Dialog | 机器人为何在发现错误时不善:人类机器人对话中通信错误探测的限制 2506.20268v1 |
Authors (6): Ruben Janssens, Jens De Bock, Sofie Labat, Eva Verhelst, Veronique Hoste, Tony Belpaeme
Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models’ ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed.
检测人-机器人互动中的错误通信是保持用户参与和信任的关键功能。虽然人类不遗余力地通过口头和非口头提示检测谈话中的通信错误,但机器人在解读非口头反馈方面面临重大挑战,尽管在识别情感表达方式的计算机愿景方面有所进步。这项研究评估了机器学习模型在检测机器人对话中的错误通信方面的有效性。利用由240个人-机器人对话组成的多模式数据集,其中系统地引入了四种不同的对话失败,我们评估了最先进的计算机愿景模型的性能。每次谈话转弯后,用户都提供了他们是否认为错误的反馈,从而能够分析模型是否有能力准确检测机器人错误的反馈。尽管采用了最先进的模型,但是在识别机器人对话中的错误通信差异时,业绩几乎不会超过随机的机会,而在含有更清晰的情感内容的数据集中,他们成功地发现了混乱的状态。为了探究根本原因,我们请人-机器人研究人员也这样做。他们只能识别大约一半的诱发错误通信模式,类似于我们的模型。这些结果往往会揭示模型中的基本设计期望:在机器人对话中,他们需要这样的机器人对话中的错觉认识。
Article 127
Title@2025-06-25 (3): Language Modeling by Language Models
Title: Language Modeling by Language Models | Sprachmodellierung nach Sprachmodellen | 按语文模式建模的语文 2506.20249v1 |
Authors (3): Junyan Cheng, Peter Clark, Kyle Richardson
Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system, Genesys, employs a Ladder of Scales approach; new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M$\sim$350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., $\sim$86 percentage point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified through pre-training) and find the best designs to be highly competitive with known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system-level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.
我们能否利用LLM来模拟发现新型语言模型(LM)架构的过程?受真实研究的启发,我们提出一种多智能体LLM方法,模拟研究的常规阶段:从构思和文献检索(提案阶段)到设计实现(代码生成)、生成式预训练和下游评估(验证)。借鉴缩放定律的思想,我们的系统Genesys采用“尺度阶梯”(Ladder of Scales)方法:新设计被提出、经对抗式评审、实现,并在越来越大的模型规模(1400万至3.5亿参数)上以逐步收紧的预算(每个规模可训练的模型数量)进行选择性验证。为使发现过程高效且可分解,Genesys采用了一种新颖的遗传编程主干,我们证明它相对于常用的直接提示生成流程具有经验上的优势(例如,在成功生成设计这一关键瓶颈上提高约86个百分点)。我们报告了涉及1,162个新发现设计的实验(其中1,062个通过预训练得到完整验证),发现最佳设计与已知架构相比极具竞争力(例如,在9个常用基准中的6个上优于GPT2、Mamba2等)。我们还给出了全面的系统级消融实验和形式化结果,为设计有效的自主发现系统提供了更广泛的见解。
Article 128
Title@2025-06-25 (3): CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment
Title: CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment | CBF-AFA: Multi-SSL-Fusion auf Chunk-Basis für automatische Fluency Assessment | CBF-AFA: 自动能力评估的基于整块的多SSL融合 2506.20243v1 |
Authors (4): Papa Séga Wade, Mihai Andries, Ioannis Kanellos, Thierry Moudenc
Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing Pyannote.audio-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
自动流利度评估(AFA)仍然具有挑战性,尤其是在捕捉非母语者的语音节奏、停顿和不流利现象方面。我们提出一种基于分块的方法,将自监督学习(SSL)模型(Wav2Vec2、HuBERT和WavLM,因其在语音、韵律和含噪语音建模方面的互补优势而被选用)与分层CNN-BiLSTM框架相结合。我们使用Silero语音活动检测(Silero-VAD)将语音切分为呼吸群分块,从而在减轻过度切分伪影的同时实现细粒度的时间分析。SSL嵌入通过可学习的加权机制进行融合,以平衡声学和语言特征,并辅以分块级流利度标记(例如语速、停顿时长、n-gram重复)。CNN-BiLSTM捕捉分块之间的局部和长期依赖关系。在Avalinguo和Speechocean762上的评估表明,与单一SSL基线相比,我们的方法在Speechocean762上将F1分数提高2.8、皮尔逊相关系数提高6.2个点,在Avalinguo上F1分数提高4.2、皮尔逊相关系数提高4.0个点,超过了基于Pyannote.audio的切分基线。这些结果凸显了基于分块的多SSL融合在稳健流利度评估中的作用,不过未来工作应探索对韵律不规则的方言的泛化能力。
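The "learnable weighted mechanism" for fusing SSL embeddings can be illustrated with a simple weighted sum. The sketch assumes the weights are obtained by a softmax over one learnable scalar per model; the abstract does not specify the exact form, and the function names and toy vectors below are hypothetical.

```python
import math


def softmax(scores):
    """Turn unconstrained scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_embeddings(embeddings, weight_logits):
    """Weighted sum of same-sized embeddings, one logit per SSL model.

    In training, `weight_logits` would be learnable parameters; here
    they are fixed numbers for illustration.
    """
    if len(embeddings) != len(weight_logits):
        raise ValueError("need one weight logit per embedding")
    weights = softmax(weight_logits)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


# Toy 3-dimensional chunk embeddings standing in for the three SSL models.
wav2vec2_emb = [1.0, 0.0, 2.0]
hubert_emb = [0.0, 1.0, 2.0]
wavlm_emb = [2.0, 2.0, 2.0]

# Equal logits give a plain average of the three embeddings.
fused = fuse_embeddings([wav2vec2_emb, hubert_emb, wavlm_emb], [0.0, 0.0, 0.0])
print([round(x, 3) for x in fused])
```

Raising one logit shifts the fused vector toward that model's embedding, which is how such a mechanism can learn which SSL model to trust more.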
Article 129
Title@2025-06-25 (3): Enhancing Large Language Models through Structured Reasoning
Title: Enhancing Large Language Models through Structured Reasoning | Erweiterung großer Sprachmodelle durch strukturierte Vernunft | 通过结构化理由加强大语言模式 2506.20241v1 |
Authors (2): Yubo Dong, Hehe Fan
Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical deduction and systematic planning, primarily due to their reliance on implicit statistical relationships without structured knowledge representation. Inspired by cognitive science and neurosymbolic AI, we introduce a novel approach to enhance LLMs through explicit structured reasoning. First, we convert unstructured data into structured formats by explicitly annotating reasoning steps. We then employ this structured dataset to train LLMs through Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning capabilities of LLMs using Group Relative Policy Optimization (GRPO), incorporating two innovative algorithms, MAX-Flow and Longest Common Subsequence (LCS), which notably improve reasoning effectiveness and reduce computational complexity. Experimental results from fine-tuning a DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust performance across various scenarios, and improved compatibility with optimization techniques, validating the efficacy of structured reasoning integration in LLMs.
近期的大型语言模型(LLM)显著推动了自然语言处理和自动化决策的发展。然而,这些模型在执行涉及逻辑推演和系统规划的复杂推理任务时仍会遇到困难,主要原因是它们依赖隐含的统计关系,而缺乏结构化的知识表示。受认知科学和神经符号AI的启发,我们提出一种通过显式结构化推理来增强LLM的新方法。首先,我们通过显式标注推理步骤,将非结构化数据转换为结构化格式。然后,我们利用这一结构化数据集,通过监督微调(SFT)训练LLM。此外,我们使用组相对策略优化(GRPO)增强LLM的结构化推理能力,并引入两种创新算法:MAX-Flow和最长公共子序列(LCS),显著提高了推理效果并降低了计算复杂度。对DeepSeek-R1-Distill-Qwen-1.5B模型进行微调的实验结果显示,该模型推理简洁,在多种场景下表现稳健,并且与优化技术的兼容性更好,验证了在LLM中融入结构化推理的有效性。
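The abstract names Longest Common Subsequence (LCS) as one of the two algorithms used alongside GRPO but does not say how it enters the training signal. As a reminder of the underlying algorithm only, here is a minimal dynamic-programming sketch; the function name `lcs_length` and the idea of comparing lists of reasoning-step labels are assumptions for illustration, not the authors' implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] is the LCS length of the already-processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]  # LCS with an empty prefix of b is always 0
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching elements extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one element from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical step labels for a reference and a generated reasoning trace.
reference = ["parse", "plan", "solve", "verify"]
generated = ["parse", "guess", "solve", "verify"]
overlap = lcs_length(reference, generated) / len(reference)
```

The rolling-row formulation keeps memory at O(len(b)) while the running time stays O(len(a) * len(b)); whether a normalized overlap like this resembles the paper's actual use of LCS is not stated in the abstract.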
Article 130
Title@2025-06-25 (3): LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
Title: LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models | LLaVA-CMoE: Auf dem Weg zu einer kontinuierlichen Mischung von Experten für große Vision-Sprachenmodelle | LLavaVA-CMoE:建立大型视觉语言模型专家的连续混合体 2503.21227v3 |
Authors (8): Hengyuan Zhao, Ziqin Wang, Qixin Sun, Kaiyou Song, Yilin Li, Xiaolin Hu, Qingpei Guo, Si Liu
Mixture of Experts (MoE) architectures have recently advanced the scalability and adaptability of large language models (LLMs) for continual multimodal learning. However, efficiently extending these models to accommodate sequential tasks remains challenging. As new tasks arrive, naive model expansion leads to rapid parameter growth, while modifying shared routing components often causes catastrophic forgetting, undermining previously learned knowledge. To address these issues, we propose LLaVA-CMoE, a continual learning framework for LLMs that requires no replay data of previous tasks and ensures both parameter efficiency and robust knowledge retention. Our approach introduces a Probe-Guided Knowledge Extension mechanism, which uses probe experts to dynamically determine when and where new experts should be added, enabling adaptive and minimal parameter expansion tailored to task complexity. Furthermore, we present a Probabilistic Task Locator that assigns each task a dedicated, lightweight router. To handle the practical issue that task labels are unknown during inference, we leverage a VAE-based reconstruction strategy to identify the most suitable router by matching input distributions, allowing automatic and accurate expert allocation. This design mitigates routing conflicts and catastrophic forgetting, enabling robust continual learning without explicit task labels. Extensive experiments on the CoIN benchmark, covering eight diverse VQA tasks, demonstrate that LLaVA-CMoE delivers strong continual learning performance with a compact model size, significantly reducing forgetting and parameter overhead compared to prior methods. These results showcase the effectiveness and scalability of our approach for parameter-efficient continual learning in large language models. Our code will be open-sourced soon.
专家混合(MoE)架构最近提高了大型语言模型(LLMs)的可扩展性和适应性,用于持续多式联运学习;然而,高效扩展这些模型以适应连续任务,仍然具有挑战性;随着新任务的到来,天真的模型扩展导致快速参数增长,同时修改共享路线构件,往往造成灾难性的遗忘,破坏先前所学知识;为了解决这些问题,我们提议LALAVA-CMoE,这是一个LLMS的持续学习框架,不需要重复先前任务的数据,确保参数效率和强有力的知识保留。我们的方法引入了Probe-指导知识推广机制,利用专家来动态地确定何时和何处应增加新的专家,从而使得适应性和最小的参数扩展适合任务的复杂性。此外,我们提出了一个概率化任务定位定位器,为任务标签在推断过程中不为人所知,我们利用基于VAE的模型的重建战略,通过匹配投入分发,允许自动和准确的专家分配,从而确定最合适的路由最合适的路由。这个设计可以减缓路由专家选择的参数来决定何时和在哪里增加新的专家,从而根据任务的复杂性扩大参数来扩大扩大参数,使大型的参数能够持续学习。
Article 131
Title@2025-06-25 (3): Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
Title: Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems | Perspektiven im Spiel: Ein multiperspektivischer Ansatz für integrativere NLP-Systeme | 游戏中的视角:对更具包容性的NLP系统采取多视角办法 2506.20209v1 |
Authors (4): Benedetta Muscato, Lucia Passaro, Gizem Gezici, Fosca Giannotti
In the realm of Natural Language Processing (NLP), common approaches for handling human disagreement consist of aggregating annotators’ viewpoints to establish a single ground truth. However, prior studies show that disregarding individual opinions can lead to the side effect of underrepresenting minority perspectives, especially in subjective tasks, where annotators may systematically disagree because of their preferences. Recognizing that labels reflect the diverse backgrounds, life experiences, and values of individuals, this study proposes a new multi-perspective approach using soft labels to encourage the development of the next generation of perspective-aware models that are more inclusive and pluralistic. We conduct an extensive analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection, to highlight the importance of capturing human disagreements, often overlooked by traditional aggregation methods. Results show that the multi-perspective approach not only better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD), but also achieves superior classification performance (higher F1 scores), outperforming traditional approaches. However, our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model uncertainty and uncover meaningful insights into model predictions.
在自然语言处理(NLP)领域,处理人类分歧的共同办法包括:汇集说明者的观点,以确立单一的地面真相;然而,先前的研究显示,无视个人意见可能导致少数人观点的偏少,特别是在主观任务中,因为说明者可能因其偏好而系统地意见分歧;认识到标签反映不同背景、生活经历和个人价值观,本研究报告提出采用新的多视角办法,使用软标签鼓励发展下一代有视野的、更具包容性和多元性的模型;我们对多种主观文本分类任务进行广泛分析,包括仇恨言论、讽刺、滥用语言和姿态探测,强调捕捉人类分歧的重要性,而传统汇总方法往往忽略了这些分歧;结果显示,多视角办法不仅更好地接近人类标签的分布,如Jensen-Shannon Divergence(JSD)所衡量的那样,而且实现优等分类性业绩(高F1分),优于传统方法。然而,我们的方法在诸如讽刺和立场识别模型等任务中表现出信心较低,可能是因为我们所固有的主题性分析、对AX进行深刻的预测。
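To make the soft-label idea and the Jensen-Shannon Divergence (JSD) measurement concrete, the sketch below turns annotator votes into a label distribution and compares it with a model's predicted distribution. The helper names, the example votes, and the choice of base-2 logarithms (which bounds JSD in [0, 1]) are assumptions for illustration; the abstract does not give the authors' exact setup.

```python
import math
from collections import Counter


def soft_label(votes, classes):
    """Turn a list of annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions over the same classes."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


classes = ["hate", "not_hate"]
human = soft_label(["hate", "hate", "not_hate", "hate"], classes)  # 3 of 4 say "hate"
majority = [1.0, 0.0]   # what aggregation to a single ground truth would keep
model = [0.7, 0.3]      # hypothetical model output

# A lower JSD means the prediction is closer to the human label distribution.
print(jsd(human, model), jsd(human, majority))
```

In this toy case the soft prediction sits closer to the annotators' distribution than the hard majority label does, which is the kind of comparison the JSD metric is meant to capture.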
Article 132
Title@2025-06-25 (3): Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
Title: Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation | Intrinsische vs. Extrinsische Bewertung tschechischer Satzeinbettungen: Semantische Relevanz hilft bei MT-Evaluierung nicht | 捷克判决嵌入式:语义相关性无助于MT评价 2506.20203v1 |
Authors (2): Petra Barančíková, Ondřej Bojar
In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for more research into ‘operationalizable semantics’ in sentence embeddings, or for more in-depth downstream task datasets (here, translation evaluation).
在本文中,我们比较了捷克特有的和多语言的句子嵌入模式,通过内在和外部的评价模式。在内在评价中,我们使用Costra(一个复杂的句式转换数据集)和若干语义文本相似性(STS)基准来评估嵌入能力,以捕捉语言现象,例如语义相似性、时间方面和文体差异。在外部评价中,我们用基于知识与技术的计量标准对每个嵌入模式进行微调,用于机器翻译评价。我们的实验揭示了一个有趣的脱节:在内在语义相似性测试中优异的模型在下游翻译评价任务中并不始终产生优异的性能。相反,具有看起来超移动的嵌入空间的模型可以通过微调而取得极好的结果。这些结论突出了语义属性探测和下游任务之间的复杂关系,强调需要对“可操作的语义”嵌入句或更深入的下游任务数据集(这里的翻译评价)进行更多的研究。
Article 133
Title@2025-06-25 (3): How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models?
Title: How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models? | Wie kann man Beispiele im In-Context-Lernen abrufen, um die Erkennung von Konversationsgefühlen mit großen Sprachmodellen zu verbessern? | 如何利用大语言模式获取学习内文中的实例, 2506.20199v1 |
Authors (3): Mengqi Wang, Tiantian Feng, Shrikanth Narayanan
Large language models (LLMs) have enabled a wide variety of real-world applications in various domains. However, creating a high-performing application with high accuracy remains challenging, particularly for subjective tasks like emotion recognition. Inspired by the SLT 2024 GenSER Challenge, this study investigates approaches to improving conversational emotion recognition (CER) by LLMs. Specifically, we explore how to retrieve high-quality examples in in-context learning (ICL) to enhance CER. We propose various strategies based on random and augmented example retrieval and also analyze the impact of conversational context on CER accuracy. Experiments were conducted on three datasets: IEMOCAP, MELD, and EmoryNLP. The results show that augmented example retrieval consistently outperforms the other techniques under investigation across all datasets, highlighting the importance of retrieving coherent targeted examples and enhancing them through paraphrasing.
大型语言模型(LLMS)在不同领域促成了一系列广泛的现实应用。然而,创建高性能且精准的应用程序仍然具有挑战性,特别是对于情感识别等主观任务而言。在SLT 2024 GenSER挑战的启发下,本研究报告调查了提高LLMS对谈话情绪识别的方法。具体地说,我们探讨如何检索文中学习(ICL)中高质量的实例,以加强CER。我们提出了基于随机和强化实例检索的各种战略,并分析了谈话环境对CER准确性的影响。对三个数据集进行了实验,包括IEMOCAP、MELD和EmoryNLP。结果显示,强化实例检索始终超越了所有数据集正在调查的其他技术,突出了重新查找连贯的目标实例的重要性,并通过语音加强这些实例。
Article 134
Title@2025-06-25 (3): COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
Title: COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees | COIN: Ungewissheitssicherung Selektive Frage-Beantwortung für Stiftungsmodelle mit wahrscheinlichen Risikogarantien | COIN: 可靠风险保障基础模型的不确定性保护选择性问题选择性回答 2506.20178v1 |
Authors (7): Zhiyuan Wang, Jinhao Duan, Qingni Wang, Xiaofeng Zhu, Tianlong Chen, Xiaoshuang Shi, Kaidi Xu
Uncertainty quantification (UQ) for foundation models is essential to identify and mitigate potential hallucinations in automatically generated text. However, heuristic UQ approaches lack formal guarantees for key metrics such as the false discovery rate (FDR) in selective prediction. Previous work adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing prediction sets, but these sets often contain incorrect candidates, limiting their practical utility. To address this, we propose COIN, an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods such as Clopper-Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest uncertainty threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN’s robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative upper bound constructions and UQ strategies can further boost COIN’s power performance, which underscores its extensibility and adaptability to diverse application scenarios.
用于基础模型的不确定性量化(UQ)对于确定和减少自动生成文本中潜在幻觉至关重要。然而,狂喜的 UQ 方法缺乏对关键指标(如选择性预测中的虚假发现率(FDR))的正式保障。先前的工作采用分裂一致预测(SCP)框架,以确保通过建立预测数据集(即FDR)对可接受的答案进行预期的覆盖,但这些数据集往往包含不正确的候选人,限制了其实际用途。为此,我们提议COIN,一个不确定性保护选择框架,根据用户指定的FDR限制,校准在统计上有效的阈值以过滤每个问题产生的单一答案。COIN估计校准组的经验误差率,并应用Clopper-Pearson等信任间隔方法,以建立对真实错误率(即FDR)的高度概率上限。这样可以选择最大的不确定性阈值,以确保FDR对测试数据的控制,同时大幅增加样本留存量。我们建议COIN在风险控制方面表现出稳健,在保留可接受答案方面的强大测试-时间能力,以及根据有限的校准数据预测效率,在通用和多式文本生成战略中,我们展示了可进一步提升其业绩。
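The calibration step described above can be sketched with the standard library alone: compute a one-sided Clopper-Pearson upper bound on the error rate among answers kept at each candidate threshold, then keep the largest threshold whose bound stays below the target FDR level. Everything named here (`clopper_pearson_upper`, `select_threshold`, the `(uncertainty, is_correct)` calibration format, and the plain scan over thresholds) is an assumption for illustration and not the authors' code; in particular, the sketch ignores any correction for examining several thresholds on the same calibration data.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, delta):
    """One-sided (1 - delta) Clopper-Pearson upper bound on a binomial proportion."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid  # bound lies above mid
        else:
            hi = mid
    return hi


def select_threshold(calib, alpha, delta):
    """Largest uncertainty threshold whose error-rate upper bound is <= alpha, or None.

    calib: list of (uncertainty, is_correct) pairs from a calibration set.
    """
    best = None
    for t in sorted({u for u, _ in calib}):
        kept = [ok for u, ok in calib if u <= t]       # answers retained at threshold t
        errors = sum(1 for ok in kept if not ok)       # incorrect answers among them
        if clopper_pearson_upper(errors, len(kept), delta) <= alpha:
            best = t
    return best


# Toy calibration set: confident answers are correct, uncertain ones are wrong.
calib = [(0.1, True)] * 40 + [(0.9, False)] * 10
threshold = select_threshold(calib, alpha=0.1, delta=0.1)
```

At test time, under these assumptions, an answer would be returned only if its uncertainty is at most `threshold`, and the system would abstain otherwise.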
Article 135
Title@2025-06-25 (3): Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation
Title: Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation | Conversational User-AI Intervention: Eine Studie zum Prompt Rewriting für verbesserte LLM Response Generation | 相互交流的用户-AI干预:关于为改进LLM反应生成迅速改写的研究报告 2503.16789v2 |
Authors (7): Rupak Sarkar, Bahareh Sarrafzadeh, Nirupama Chandrasekaran, Nagu Rangan, Philip Resnik, Longqi Yang, Sujay Kumar Jauhar
Human-LLM conversations are becoming increasingly pervasive in people’s professional and personal lives, yet many users still struggle to elicit helpful responses from LLM chatbots. One reason is that users lack an understanding of how to craft effective prompts that accurately convey their information needs. Meanwhile, the existence of real-world conversational datasets on the one hand, and the text understanding faculties of LLMs on the other, present a unique opportunity to study this problem and its potential solutions at scale. Thus, in this paper we present the first LLM-centric study of real human-AI chatbot conversations, focused on investigating aspects in which user queries fall short of expressing information needs, and the potential of using LLMs to rewrite suboptimal user prompts. Our findings demonstrate that rephrasing ineffective prompts can elicit better responses from a conversational system, while preserving the user’s original intent. Notably, the performance of rewrites improves in longer conversations, where contextual inferences about user needs can be made more accurately. Additionally, we observe that LLMs often need to (and inherently do) make plausible assumptions about a user’s intentions and goals when interpreting prompts. Our findings largely hold true across conversational domains, user intents, and LLMs of varying sizes and families, indicating the promise of using prompt rewriting as a solution for better human-AI interactions.
人类-LLM对话日益在人们的专业和个人生活中日益普及,但许多用户仍在苦于从LLM Chatbots得到有益的回应。这个问题的原因之一是用户在设计能准确传递信息需求的有效提示方面缺乏了解。与此同时,现实世界的谈话数据集的存在,以及LLMs的文字理解能力,为研究这一问题及其规模上的潜在解决办法提供了一个独特的机会。因此,我们在本文件中介绍了以LLM为中心的对真正的人-AI聊天机对话的首次研究,重点是调查用户询问没有满足信息需求的问题,以及利用LLMs重新编辑亚最佳用户提示的可能性。我们的研究结果表明,重塑无效的提示能够从一个谈话系统得到更好的反应,同时保持用户的原始意图。值得注意的是,改写者的表现在较长的谈话中有所改进,可以更准确地推断用户的需求。此外,我们发现LLMS经常需要 – 和内在的用LMS来更好地解释用户的意向,在用户的视野上,在用户的迅速解读时,在用户的视野上做出一个精确的预期。
Article 136
Title@2025-06-25 (3): SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
Title: SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs | SEED: Ein struktureller Encoder für die Einbettung-getriebener Decodierung in Time Series Prediction mit LLMs | SEED: 使用LLMs在时间序列预测中嵌入式分解编码结构编码器 2506.20167v1 |
Authors (6): Fengze Li, Yue Wang, Yangle Liu, Ming Huang, Dou Hong, Jieming Ma
Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED’s role in addressing the structural-semantic modeling gap.
多变量时间序列预测要求模型同时捕捉不同结构依赖性,并概括不同任务。虽然结构编码器在模型特征互动方面有效,但它们缺乏支持语义层面推理或任务适应的能力。相反,大语言模型(LLMS)具有很强的通用能力,但与原始时间序列投入不相容。这一差距限制了统一、可转移预测系统的开发。因此,我们引入了嵌入驱动解码的结构性编码器SEED,它包含四个阶段:补丁提取的代号编码器、与语言模型嵌入的补丁相匹配的预测模块、绘制任务认知原型的语义重组机制以及一个冻结的预测语言模型。这种模块化结构将从推断学中分离出来,使数字模式和语义学推理之间能够有效地保持一致。实证结果表明,拟议的方法在强大的基线上取得了一致的改进,而关于各种数据集的比较研究证实了SEECD在解决结构标识模型差距方面的作用。
Article 137
Title@2025-06-25 (3): AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control
Title: AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control | AALC: Großes Sprachmodell Effiziente Vernunft über adaptive Genauigkeits-Längen-Steuerung | AALC:通过适应性准确度-语言控制进行大语言模型高效解释 2506.20160v1 |
Authors (7): Ruosen Li, Ziming Luo, Quan Zhang, Ruochen Li, Ben Zhou, Ali Payani, Xinya Du
Large reasoning models (LRMs) achieve impressive reasoning capabilities by generating lengthy chains of thought, but this “overthinking” incurs high latency and cost without commensurate accuracy gains. In this work, we introduce AALC, a lightweight, accuracy-aware length reward integrated into reinforcement learning that dynamically balances correctness and brevity during training. By incorporating validation accuracy into the reward and employing a smooth, dynamically scheduled length penalty, AALC delays the length penalty until target performance is met. Through extensive experiments across standard and out-of-distribution math benchmarks, we show that our approach reduces response length by over 50% while maintaining or even improving the original accuracy. Furthermore, qualitative analysis reveals that our method curbs redundant reasoning patterns such as excessive subgoal setting and verification, leading to structurally refined outputs rather than naive truncation. We also find that efficiency gains are accompanied by reduced interpretability: models trained with AALC omit some narrative framing and explanatory context. These findings highlight the potential of reward-based strategies to guide LRMs toward more efficient, generalizable reasoning paths.
大型推理模型(LRMs)通过产生长长的思维链来取得令人印象深刻的推理能力,但这种“过度思考”导致高潜度和成本,而没有相应的准确性增益。在这项工作中,我们引入了AALC, 一种轻量、精度和觉悟长的奖励,纳入强化学习,动态地平衡了培训期间的正确性和简洁性。通过将验证的准确性纳入奖励中,并采用平稳、动态地排定的长度处罚,AALC延缓处罚,直到达到目标性能。通过在标准和分配以外的数学基准方面进行广泛的实验,我们表明我们的方法在维持甚至提高原有的准确性的同时,将响应时间缩短了50%以上。此外,定性分析表明,我们的方法抑制了多余的推理模式,如过度的次级目标设定和核查,导致结构上的改进产出而不是天真变。我们还发现,效率增益伴随着解释性下降:受过AALC培训的模型省了一些叙述性框架和解释性背景。这些结论突出表明了以奖励为基础的战略在指导LRM走向更高效、更普遍的推理路方面的潜力。
Article 138
Title@2025-06-25 (3): Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
Title: Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners | Belohnender Graph Reasoning Prozess macht LLMs mehr Generalized Reasoners | 奖励图表说明程序使LLMs公司更普遍化理由 2503.00845v2 |
Authors (4): Miao Peng, Nuo Chen, Zongrui Suo, Jia Li
Despite significant advancements in Large Language Models (LLMs), developing advanced reasoning capabilities in LLMs remains a key challenge. Process Reward Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by providing step-wise feedback, particularly in the context of mathematical reasoning. However, their application to broader reasoning domains remains understudied, largely due to the high costs associated with manually creating step-level supervision. In this work, we explore the potential of PRMs in graph reasoning problems - a domain that demands sophisticated multi-step reasoning and offers opportunities for automated step-level data generation using established graph algorithms. We introduce GraphSILO, the largest dataset for graph reasoning problems with fine-grained step-wise labels, built using automated Task-oriented Trajectories and Monte Carlo Tree Search (MCTS) to generate detailed reasoning steps with step-wise labels. Building upon this dataset, we train GraphPRM, the first PRM designed for graph reasoning problems, and evaluate its effectiveness in two key settings: inference-time scaling and reinforcement learning via Direct Preference Optimization (DPO). Experimental results show that GraphPRM significantly improves LLM performance across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5-7B and demonstrating transferability to new graph reasoning datasets and new reasoning domains like mathematical problem-solving. Notably, GraphPRM enhances LLM performance on GSM8K and Math500, underscoring the cross-domain applicability of graph-based reasoning rewards. Our findings highlight the potential of PRMs in advancing reasoning across diverse domains, paving the way for more versatile and effective LLMs.
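尽管大型语言模型(LLM)取得了显著进步,但在LLM中发展高级推理能力仍是一项关键挑战。过程奖励模型(PRM)通过提供逐步反馈,在增强推理方面展现出非凡的潜力,特别是在数学推理方面。然而,它们在更广泛推理领域的应用仍未得到充分研究,这主要是因为人工创建步骤级监督的成本很高。在这项工作中,我们探索PRM在图推理问题上的潜力;这一领域需要复杂的多步推理,并且可以利用成熟的图算法自动生成步骤级数据。我们提出GraphSILO,这是目前最大的带有细粒度逐步标签的图推理问题数据集,它利用自动化的面向任务的轨迹和蒙特卡洛树搜索(MCTS)构建,以生成带有逐步标签的详细推理步骤。基于这一数据集,我们训练了GraphPRM,这是首个为图推理问题设计的PRM,并在两个关键场景中评估其有效性:推理时扩展,以及通过直接偏好优化(DPO)进行的强化学习。实验结果表明,GraphPRM显著提升了LLM在13个图推理任务上的表现,为Qwen2.5-7B带来9%的提升,并展现出向新的图推理数据集和数学问题求解等新推理领域的可迁移性。值得注意的是,GraphPRM提升了LLM在GSM8K和Math500上的表现,凸显了基于图的推理奖励的跨领域适用性。我们的研究结果凸显了PRM在推进跨领域推理方面的潜力,为更通用、更有效的LLM铺平了道路。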
Article 139
Title@2025-06-25 (3): CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation
Title: CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation | CCRS: Ein Null-Shot LLM-as-a-Richter-Rahmen für eine umfassende RAG-Bewertung | CCRS: 全面RAG综合评价框架 2506.20128v1 |
Authors (1): Aashiq Muhamed
RAG systems enhance LLMs by incorporating external knowledge, which is crucial for domains that demand factual accuracy and up-to-date information. However, evaluating the multifaceted quality of RAG outputs, spanning aspects such as contextual coherence, query relevance, factual correctness, and informational completeness, poses significant challenges. Existing evaluation methods often rely on simple lexical overlap metrics, which are inadequate for capturing these nuances, or involve complex multi-stage pipelines with intermediate steps like claim extraction or require finetuning specialized judge models, hindering practical efficiency. To address these limitations, we propose CCRS (Contextual Coherence and Relevance Score), a novel suite of five metrics that utilizes a single, powerful, pretrained LLM as a zero-shot, end-to-end judge. CCRS evaluates: Contextual Coherence (CC), Question Relevance (QR), Information Density (ID), Answer Correctness (AC), and Information Recall (IR). We apply CCRS to evaluate six diverse RAG system configurations on the challenging BioASQ dataset. Our analysis demonstrates that CCRS effectively discriminates between system performances, confirming, for instance, that the Mistral-7B reader outperforms Llama variants. We provide a detailed analysis of CCRS metric properties, including score distributions, convergent/discriminant validity, tie rates, population statistics, and discriminative power. Compared to the complex RAGChecker framework, CCRS offers comparable or superior discriminative power for key aspects like recall and faithfulness, while being significantly more computationally efficient. CCRS thus provides a practical, comprehensive, and efficient framework for evaluating and iteratively improving RAG systems.
RAG系统通过纳入外部知识来提升LLM,这对于要求事实准确性和最新信息的领域至关重要。然而,评估RAG产出的多方面质量,包括背景一致性、查询相关性、事实正确性和信息完整性等各个方面,都构成重大挑战。现有的评价方法往往依赖简单的词汇重叠度量,这些度量不足以捕捉这些细微差别,或涉及复杂的多阶段管道,具有中间步骤,如索赔提取或需要微调专业法官模型,从而妨碍实际效率。为克服这些限制,我们建议CCRS(Textaulity Conference and recernal Cality CRASQ),这是一套新型的五种衡量标准,利用单一的、强大的、预先训练的LLM(LM)作为零射分数、端对端至端法官。CCRS(C)评价:CLC(C)、问题相关性(Q)、信息密度(ID)、回答(AC(AC)和回调(IR)。我们运用CCRS(CCRS)来评估具有挑战性的RA系统六种不同的系统配置。我们的分析表明,CRS(CRS)有效地区分了系统的业绩,确认、可比较性、准确性、准确性、准确性、准确性、准确性统计,同时提供。
Article 140
Title@2025-06-25 (3): Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
Title: Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests | Einsatz von KI-Gradern für fehlende Score-Imputation, um eine genaue Abschätzung der Fähigkeit in konstruierten Reaktionstests zu erreichen | 利用AI 级数来计算缺失计分数,以在建构反应测试中实现准确能力估算 2506.20119v1 |
Authors (2): Masaki Uto, Yuma Ito
Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.
评估学习者的能力是教育领域的一项基本目标,特别是越来越需要评估更高层次的能力,例如表达技巧和逻辑思维; 建立答复的测试,例如短期回答和论文问题,已被广泛用作满足这一需求的方法; 虽然这些测试是有效的,但它们需要大量的手工分级,使得它们既需要劳动密集型又需要昂贵; 项目反应理论(IRT)提供了有希望的解决办法,它使得能够从不完全的得分数据中估算能力,在这个数据中,人类评分者只分分了学习者在多个测试项目中提供的回答。然而,能力估算的准确性随着缺失得分的比例的增加而下降。尽管已经探索了计算缺失得分的数据增强技术,以解决这一局限性,但是它们往往与稀有或差异性数据不准确的问题作斗争。为克服这些挑战,本研究报告提出了一种新的方法,即利用自动评分技术来估算缺失的得分,用于精确的以劳动力测试为基础的能力估算。拟议方法在明显减少人工分数工作量的同时,在能力估算方面达到很高的准确度。
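The abstract does not say which IRT model is used, so the following is only a minimal sketch of the general idea that ability can be estimated from incomplete score data. It assumes a dichotomous Rasch model with known item difficulties and marks missing scores as `None`; the function name and toy data are hypothetical. Under these assumptions, an imputation method such as the one proposed would fill some of the `None` entries (for example with an automated grader's score) before estimation.

```python
import math


def estimate_ability(scores, difficulties, iters=50):
    """Maximum-likelihood ability under a Rasch model, using observed scores only.

    scores: list of 1 (correct), 0 (incorrect), or None (missing).
    difficulties: item difficulty parameters, assumed known here.
    Note: the MLE does not exist if all observed scores are 0 or all are 1.
    """
    observed = [(x, b) for x, b in zip(scores, difficulties) if x is not None]
    theta = 0.0
    for _ in range(iters):  # Newton-Raphson on the log-likelihood
        probs = [1 / (1 + math.exp(-(theta - b))) for _, b in observed]
        gradient = sum(x - p for (x, _), p in zip(observed, probs))
        information = sum(p * (1 - p) for p in probs)
        theta += gradient / information
    return theta


# Third item is ungraded; only the three graded items inform the estimate.
theta_hat = estimate_ability([1, 1, None, 0], [0.0, 0.0, 2.0, 0.0])
```

With fewer observed scores the Fisher information term shrinks, which is one way to see why estimation accuracy declines as the proportion of missing scores grows.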
Article 141
Title@2025-06-25 (3): A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection
Title: A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection | Multi-Pass Large Language Model Framework für präzise und effiziente Radiologie Fehlererkennung melden | 精确和高效放射学报告误差探测多版本大语言示范框架 2506.20112v1 |
Authors (6): Songsoo Kim, Seungtae Lee, See Young Lee, Joonho Kim, Keechan Kan, Dukyong Yoon
Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3’s superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
背景:由于错误发生率低,基于大型语言模型(LLM)的放射学报告校对的阳性预测值(PPV)有限。目的:评估与基线方法相比,三轮(three-pass)LLM框架是否能提高PPV并降低运营成本。材料和方法:对MIMIC-III数据库中连续的1,000份放射学报告(放射摄影、超声、CT、MRI各250份)进行回顾性分析。两个外部数据集(CheXpert和Open-i)用作验证集。测试了三种LLM框架:(1)单提示检测器;(2)提取器加检测器;(3)提取器、检测器和假阳性验证器。精度用PPV和绝对真阳性率(aTPR)衡量。效率根据模型推理费用和审阅者报酬计算。统计显著性使用聚类自助法(cluster bootstrap)、精确McNemar检验和Holm-Bonferroni校正进行检验。结果:PPV从框架1的0.063(95% CI,0.036-0.101)提高到框架2的0.079(0.049-0.118),并显著提高到框架3的0.159(0.090-0.252;与基线相比P<.001)。aTPR保持稳定(0.012-0.014;P>=.84)。每1,000份报告的运营成本从9.72美元(框架1)和6.85美元(框架2)降至5.58美元(框架3),分别减少42.6%和18.5%。需人工审阅的报告从192份减少到88份。外部验证支持框架3具有更高的PPV(CheXpert 0.133,Open-i 0.105)和稳定的aTPR(0.007)。结论:三轮LLM框架显著提高了PPV并降低了运营成本,同时保持了检测性能,为AI辅助的放射学报告质量保证提供了有效策略。
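The reported cost reductions can be checked directly from the per-1,000-report costs, and the PPV and aTPR figures can be cross-checked against the human-review counts. The second check rests on assumptions not stated in the abstract: that the 192 and 88 human-reviewed reports are the reports flagged by Frameworks 1 and 3 respectively, and that aTPR is true positives divided by all 1,000 reports.

```python
def pct_reduction(old, new):
    """Percentage reduction from old to new, rounded to one decimal place."""
    return round(100 * (old - new) / old, 1)


# Operational cost per 1,000 reports in USD, as reported in the abstract.
cost = {1: 9.72, 2: 6.85, 3: 5.58}
print(pct_reduction(cost[1], cost[3]))  # Framework 1 -> Framework 3
print(pct_reduction(cost[2], cost[3]))  # Framework 2 -> Framework 3

# Assumed reading: flagged reports go to human review, PPV = TP / flagged,
# and aTPR = TP / total reports.
n_reports = 1000
checks = []
for reviewed, ppv in [(192, 0.063), (88, 0.159)]:
    true_positives = reviewed * ppv
    checks.append((round(true_positives), round(true_positives / n_reports, 3)))
print(checks)
```

The first two values reproduce the stated 42.6% and 18.5% reductions. Under the assumed reading, the implied true-positive counts (about 12 and 14) match the reported aTPR range of 0.012 to 0.014, which illustrates how PPV can more than double while the number of detected errors stays roughly constant.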
Article 142
Title@2025-06-25 (3): A Global Context Mechanism for Sequence Labeling
Title: A Global Context Mechanism for Sequence Labeling | Ein globaler Kontextmechanismus für die Sequenzkennzeichnung | 序列标签全球背景机制 2305.19928v5 |
Authors (4): Conglei Xu, Kun Shen, Hongguang Sun, Yang Xu
Global sentence information is crucial for sequence labeling tasks, where each word in a sentence must be assigned a label. While BiLSTM models are widely used, they often fail to capture sufficient global context for inner words. Previous work has proposed various RNN variants to integrate global sentence information into word representations. However, these approaches suffer from three key limitations: (1) they are slower in both inference and training compared to the original BiLSTM, (2) they cannot effectively supplement global information for transformer-based models, and (3) they incur a high time cost when these customized RNNs are reimplemented and integrated into existing architectures. In this study, we introduce a simple yet effective mechanism that addresses these limitations. Our approach efficiently supplements global sentence information for both BiLSTM and transformer-based models, with minimal degradation in inference and training speed, and is easily pluggable into current architectures. We demonstrate significant improvements in F1 scores across seven popular benchmarks, including Named Entity Recognition (NER) tasks such as Conll2003, Wnut2017, and the Chinese named-entity recognition task Weibo, as well as End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14, Restaurant15, and Restaurant16. Without any extra strategy, we achieve the third-highest score on the Weibo NER benchmark. Compared to CRF, one of the most popular frameworks for sequence labeling, our mechanism achieves competitive F1 scores while offering superior inference and training speed. Code is available at: https://github.com/conglei2XU/Global-Context-Mechanism
全局句子信息对序列标注任务至关重要,在这类任务中,句子中的每个词都必须被赋予一个标签。虽然BiLSTM模型被广泛使用,但它们往往无法为句子内部的词捕捉足够的全局上下文。以往的工作提出了多种RNN变体,将全局句子信息融入词表示。然而,这些方法存在三个主要局限:(1)与原始BiLSTM相比,它们的推理和训练速度都更慢;(2)它们无法有效地为基于Transformer的模型补充全局信息;(3)重新实现这些定制RNN并将其集成到现有架构中的时间成本很高。在本研究中,我们提出一种简单而有效的机制来解决这些局限。我们的方法能高效地为BiLSTM和基于Transformer的模型补充全局句子信息,对推理和训练速度的影响极小,并且可以轻松接入现有架构。我们在七个常用基准上的F1分数均有显著提升,包括命名实体识别(NER)任务Conll2003、Wnut2017和中文命名实体识别任务Weibo,以及端到端基于方面的情感分析(E2E-ABSA)基准Laptop14、Restaurant14、Restaurant15和Restaurant16。在不使用任何额外策略的情况下,我们在Weibo NER基准上取得了第三高的分数。与最常用的序列标注框架之一CRF相比,我们的机制在取得有竞争力的F1分数的同时,具有更快的推理和训练速度。代码见:https://github.com/conglei2XU/Global-Context-Mechanism
Article 143
Title@2025-06-25 (3): What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning
Title: What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning | Was in LLM-generierten Daten zählt: Vielfalt und ihre Wirkung auf Modell Feintuning | LLM产生的数据中哪些重要:多样性及其对模拟微调的影响 2506.19262v2 |
Authors (9): Yuchang Zhu, Huazhen Zhong, Qunshu Lin, Haotong Wei, Xiaolong Sun, Zixuan Yu, Minghao Liu, Zibin Zheng, Liang Chen
With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
由于大型语言模型(LLMs)具有非凡的生成能力,利用LLM产生的数据来培训下游模型,已成为缓解特定领域数据稀缺和减少耗时标注的一个很有希望的方法。然而,最近的研究突出了一个关键问题:在自生成数据上进行迭代培训会导致模型崩溃,模型性能逐渐下降。尽管对LLM产生的数据的影响进行了广泛的研究,但这些工作往往忽视了数据多样性的重要性,而数据多样性是数据质量的一个关键因素。在这项工作中,我们的目标是了解LLM产生的数据的多样性对下游模型性能的影响。具体地说,我们探索LLM产生的数据的不同多样性水平如何影响下游模型性能。此外,我们调查了在混合了不同比例的LLM产生的数据(我们称之为合成数据)的数据上受过培训的模型的性能。我们的实验结果显示,在分布偏移最小的情况下,多样性适中的LLM生成数据可以在标注数据不足的情况下提高模型性能,而多样性过高的生成数据则会产生负面影响。我们希望我们的经验调查结果将为今后LLMs作为数据生成者的研究提供宝贵的指导。
Article 144
Title@2025-06-25 (3): A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans
Title: A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans | Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen | 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v2 |
Authors (4): Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Only one relation was considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that solved by the PLMs. This means that at this point in time, there is only an incomplete view of models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability and fairly compare humans and models on the same task. Our extensive experiments involve 16 PLMs, eight masked and eight causal language models. Up to now only masked language models had been tested although causal and masked language models treat context differently. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations. Antonymy is the outlier relation where all models perform reasonably well. In general, masked language models perform significantly better than causal language models. Nonetheless, both masked and causal language models are likely to confuse non-antonymy relations with antonymy.
最近,许多工作都与PLMS(未受过训练的语言模式)所了解的语言不同方面及其如何学习的谜题有关。 这种研究的一流就是调查PLMs对语义关系的认识。 但是,语义关系的许多方面没有被探索。 仅考虑一种关系, 即超尼。 此外, 先前的工作没有测量人类在与PLMS所解决的相同任务方面的表现。 这意味着, 目前, 模型的语义关系知识的模糊性只是一种不完全的观点。 为了弥补这一差距, 我们引入了一个全面的评估框架, 涵盖我语义关系超过超尼的五种关系, 即低尼米、 holonymy、 meronymy、 antononnymy 和 yonny。 我们使用了六种衡量标准( 这里新引入了两个) 来测量语义关系知识的未经处理的方面, 即音义、 完整性、 对称性、 不对称、 直观性、 易辨、 和识别性, 以及同一任务上的人文和模式的可能比较。 我们的广泛实验模型涉及16个不完全的模型, 隐喻的模型, 几乎由8个解释的、 语言和反因果语言关系中的一种。
Article 145
Title@2025-06-25 (3): Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition
Title: Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition | Fehlausrichtung des semantischen Beziehungswissens zwischen WordNet und menschlicher Intuition | WordNet与人类直觉之间的语义关系知识失调 2412.02138v2 |
Authors (4): Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
WordNet provides a carefully constructed repository of semantic relations, created by specialists. But there is another source of information on semantic relations: the intuition of language users. We present the first systematic study of the degree to which these two sources are aligned. Investigating cases of misalignment could support proper use of WordNet and facilitate its improvement. Our analysis, which uses templates to elicit responses from human participants, reveals a general misalignment of semantic relation knowledge between WordNet and human intuition. Further analyses find a systematic pattern of mismatch among synonymy and taxonomic relations (hypernymy and hyponymy), together with the fact that WordNet path length does not serve as a reliable indicator of human intuition regarding hypernymy or hyponymy relations.
WordNet提供了一个由专家精心构建的语义关系储存库。但还有一个关于语义关系的信息源,即语言使用者的直觉。我们提出了关于这两个语义关系(语言使用者的直觉)的首次系统研究。调查不吻合的案例可以适当使用WordNet并促进其改进。我们使用模板来征求人类参与者的答复的分析揭示了WordNet与人类直觉之间语义关系知识的普遍不和。进一步的分析发现,同义和分类关系(体义和虚弱)之间存在着一种系统性的不匹配模式,同时WordNet路径长度不能作为人类对高尼或虚弱关系的直觉的可靠指标。
Article 146
Title@2025-06-25 (3): MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
Title: MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations | MIRAGE: Benchmark für multimodale Informationssuche und -vernunft in sachverständigen Gesprächen in der Landwirtschaft | MIRAGE:农业专家指导下的农业多模式信息查找和说明理由基准 2506.20100v1 |
Authors (7): Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io
我们引入了MIRAGE,这是在咨询性互动环境中多模态专家层面推理和决策的新基准。为农业领域设计的MIRAGE将自然用户询问、专家撰写的答复和基于图像的背景结合起来,从而抓住专家磋商的全部复杂性,为基于依据的推理、澄清战略和在现实世界知识密集型领域长式生成模型的评价提供了一个高度忠诚的基准。基于35 000多个真正的用户-专家互动,并通过精心设计的多步管道加以调整,MIRAGE覆盖了多种作物健康、虫害诊断和作物管理设想。基准包括7 000多个独特的生物实体,涵盖植物物种、虫害和疾病,使其成为基于现实世界的愿景语言模型最多样化的分类基准之一。与现有基准不同,现有基准依赖于明确指定的用户投入和封闭式分类,MIRAGE的特征未明确描述,环境丰富且具有开放世界环境,需要模型来推断潜在的知识差距,处理稀有实体,或者积极主动地指导互动或应对。项目页面: https://mirage-benchmark.github.io
Article 147
Title@2025-06-25 (3): PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
Title: PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models | PSALM-V: Symbolische Planung in interaktiven visuellen Umgebungen mit großen Sprachmodellen automatisieren | PSALM-V:在与大语言模型互动视觉环境中自动进行符号规划 2506.20097v1 |
Authors (5): Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, Jesse Thomason
We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.
我们提出PSALM-V,这是第一个能够通过互动在视觉环境中归纳符号动作语义(即前置条件和后置条件)的自主神经符号学习系统。PSALM-V在没有专家行动定义的情况下引导出可靠的符号规划,使用LLMs来生成启发式计划和候选符号语义。以前的工作探索了使用大语言模型来生成用于基于规划领域定义语言(PDDL)的符号规划器的动作语义。然而,这些方法主要侧重于基于文字的域,或者依赖不切实际的假设,例如访问预先定义的问题文件、完全可观测性或明确的错误信息。相比之下,PSALM-V通过分析执行结果并合成可能的错误解释,动态地推断PDDL问题文件和域行动语义。系统反复生成和执行计划,同时为每种行动可能的行动语义保持一个树状结构的信念,并反复地修正这些信念,直到达到目标状态。ALFRED中的任务完成模拟实验表明,在部分可观测环境下,PSALM-V将计划成功率从37%(Claude-3.7)提高到74%。在RTFM和Overcooked-AI两个2D游戏环境中的结果表明,PSALM-V提高了步骤效率,并在多智能体环境中成功完成领域归纳。尽管机器人存在底层操作失败,PSALM-V仍能为真实世界的机器人BlocksWorld任务正确归纳PDDL前置条件和后置条件。
Article 148
Title@2025-06-25 (3): PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
Title: PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding | PP-DocBee2: Verbesserte Baselines mit effizienten Daten für multimodales Dokumentenverständnis | PP-DocBee2:改进基线,提供高效数据,促进多模态文档理解 2506.18023v2 |
Authors (6): Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
This report introduces PP-DocBee2, an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.
本报告介绍PP-DocBee2,这是PP-DocBee的先进版本,旨在加强多模态文档理解。PP-DocBee2基于大型多模态模型结构,通过关键的技术改进,包括提高合成数据质量、改进视觉特征融合战略和优化推理方法,解决其前身的局限性。这些改进使中国商业文件内部基准的性能提升11.4%,并将推理延迟相对于原始版本降低73.0%。我们工作的一个关键创新是多模态文档任务的数据质量优化战略。通过采用大型多模态预培训模型来评估数据,我们采用新的统计标准过滤异常值,确保高质量的培训数据。根据对多模态模型中未充分利用的中间特征的深刻认识,我们通过将ViT分层分解和采用新的特征融合战略来提高其表示能力,从而改进复杂推理。源代码和预培训模型可在以下网站查阅:https://github.com/PaddlePaddle/PaddleMIX
Article 149
Title@2025-06-25 (3): ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
Title: ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset | ITFormer: Überbrückungszeitreihe und natürliche Sprache für Multi-Modal QA mit großformatigem Multitask-Datensatz | ITFormer:具有大型多任务数据集的多模式QA的连接时间序列和自然语言 2506.20093v1 |
Authors (9): Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, Zhongyu Wei
Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/
在工业监测、医疗诊断和气候研究等多种应用中,时间序列数据至关重要。然而,将这些高维时间信号与自然语言有效地结合,以完成动态和互动任务,这仍然是一个重大挑战。为了解决这个问题,我们引入了时间序列问答(Time-Series QA)任务,并发布EngineMT-QA,这是第一个大型、多任务、时间-文字QA数据集,旨在捕捉时间序列信号和自然语言之间的复杂互动。基于此资源,我们提议了Instruct Time Transformer(ITFormer),这是一个将时间序列编码器与冻结的大型语言模型(LLMs)连接起来的新框架。ITFormer有效地提取、对齐和融合时间和文字特征,以不到1%的额外可训练参数,使QA的精度大大高于强的基线。通过将计算效率与稳健的跨模态建模结合起来,我们的工作为将时间数据与自然语言融合建立了一个可适应的范式,为多模态AI的新研究和应用铺平道路。关于该项目的更多细节,包括数据集和代码,可在以下网址查阅:https://pandalin98.github.io/itformer_site/
Article 150
Title@2025-06-25 (3): Understanding World or Predicting Future? A Comprehensive Survey of World Models
Title: Understanding World or Predicting Future? A Comprehensive Survey of World Models | Welt verstehen oder Zukunft voraussagen? Eine umfassende Übersicht über Weltmodelle | 了解世界或预测未来?世界模式综合概览 2411.14499v2 |
Authors (12): Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong Li
The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua-fib-lab/World-Model.
世界模型的概念由于在诸如GPT-4和Sora等多式大型语言模型等对追求人工一般智能至关重要的多式大型语言模型和Sora等视频生成模型方面的进展而引起人们的极大关注。本调查全面审查了世界模型的文献。一般而言,世界模型被视为了解世界现状或预测其未来动态的工具。本审查对世界模型进行系统分类,强调两个主要功能:(1) 建立内部代表机构以了解世界机制,(2) 预测未来国家模拟和指导决策。最初,我们审查这两个类别目前的进展。然后,我们探索世界模型在关键领域的应用,包括自主驾驶、机器人和社会模拟领域,重点是每个领域如何利用这些方面。最后,我们概述了主要挑战,并对潜在的未来研究方向提出见解。我们总结了代表性文件及其代码储存库,载于https://github.com/tsinghua-fib-lab/World-Model。
Article 151
Title@2025-06-25 (3): Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
Title: Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models | Aufmerksamkeitsentropie ist ein Schlüsselfaktor: Eine Analyse des Parallelkontexts Kodierung mit Full-Attention-basierten vortrainierten Sprachmodellen | 注意信封是一个关键因素:分析平行背景编码,采用以充分注意为基础的预先培训前语文模式 2412.16545v2 |
Authors (8): Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. Although powerful, this method can be inefficient for long sequences and may overlook inherent input structures. To address these problems, an alternative approach is parallel context encoding, which splits the context into sub-pieces and encodes them parallelly. Because parallel patterns are not encountered during training, naively applying parallel encoding leads to performance degradation. However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. Furthermore, we adopt two straightforward methods to reduce attention entropy by incorporating attention sinks and selective mechanisms. Experiments on various tasks reveal that these methods effectively lower irregular attention entropy and narrow performance gaps. We hope this study can illuminate ways to enhance context modeling mechanisms.
大型语言模型由于其在背景建模方面的特殊能力,在多种语言任务中表现出了显著的成绩。最常用的背景建模方法是完全自我注意,正如标准解码器专用变异器所显示的那样。这种方法虽然很有力,但对于长序列来说效率低,而且可能忽视固有的输入结构。为了解决这些问题,另一种办法是平行的环境编码,将上下文分为子体,并平行地编码。由于在培训期间没有遇到平行的模式,天真地应用平行编码会导致性能退化。然而,基本原因和潜在的缓解措施并不明确。在这项工作中,我们提供了对这一问题的详细分析,并查明异常高的注意诱变器可能是一个关键因素。此外,我们采取了两种直接的方法,通过吸收注意汇和选择性机制来减少注意力的诱变。在各种任务上进行的实验表明,这些方法有效地降低了不规则的注意激素和缩小性能差距。我们希望这一研究能够说明加强环境建模机制的方法。
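The attention-entropy quantity discussed in this abstract can be illustrated with a toy calculation. The sketch below is only an assumption-laden illustration, not the authors' setup: the helper names `softmax` and `attention_entropy` are hypothetical, and parallel context encoding is mimicked by simply tiling the same key logits across three windows. Under that assumption, a query that attends over more similar-looking windows has a flatter distribution and therefore higher entropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# One query attending over a single context window of four keys.
window_logits = [4.0, 1.0, 0.5, 0.2]
single = softmax(window_logits)

# Assumption: three independently encoded windows expose similar logits,
# so the query sees the same pattern three times (a deliberately crude model).
parallel = softmax(window_logits * 3)

print(attention_entropy(single), attention_entropy(parallel))
```

In this toy model, tiling a distribution k times raises its entropy by exactly log k, which is one way to see why attention spread over several parallel windows can look unusually diffuse compared with what a model saw during training.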
Article 152
Title@2025-06-25 (3): Therapy as an NLP Task: Psychologists’ Comparison of LLMs and Human Peers in CBT
Title: Therapy as an NLP Task: Psychologists’ Comparison of LLMs and Human Peers in CBT | Therapie als NLP-Aufgabe: Psychologenvergleich von LLMs und menschlichen Peers im CBT | 作为国家实验室规划任务的一项治疗任务:心理学家对科威特中央银行的LLMs和人类同侪的比较 2409.02244v2 |
Authors (5): Zainab Iftikhar, Sean Ransom, Amy Xiao, Nicole Nugent, Jeff Huang
Large language models (LLMs) are being used as ad-hoc therapists. Research suggests that LLMs outperform human counselors when generating a single, isolated empathetic response; however, their session-level behavior remains understudied. In this study, we compare the session-level behaviors of human counselors with those of an LLM prompted by a team of peer counselors to deliver single-session Cognitive Behavioral Therapy (CBT). Our three-stage, mixed-methods study involved: a) a year-long ethnography of a text-based support platform where seven counselors iteratively refined CBT prompts through self-counseling and weekly focus groups; b) the manual simulation of human counselor sessions with a CBT-prompted LLM, given the full patient dialogue and contextual notes; and c) session evaluations of both human and LLM sessions by three licensed clinical psychologists using CBT competence measures. Our results show a clear trade-off. Human counselors excel at relational strategies – small talk, self-disclosure, and culturally situated language – that lead to higher empathy, collaboration, and deeper user reflection. LLM counselors demonstrate higher procedural adherence to CBT techniques but struggle to sustain collaboration, misread cultural cues, and sometimes produce “deceptive empathy,” i.e., formulaic warmth that can inflate users’ expectations of genuine human care. Taken together, our findings imply that while LLMs might outperform counselors in generating single empathetic responses, their ability to lead sessions is more limited, highlighting that therapy cannot be reduced to a standalone natural language processing (NLP) task. We call for carefully designed human-AI workflows in scalable support: LLMs can scaffold evidence-based techniques, while peers provide relational support. We conclude by mapping concrete design opportunities and ethical guardrails for such hybrid systems.
大型语言模型(LLMS)正在用作高级治疗师。研究表明,LLMS在产生单一的、孤立的、同情性的反应时,优于人类顾问;然而,他们的届会级别行为仍然没有得到充分研究。在本研究中,我们比较了人类顾问的届会级别行为与LLM的届会级别行为,这是由同行顾问小组推动的,目的是提供单一会期认知行为疗法。我们的三个阶段、混合方法的研究涉及:(a) 一个基于文本的支持平台的民族学年久,在这个平台上,7名顾问通过自我咨询及每周焦点小组对CBT进行迭代式精细化的CBT进行催化反应;(b) 将人类顾问会议与CBT-PLPLM的届会级别行为进行人工模拟,这是由3名获得许可的临床心理学家使用CBT的功能支持。我们的成果可以显示一种明确的交易。 人类顾问们在关系战略上表现了 – 小的、自我披露的和符合文化背景的 RBLALM 提高的学习能力课程, —— 能够导致更高级的用户在程序上进行更深入的合作。
Article 153
Title@2025-06-25 (3): Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Title: Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder | Bridging Compositional and Distributional Semantics: Eine Umfrage zur latenten Semantischen Geometrie über AutoEncoder | 搭桥构成和分布式语义学:通过自动 Encder 进行边端语义几何测量调查 2506.20083v1 |
Authors (3): Yingji Zhang, Danilo S. Carvalho, André Freitas
Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as semantic representation learning. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures, namely Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
将构成和象征属性整合到当前分布式语义空间中,可以增强基于变换器的自回归语言模型(LMs)的可解释性、可控制性、可构成性和一般化能力。在本次调查中,我们通过构成式语义学透镜对潜在空间几何提供了新视角,我们称这一方向为语义表示学习(semantic representation learning)。这个方向可以连接符号和分布式语义,有助于缩小它们之间的差距。我们审查并比较了三种主流自动编码器结构,即Variational AutoEncoder(VAE)、Vector Quantised VAE(VQVAE)和Sparse AutoEncoder(SAE),并考察它们与语义结构和可解释性相关的独特潜在几何特征。
Article 154
Title@2025-06-25 (3): Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
Title: Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective | Quantifizierung von Fairness in LLMs jenseits von Tokens: Eine semantische und statistische Perspektive | 量化 Tokens 之后的LLLMs的公平性:语义和统计观点 2506.19028v2 |
Authors (7): Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
大型语言模型(LLMS)往往产生具有内在偏见的反应,损害其真实应用的可靠性。现有的评价方法往往忽视长式反应中的偏见和LLM产出的内在变异性。为了应对这些挑战,我们提议FisCo(Finegraide Semantic Computation),这是一个新的统计框架,通过发现不同人口群体在长式反应中的微妙的语义差异来评价LMS的集团公平性。不同于以往侧重于情感或象征性比较的工作,FisSco通过在索赔水平上操作,超越地表一级分析,利用测试来评估不同反应的含义的一致性。我们将模型产出分解成分义性索赔,并应用统计假设测试来比较不同群体之间和群体内部的相似性,从而能够有力地发现微妙的偏差。我们正式确定一个新的群体反事实公平定义,并验证合成和人类附加说明的跨越性别、种族和年龄的数据集。实验表明,FisSco更可靠地识别了细微的偏见,同时减少分辨的LM可变性的影响,超越各种评价指标。
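The comparison of inter- and intra-group similarities by hypothesis testing can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say which test is used, so a one-sided permutation test is swapped in here, the function name `permutation_p_value` is hypothetical, and the similarity scores are made-up placeholders standing in for claim-level entailment scores.

```python
import random
from statistics import mean


def permutation_p_value(intra, inter, n_permutations=2000, seed=0):
    """One-sided permutation test: is mean(intra) larger than mean(inter)?

    `intra` holds similarity scores between responses for the same group,
    `inter` holds scores between responses for different groups.
    """
    rng = random.Random(seed)
    observed = mean(intra) - mean(inter)
    pooled = list(intra) + list(inter)
    n_intra = len(intra)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_intra]) - mean(pooled[n_intra:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Placeholder scores (hypothetical, for illustration only).
intra_scores = [0.90, 0.92, 0.88, 0.91, 0.90]
inter_scores = [0.50, 0.55, 0.52, 0.48, 0.50]
print(permutation_p_value(intra_scores, inter_scores))
```

A small p-value here would suggest that responses differ more across groups than within a group, which is the kind of signal a group-level fairness check looks for; comparing against intra-group similarity is what accounts for ordinary sampling variability in the model's outputs.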
Article 155
Title@2025-06-25 (3): mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
Title: mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks | mSTEB: Massive mehrsprachige Bewertung von LLMs zu Sprach- und Textaufgaben | mSTEB: 对关于发言和文本任务LLM女士进行大规模多语种评价 2506.08400v2 |
Authors (7): Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani
Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.
大型语言模型(LLMS)在包括演讲等多式联运环境中等多种任务方面表现出了令人印象深刻的业绩,然而,其评价往往仅限于英语和少数高资源语言;对于低资源语言,没有标准化的评价基准;在本文件中,我们通过采用MSTEB来弥补这一差距,这是评价LLMS在包括语言识别、文本分类、回答问题和语言和文本模式翻译任务等广泛任务方面的业绩的新基准;我们评价了主要LMS的绩效,如Gemini 2.0 Flash和GPT-4o(Audio)以及Qwen 2 Audio和Gemma 3 27B等最先进的开放模型。我们的评价显示,高资源和低资源语言在业绩上存在很大差距,尤其是在非洲和美洲/大洋洲语言方面。我们的调查结果显示,需要更多投资来解决其在LMS覆盖面中代表不足的问题。
Article 156
Title@2025-06-25 (3): A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs
Title: A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs | Ein modulares Multitask-Reasoning-Framework Integrating Spatio-temporal Models und LLMs | 纳入时空空间模型和LLMs的模块多任务解释框架 2506.20073v1 |
Authors (6): Kethmi Hirushini Hettige, Jiahao Ji, Cheng Long, Shili Xiang, Gao Cong, Jingyuan Wang
Spatio-temporal data mining plays a pivotal role in informed decision making across diverse domains. However, existing models are often restricted to narrow tasks, lacking the capacity for multi-task inference and complex long-form reasoning that require generation of in-depth, explanatory outputs. These limitations restrict their applicability to real-world, multi-faceted decision scenarios. In this work, we introduce STReason, a novel framework that integrates the reasoning strengths of large language models (LLMs) with the analytical capabilities of spatio-temporal models for multi-task inference and execution. Without requiring task-specific finetuning, STReason leverages in-context learning to decompose complex natural language queries into modular, interpretable programs, which are then systematically executed to generate both solutions and detailed rationales. To facilitate rigorous evaluation, we construct a new benchmark dataset and propose a unified evaluation framework with metrics specifically designed for long-form spatio-temporal reasoning. Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations further validate STReason’s credibility and practical utility, demonstrating its potential to reduce expert workload and broaden the applicability to real-world spatio-temporal tasks. We believe STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems.
Spatio-时空数据挖掘在各领域知情决策中发挥着关键作用,然而,现有模型往往局限于狭隘的任务,缺乏多任务性推断能力和复杂的长式推理能力,需要产生深入的解释性产出。这些局限性限制了其对现实世界多面决策情景的适用性。在这项工作中,我们引入了STReason,这是一个新颖的框架,将大型语言模型(LLMS)的推理力量与多任务性判断和执行的时空模型分析能力结合起来。在不要求任务性微调的情况下,STReason利用可应用性文体学习将复杂的自然语言查询转换到模块化、可解释性以及复杂的程序,然后系统地加以执行,以产生解决方案和详细的理由。为了便利严格的评估,我们建立一个新的基准数据集,并提议一个统一的评价框架,与专门为长式STReason-时空逻辑推理推理设计的指标相结合。实验结果表明,STReason大大超越了LM的高级基准,超越了所有指标性精确性、特别是超前、推理学性推理学和演化性推理学上的潜在工作量评估。
Article 157
Title@2025-06-25 (3): Computation Mechanism Behind LLM Position Generalization
Title: Computation Mechanism Behind LLM Position Generalization | Berechnungsmechanismus hinter LLM-Positionsverallgemeinerung | LLM 职位普遍化背后的计算机制 2503.13305v3 |
Authors (2): Chi Han, Heng Ji
Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs’ computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs’ position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs’ internal mechanisms.
与人类似, 大型语言模型(LLMS)在处理文本位置时表现出灵活性, 这是一种我们称之为定位一般化的现象。 他们可以理解带有位置扰动的文本, 并且比在培训最新技术过程中遇到的文本要长。 这些现象表明, LLMS可以容忍地处理位置, 但LLMS计算过程位置相关性如何基本上没有被探索。 这项工作将语言现象与LLMS的计算机制联系起来。 我们展示了LLMS如何对定位扰动的上述容忍性实施某些计算机制。 尽管自我注意机制的设计复杂, 这项工作表明LLMS学会了一种反直觉的注意力记录脱节。 它们的价值显示了0. 959线性相关性, 与定位相关性和语义重要性的计算总和相近。 此外, 我们确定了中间特征中的一种普遍模式, 我们证明这种模式在理论上能够产生这种效果。 与随机初始参数将表现如何不同, 这表明LMS是学习的行为而不是模型结构的自然结果。 在模型中, 我们提供了一种将模型的模型上的灵活性和跨级模型的计算方法。
Article 158
Title@2025-06-25 (3): Thought Anchors: Which LLM Reasoning Steps Matter?
Title: Thought Anchors: Which LLM Reasoning Steps Matter? | Thought Anchors: Welche LLM-Gründungsschritte sind wichtig? | 何为理据步骤? 2506.19143v2 |
Authors (4): Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence’s counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified “broadcasting” sentences that receive disproportionate attention from all future sentences via “receiver” attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence’s tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
理性语言模型最近在许多领域取得了最先进的成绩。然而,它们的长式思维逻辑链推理造成了解释性的挑战,因为每个生成的符号都取决于先前的所有符号,因此难以分解。我们认为,在句级分析推理痕迹是理解推理过程的一个很有希望的方法。我们提出了三种互补的归因方法:(1)黑箱方法,衡量每个句子的反事实重要性,比较了100个版本的最后答案,其条件是生成该句子的模型或具有不同含义的模型;(2)白箱法,将各对句子之间的注意模式集中起来,这种方法确定了通过“接收”引力头从所有未来判决中得到不相称注意的“广播”句子;(3)因减少对一个句子的注意和衡量对每个未来句子的影响而衡量各句子之间逻辑联系的因果关系的因果归属方法。每种方法都提供了证据,说明是否存在思想支柱,推理步骤的重要性过大,对随后的推理过程有不相称的影响;(2)这些思想库通常是规划或反向后跟踪的量刑模式。我们提供了一个透视源分析方法,用以展示一种透视源分析方法,展示了我们多层次分析。
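The black-box method in the abstract above compares final answers across rollouts with and without a given sentence. The following is a minimal sketch of that comparison, not the authors' implementation: the abstract does not say which distance is used, so total variation distance between the two empirical answer distributions is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def answer_distribution(answers):
    """Empirical distribution over final answers from a list of rollouts."""
    if not answers:
        raise ValueError("need at least one rollout")
    counts = Counter(answers)
    return {a: c / len(answers) for a, c in counts.items()}


def counterfactual_importance(answers_with, answers_without):
    """Total variation distance between the answer distributions obtained
    when the sentence is kept vs. replaced (assumed metric, see lead-in)."""
    p = answer_distribution(answers_with)
    q = answer_distribution(answers_without)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Toy rollouts: keeping the sentence always yields "A"; replacing it splits 50/50.
score = counterfactual_importance(["A"] * 100, ["A"] * 50 + ["B"] * 50)
```

A score near 0 means the sentence barely changes the final answer; a score near 1 means the answer distribution shifts almost entirely when the sentence is replaced.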
Article 159
Title@2025-06-24 (2): Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
Title: Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models | Lernen von Instruction-Following-Richtlinien durch offenes Instruction-Relabeling mit großen Sprachmodellen | 通过不限名额指令与大语言模式重新标签 2506.20061v1 |
Authors (4): Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang
Developing effective instruction-following policies in reinforcement learning remains challenging due to the reliance on extensive human-labeled instruction datasets and the difficulty of learning from sparse rewards. In this paper, we propose a novel approach that leverages the capabilities of large language models (LLMs) to automatically generate open-ended instructions retrospectively from previously collected agent trajectories. Our core idea is to employ LLMs to relabel unsuccessful trajectories by identifying meaningful subtasks the agent has implicitly accomplished, thereby enriching the agent’s training data and substantially alleviating reliance on human annotations. Through this open-ended instruction relabeling, we efficiently learn a unified instruction-following policy capable of handling diverse tasks within a single policy. We empirically evaluate our proposed method in the challenging Craftax environment, demonstrating clear improvements in sample efficiency, instruction coverage, and overall policy performance compared to state-of-the-art baselines. Our results highlight the effectiveness of utilizing LLM-guided open-ended instruction relabeling to enhance instruction-following reinforcement learning.
制定有效的强化学习指导政策仍然具有挑战性,因为依赖广泛的人类标签指导数据集,难以从微薄的回报中学习。在本文件中,我们提出一种新的方法,利用大型语言模型(LLMs)的能力,自动生成从先前收集的代理轨迹中回溯的不限指令。我们的核心想法是利用LLMs,通过查明代理人暗中完成的有意义的次级任务,重新标出不成功的轨迹,从而丰富该代理人的培训数据,并大大减轻对人类说明的依赖。通过这一开放式的重新标签,我们有效地学习了一种统一的遵循指令的政策,能够在一个单一的政策中处理不同的任务。我们实证地评估了我们在具有挑战性的手工艺环境的拟议方法,展示了抽样效率、指导覆盖面和总体政策绩效与最新基线相比的明显改进。我们的成果突出了使用LLMM指南指导的开放式重新标注的有效性,以加强教学强化学习。
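The relabeling idea described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline: `toy_labeler` is a hypothetical stand-in for the LLM, the event names are invented, and a relabeled example is assumed to be the trajectory prefix up to the step where a subtask was completed.

```python
from typing import Callable, Dict, List, Tuple

Step = Dict[str, str]  # one environment step, e.g. {"event": "collect_wood"}
Labeler = Callable[[List[Step]], List[Tuple[str, int]]]


def relabel_trajectory(trajectory: List[Step], propose: Labeler):
    """Turn one (possibly failed) trajectory into (instruction, prefix) pairs.

    `propose` returns (instruction, end_step) tuples naming subtasks the
    agent implicitly accomplished; each prefix becomes a training example.
    """
    examples = []
    for instruction, end in propose(trajectory):
        if 0 < end <= len(trajectory):  # ignore out-of-range proposals
            examples.append((instruction, trajectory[:end]))
    return examples


def toy_labeler(trajectory: List[Step]) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for the LLM: one instruction per notable event."""
    names = {"collect_wood": "Collect wood", "place_table": "Place a crafting table"}
    return [(names[s["event"]], i + 1)
            for i, s in enumerate(trajectory) if s["event"] in names]


failed = [{"event": "move"}, {"event": "collect_wood"},
          {"event": "move"}, {"event": "place_table"}]
pairs = relabel_trajectory(failed, toy_labeler)
```

Each pair can then be added to the instruction-following training set even though the original episode did not reach its intended goal.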
Article 160
Title@2025-06-24 (2): Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Title: Cross-Layer Discrete Concept Discovery for Interpreting Language Models | Cross-Layer Discrete Concept Discovery für Interpretationssprachmodelle | 解释语言模型的跨语言监听概念发现 2506.20040v1 |
Authors (4): Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou
Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose CLVQ-VAE, a framework that uses vector quantization to map representations across layers and in the process collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-$k$ temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with semantic structure in word embedding space.
由于剩余流线性混合和重复信息,掩盖了大语言模型中各种特征的演变方式,因此这些未覆盖的变压层新出现概念仍是一项重大挑战。当前研究工作主要检查单层神经显示,从而忽略了这种跨层的叠加和它带来的冗余。这些表示通常或者直接分析激活模式,或者传递给将它们映射成有限一组预设概念的探测分类者。为了解决这些局限性,我们提议 \gls{clvqvae} , 该框架使用矢量定量来绘制各层之间和整个过程的显示方式, 将重复的残余流特性转化为紧凑的、 可解释的概念矢量。 我们的方法在四分化过程中将基于温度的上- $ 的取样与 EMA 代码簿更新结合起来, 提供对离散潜伏空间的有控制的探索, 同时维护代码簿的多样性。 我们进一步加强框架, 以规模大的球基量 k- 工具+ 来进行代码簿初始化, 它按方向相似性而不是规模进行分组, 更好地与文字嵌入空间中的语体结构相协调。
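The two quantization ingredients named in the abstract, top-k temperature-based sampling and EMA codebook updates, can be sketched in a few lines. This is a generic VQ-style illustration, not the paper's code: using negative squared distances as logits, the decay value, and the initialization of `counts` to ones and `sums` to a copy of the codebook are all assumptions.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sample_code(x, codebook, k=3, temperature=1.0, rng=random):
    """Sample a code index among the k nearest codes, softmax over -distance/T."""
    dists = [sq_dist(x, c) for c in codebook]
    top = sorted(range(len(codebook)), key=dists.__getitem__)[:k]
    d_min = dists[top[0]]  # subtract the minimum for numerical stability
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def ema_update(codebook, counts, sums, x, idx, decay=0.99):
    """Exponential-moving-average update after assigning x to code idx."""
    for i in range(len(codebook)):
        counts[i] *= decay
        sums[i] = [decay * s for s in sums[i]]
    counts[idx] += 1 - decay
    sums[idx] = [s + (1 - decay) * xi for s, xi in zip(sums[idx], x)]
    # Only the selected code moves; for the others sums/counts is unchanged.
    codebook[idx] = [s / counts[idx] for s in sums[idx]]


codebook = [[0.0, 0.0], [10.0, 10.0]]
counts = [1.0, 1.0]
sums = [list(c) for c in codebook]
idx = sample_code([1.0, 1.0], codebook, k=1)  # k=1 reduces to nearest-code lookup
ema_update(codebook, counts, sums, [1.0, 1.0], idx, decay=0.9)
```

With k greater than 1, a higher temperature spreads assignments over more of the nearby codes, which is one way to keep rarely used codes from going unused.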
Article 161
Title@2025-06-24 (2): The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research
Title: The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research | Der lärmende Pfad von der Quelle zur Zitation: Wie Wissenschaftler sich mit vergangener Forschung auseinandersetzen | 《从来源到引用的噪音路径:衡量学者如何参与过去的研究》 2502.20581v3 |
Authors (3): Hong Chen, Misha Teplitskiy, David Jurgens
Academic citations are widely used for evaluating research and tracing knowledge flows. Such uses typically rely on raw citation counts and neglect variability in citation types. In particular, citations can vary in their fidelity as original knowledge from cited studies may be paraphrased, summarized, or reinterpreted, possibly wrongly, leading to variation in how much information changes from cited to citing paper. In this study, we introduce a computational pipeline to quantify citation fidelity at scale. Using full texts of papers, the pipeline identifies citations in citing papers and the corresponding claims in cited papers, and applies supervised models to measure fidelity at the sentence level. Analyzing a large-scale multi-disciplinary dataset of approximately 13 million citation sentence pairs, we find that citation fidelity is higher when authors cite papers that are 1) more recent and intellectually close, 2) more accessible, and 3) the first author has a lower H-index and the author team is medium-sized. Using a quasi-experiment, we establish the “telephone effect” - when citing papers have low fidelity to the original claim, future papers that cite the citing paper and the original have lower fidelity to the original. Our work reveals systematic differences in citation fidelity, underscoring the limitations of analyses that rely on citation quantity alone and the potential for distortion of evidence.
学术引文被广泛用于评估研究和追踪知识流。这类使用通常依赖原始引文的计数和引文类型中的忽视差异。特别是,引文在真实性上可能各有不同,因为引用研究的原始知识可能会被引言、摘述或重新解释,可能错误地导致从引用到引用论文的信息变化多少。在本研究中,我们引入了计算管道,以量化规模的引用真伪。利用文件全文,管道识别引文中的引文和引文中的相应主张中的引文,并应用受监督的模式衡量判决一级的忠贞度。分析大约1 300万对引文的大规模多学科数据集,我们发现当作者引用1件较近期和知识上接近、2更便于查阅和3)的文件时,引言的忠真性会更高,第一作者的H指数较低,而作者的团队则为中等规模。在引用论文时,我们确定了“电话效应”——当引用文件对原始主张的忠诚度较低时,我们发现引用原真性分析的原始证据,未来文件显示原始的精确性差异。
Article 162
Title@2025-06-24 (2): Evaluating Long Range Dependency Handling in Code Generation LLMs
Title: Evaluating Long Range Dependency Handling in Code Generation LLMs | Bewertung der Langzeitabhängigkeitsbehandlung in LLMs der Code-Generation | 评估代码生成中的长期依赖性处理 2407.21049v2 |
Authors (2): Yannick Assogba, Donghao Ren
As language models support larger and larger context sizes, evaluating their ability to make effective use of that context becomes increasingly important. We analyze the ability of several code generation models to handle long range dependencies using a suite of multi-step key retrieval tasks in context windows up to 8k tokens in length. The tasks progressively increase in difficulty and allow more nuanced evaluation of model capabilities than tests like the popular needle-in-the-haystack test. We find that performance degrades significantly for many models (up to 2x) when a function references another function that is defined later in the prompt. We also observe that models that use sliding window attention mechanisms have difficulty handling references further than the size of a single window. We perform simple prompt modifications using call graph information to improve multi-step retrieval performance up to 3x. Our analysis highlights ways that long-context performance needs deeper consideration beyond retrieval of single facts within a document.
由于语言模型支持更大和更大的上下文大小,因此评价其有效利用该背景的能力变得日益重要。我们分析数个代码生成模型的能力,以便利用上下文窗口中长达8k个符号的一组多步关键检索任务处理长距离依赖性。任务难度逐渐增加,比流行的针头在河内试验等测试对模型能力进行更细致的评估。我们发现,对于许多模型(高达2x)来说,当函数引用了较晚在提示中界定的另一个函数时,性能会显著退化。我们还注意到,使用滑动窗口注意机制的模型比单一窗口的大小更难以进一步处理引用。我们利用呼唤图信息进行简单的快速修改,以提高多步检索性能,达到3x。我们的分析突出表明,除了检索文件中的单一事实之外,长文字性表现需要更深入的考虑。
Article 163
Title@2025-06-24 (2): Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
Title: Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs | Sprachmodelle lernen seltene Phänomene aus weniger seltenen Phänomenen: Der Fall der vermissten AANNs | 语言模型从少见少见的神话中学习罕见现象:失踪的AANNs案例 2403.19827v3 |
Authors (2): Kanishka Misra, Kyle Mahowald
Language models learn rare syntactic phenomena, but the extent to which this is attributable to generalization vs. memorization is a major open question. To that end, we iteratively trained transformer language models on systematically manipulated corpora which were human-scale in size, and then evaluated their learning of a rare grammatical phenomenon: the English Article+Adjective+Numeral+Noun (AANN) construction (“a beautiful five days”). We compared how well this construction was learned on the default corpus relative to a counterfactual corpus in which AANN sentences were removed. We found that AANNs were still learned better than systematically perturbed variants of the construction. Using additional counterfactual corpora, we suggest that this learning occurs through generalization from related constructions (e.g., “a few days”). An additional experiment showed that this learning is enhanced when there is more variability in the input. Taken together, our results provide an existence proof that LMs can learn rare grammatical phenomena by generalization from less rare phenomena. Data and code: https://github.com/kanishkamisra/aannalysis.
语言模型学会了稀有的合成现象,但是,这在多大程度上可以归结于一般化与记忆化是一个主要的未决问题。为此,我们反复培训了关于人类规模规模的系统操纵的巨体的变压器语言模型,然后评价了它们学习罕见的语法现象:英语条款+目标+Numeral+Noumeral+Nouun(ANN)的构造(“一个美丽的五天”)。我们比较了这一构造在默认的元素中与取消ANN判决的反事实材料相比的学习程度。我们发现,AANNS仍然比系统地绕绕过构造的变异体学习得更好。我们利用额外的反现实的复合体,我们建议这种学习是通过相关构造的概括化(例如“几天”)进行的。另一个实验表明,当投入的变异性更大时,这种学习会得到加强。我们的结果一起提供了一种存在的证据,即LMS能够从不那么罕见的现象中学习稀有的语法现象。数据和代码: https://gistrambas/kamas.commas.
Article 164
Title@2025-06-24 (2): Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
Title: Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning | Persona-Assigned Große Sprachmodelle zeigen Menschen-wie motivierte Vernunft | 人与人之间签署的大型语言模型 2506.20020v1 |
Authors (4): Saloni Dash, Amélie Reymond, Emma S. Spiro, Aylin Caliskan
Reasoning in humans is prone to biases due to underlying motivations, such as identity protection, that undermine rational decision-making and judgment. This motivated reasoning at a collective level can be detrimental to society when debating critical issues such as human-driven climate change or vaccine safety, and can further aggravate political polarization. Prior studies have reported that large language models (LLMs) are also susceptible to human-like cognitive biases; however, the extent to which LLMs selectively reason toward identity-congruent conclusions remains largely unexplored. Here, we investigate whether assigning 8 personas across 4 political and socio-demographic attributes induces motivated reasoning in LLMs. Testing 8 LLMs (open source and proprietary) across two reasoning tasks from human-subject studies – veracity discernment of misinformation headlines and evaluation of numeric scientific evidence – we find that persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. Political personas, specifically, are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. Prompt-based debiasing methods are largely ineffective at mitigating these effects. Taken together, our empirical findings are the first to suggest that persona-assigned LLMs exhibit human-like motivated reasoning that is hard to mitigate through conventional debiasing prompts – raising concerns of exacerbating identity-congruent reasoning in both LLMs and humans.
由于身份保护等基本动机,如身份保护、破坏合理决策和判断等,人类的理性理由容易受到偏见。在集体一级,这种动机性推理在集体一级辩论人类驱动的气候变化或疫苗安全等关键问题时可能有害于社会,并可能进一步加剧政治两极分化。先前的研究报告说,大型语言模式(LLMS)也容易受到人性认知偏见的影响,然而,选择性的LLMS对身份和谐结论的认知程度在很大程度上仍未得到探讨。在这里,我们调查在4个政治和社会人口特征中分配8人是否引发LMS的动机性推理。测试8 LLMS(开放源和专有)在人类主题研究的两项推理任务 – – 误差头条的真实辨别和数字科学证据的评价 – – 我们发现,人性LMS相对于没有人性的模型,其真实性辨识度高达9%。具体来说,当地面真相与其诱导的政治身份一致时,我们更可能更准确地评价枪支控制的科学证据。在人类的推理学上,基于准确性裁断法的8LMS-在减轻人类的推理学上的推理上的推理上的推论,在减轻了人类的推理上的推理上的推理上的推理上的推理的推理的推理都是不力。
Article 165
Title@2025-06-24 (2): Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks
Title: Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks | Präzise und energieeffizient: Modelle der lokalen Retrieval-Augmented Generation übertreffen kommerzielle Großsprachenmodelle in medizinischen Aufgaben | 准确性和能源效率:当地检索和推荐的一代模型在医疗任务中超效商业大语言模型 2506.20009v1 |
Authors (2): Konstantinos Vrettos, Michail E. Klontzas
Background: The increasing adoption of Artificial Intelligence (AI) in healthcare has sparked growing concerns about its environmental and ethical implications. Commercial Large Language Models (LLMs), such as ChatGPT and DeepSeek, require substantial resources, while the utilization of these systems for medical purposes raises critical issues regarding patient privacy and safety. Methods: We developed a customizable Retrieval-Augmented Generation (RAG) framework for medical tasks, which monitors its energy usage and CO2 emissions. This system was then used to create RAGs based on various open-source LLMs. The tested models included both general-purpose models such as llama3.1:8b and the medical-domain-specific medgemma-4b-it. The performance and energy consumption of the best RAGs were compared to DeepSeekV3-R1 and OpenAI's o4-mini model. A dataset of medical questions was used for the evaluation. Results: Custom RAG models outperformed commercial models in accuracy and energy consumption. The RAG model built on llama3.1:8b achieved the highest accuracy (58.5%) and was significantly better than other models, including o4-mini and DeepSeekV3-R1. The llama3.1-RAG also exhibited the lowest energy consumption and CO2 footprint among all models, with a Performance per kWh of 0.52 and a total CO2 emission of 473g. Compared to o4-mini, the llama3.1-RAG achieved 2.7 times more accuracy points per kWh and 172% less electricity usage while maintaining higher accuracy. Conclusion: Our study demonstrates that local LLMs can be leveraged to develop RAGs that outperform commercial, online LLMs in medical tasks, while having a smaller environmental impact. Our modular framework promotes sustainable AI development, reducing electricity usage and aligning with the UN's Sustainable Development Goals.
在医疗保健中越来越多地采用人工智能(AI)的背景已经引起人们对其环境和道德影响的日益关切。商业大语言模型(LLM),如ChattGPT和DeepSeek等,需要大量的资源,而将这些系统用于医疗目的,则提出了有关病人隐私和安全的关键问题。我们为医疗任务开发了一个定制的检索增强一代(RAG)框架,用于监测其能源使用情况和CO2排放。这个系统随后被用来在各种公开源代码LLM的基础上创建RA。经过测试的模型既包括一般用途模型,如:Lama33.1:8b和Medgem4ma-4b-it,这是医学领域特有的。而将这些系统的最佳性能和能源消耗量与DeepSeggV3-R1和OpAIs o4-Mis模型相比较。我们为医疗任务定制的RAG模型在精确度和能源消耗量方面优于更小的商业模型。RAGF3:8B达到最高精度(58.5 % ),而且每模型都比其他模型(包括O4-MLM)的性成本成本),同时,我们的所有O4-Mex-MSDMLMSD 也显示所有的性能模型可以降低的性能能模型。
Article 166
Title@2025-06-24 (2): Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
Title: Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ | Können Sprachmodelle Programmierer für Coding ersetzen? REPOCOD sagt ‘Noch nicht’ | 语言模式能替换编码程序程序员吗? REPOCOD 说“ 还没有” 。 2410.21647v4 |
Authors (4): Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan
Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like HumanEval and MBPP. Thus, a natural question is, would LLMs have similar performance in real-world coding tasks as their performance in these benchmarks? Unfortunately, one cannot answer this question, since these benchmarks consist of short completions, synthetic examples, or focus on limited scale repositories, failing to represent real-world coding tasks. To address these challenges, we create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects and appropriate metrics for evaluating source code. It includes 980 whole-function generation tasks from 11 popular projects, 50.8% of which require repository-level context. REPOCOD includes 314 developer-written test cases per instance for better evaluation. We evaluate ten LLMs on REPOCOD and find that none achieves more than 30% pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development. In addition, we found that retrieval-augmented generation achieves better results than using target function dependencies as context.
最近,一些仓库级代码生成基准(如CoderEval、DevEval、RepoEval、RepoEval、Repobench、RepoBench和LongCodeArena)已经出现,以评价大型语言模型(LLMs)的能力,而这种模型超出了人类经济学和MBP等独立基准的范围。因此,一个自然的问题是,LLOMs在现实世界的编码任务中是否具有与这些基准的业绩相类似的性能?不幸的是,无法回答这一问题,因为这些基准包括短期完成、合成实例或侧重于有限规模的储存库,无法代表现实世界的编码任务。为了应对这些挑战,我们创建了REPOCOD,这是一个Python 代码生成基准,其中包含在现实世界大型项目中具有实际依赖性依赖性的复杂任务以及评估源码的适当指标。它包括11个流行项目中的980项全功能生成任务,其中50.8%需要具备储存级别的背景。REPOCD包括314个开发者编写的测试案例,以更好地评估。我们评估了REOCDD,发现在REPD上没有取得超过30%的成绩,我们在Sreal-OD上能够更有力地建立更强有力的软件开发公司。
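The abstract reports pass@1 but does not say how it is computed. As a sketch, the widely used unbiased pass@k estimator introduced alongside HumanEval is shown below; whether REPOCOD uses exactly this estimator is an assumption, and the function name and sample numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples generated, c of them pass all tests.

    Equals 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark-level score: average the per-task estimates.
per_task = [pass_at_k(10, 3, 1), pass_at_k(10, 0, 1), pass_at_k(10, 10, 1)]
benchmark_pass_at_1 = sum(per_task) / len(per_task)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass the tests for that task.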
Article 167
Title@2025-06-24 (2): A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
Title: A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior | Ein Spatio-Temporal-Punkt-Verfahren zur feinkörnigen Modellierung des Leseverhaltens | 阅读行为精细模拟模型的斯帕迪奥时点进程 2506.19999v1 |
Authors (5): Francesco Ignazio Re, Andreas Opedal, Glib Manaiev, Mario Giulianelli, Ryan Cotterell
Reading is a process that unfolds across space and time, alternating between fixations where a reader focuses on a specific point in space, and saccades where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader’s fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. The duration time of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model’s predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.
读取是一个跨越时空的过程, 由读者关注空间特定点的固定点与读者快速将焦点转向新点的累进式交错交错。 心理语言学的解说式是, 模拟读者的固定点和累进式能够让读者对在线句子处理过程有洞察力。 但是, 这种建模的标准方法依赖于集成的眼跟踪测量和模型, 并会强加强烈的假设, 忽略阅读过程中出现的许多时空动态。 在本文中, 我们提出一个更普遍的阅读行为概率模型, 其基础是明确的瞬时空点, 以明确的预入点进程为基础, 不仅捕捉到固定点的固定点持续多久, 而且还捕捉到它们降落在空间的位置和时间上的位置。 建模式模型的模型模型使用每个固定在时间和空间附近进行新的固定的概率。 固定事件的时间时间只能被建模成一个固定具体预测模型的功能, 在时间上进行精确的精确度变化, 从而捕捉到我们排序前期的周期, 显示一个更精确的预进过程 。 将我们观察的观察到一个模拟的预进式观察过程, 。 观察到一个更精确地观察到一个比我们更精确的预进进进进的预结果。
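The self-exciting behavior of a Hawkes process mentioned above, where each fixation raises the probability of another one occurring nearby, can be illustrated with a conditional intensity function. This sketch is temporal only and uses an exponential kernel with made-up parameter values; the paper's model is spatio-temporal and its kernel is not given in the abstract, so both choices are assumptions.

```python
import math


def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i)).

    mu    : baseline event rate
    alpha : jump in intensity caused by each past event
    beta  : how quickly that excitation decays
    Only events strictly before t contribute.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)


fixation_times = [0.0, 0.3]  # hypothetical fixation onsets, in seconds
soon_after = hawkes_intensity(0.35, fixation_times)
much_later = hawkes_intensity(5.0, fixation_times)
```

The intensity is elevated right after a burst of events and relaxes back toward the baseline `mu` as time passes without new events.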
Article 168
Title@2025-06-24 (2): WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development
Title: WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development | WAFFLE: Feinsteuerungs-Multi-Modal-Modell für automatisierte Front-End-Entwicklung | WAFFLE: 自动前端开发的微调多模式模型 2410.18362v2 |
Authors (4): Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan
Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML’s hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs’ understanding of HTML’s structure and a contrastive fine-tuning approach to align LLMs’ understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
由于HTML的等级结构和风格的复杂性,对于初学者和有经验的开发者来说,将UI设计转换成功能性网页可能是困难的。虽然大语言模型在生成源代码方面显示出希望,但在UI-HTML代码生成方面仍然存在两大挑战:(1) 有效代表HTML的LLM的等级结构,(2) 缩小UI设计视觉性质与HTML代码基于文本的格式之间的差距。为了应对这些挑战,我们引入了Waffle(Waffle)这一新的微调战略,它使用一种结构觉悟的注意机制来提高LAM对HTML结构的理解,以及一种对比性微调方法来调整LLMMS对UI图像和 HTML代码的理解。用Waffle(百分比)微调的模型显示高达9.00 pp(百分比) 更高HTML匹配值,0.0982 更高CW-SSIM, 32.99 更高CLIP和27.12 pp(LEM)在我们的新基准WebSight-ight Test和现有基准设计2Code, 优于当前的微调方法。
Article 169
Title@2025-06-24 (2): Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation
Title: Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation | Doc2Agent: Skalierbare Generierung von Tool-Using Agents aus API-Dokumentation | Doc2Agency: 从 API 文件中可缩放生成工具使用代理器 2506.19998v1 |
Authors (5): Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, Pengyu Hong
REST APIs play important roles in enriching the action space of web agents, yet most API-based agents rely on curated and uniform toolsets that do not reflect the complexity of real-world APIs. Building tool-using agents for arbitrary domains remains a major challenge, as it requires reading unstructured API documentation, testing APIs and inferring correct parameters. We propose Doc2Agent, a scalable pipeline to build agents that can call Python-based tools generated from API documentation. Doc2Agent generates executable tools from API documentation and iteratively refines them using a code agent. We evaluate our approach on real-world APIs, WebArena APIs, and research APIs, producing validated tools. We achieved a 55% relative performance improvement with 90% lower cost compared to direct API calling on the WebArena benchmark. A domain-specific agent built for glycomaterial science further demonstrates the pipeline’s adaptability to complex, knowledge-rich tasks. Doc2Agent offers a generalizable solution for building tool agents from unstructured API documentation at scale.
REST APIs在丰富网络代理商的行动空间方面发挥着重要作用,然而,大多数基于API的代理商依赖没有反映真实世界API复杂性的整理和统一工具。为任意领域建立工具使用代理商仍是一项重大挑战,因为它需要阅读无结构的 API 文件,测试API 并推断正确的参数。我们提议Doc2Agency,这是一个可扩缩的管道,用于建立可以称为以Python为基础的工具的代理商。 doc2Agency通过API文件生成可执行的工具,并使用代码代理对工具进行迭代完善。我们评估了我们关于实际世界API、WebArena API 和研究API 的方法,并制作了经验证的工具。我们取得了55-相对的绩效改进,比在WebArena基准上直接要求的API成本低90。为精密材料科学而建造的域特定代理商进一步证明了管道对复杂、丰富知识的任务的适应性。Doc2Agency为从非结构化的ATI文件中构建工具提供了一种通用的解决方案。
Article 170
Title@2025-06-24 (2): When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
Title: When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour | Wenn große Sprachmodelle Menschen widersprechen? Das sykophantische Verhalten von Large Language Models | 当大语言模型与人类相矛盾时? 2311.09410v4 |
Authors (2): Leonardo Ranaldi, Giulia Pucci
Large Language Models have been demonstrating broadly satisfactory generative abilities for users, which seems to be due to the intensive use of human feedback that refines responses. Nevertheless, suggestibility inherited via human feedback increases the inclination to produce answers corresponding to users’ viewpoints. This behaviour is known as sycophancy and depicts the tendency of LLMs to generate misleading responses as long as they align with humans. This phenomenon induces bias and reduces the robustness and, consequently, the reliability of these models. In this paper, we study the suggestibility of Large Language Models (LLMs) to sycophantic behaviour, analysing these tendencies via systematic human-intervention prompts over different tasks. Our investigation demonstrates that LLMs have sycophantic tendencies when answering queries that involve subjective opinions and statements that should elicit a contrary response based on facts. In contrast, when faced with math tasks or queries with an objective answer, they, at various scales, do not follow the users’ hints, demonstrating confidence in generating the correct answers.
大型语言模型展示了用户广泛令人满意的基因化能力,这似乎是由于大量使用改进反应的人类反馈的结果。然而,通过人类反馈留下的建议性提高了产生与用户观点相对应的答案的倾向。这种行为被称为偏执,并描绘了LLMs只要与人类一致就会产生误导反应的倾向。这种现象诱发偏见,降低这些模型的稳健性,从而降低这些模型的可靠性。在本文件中,我们研究了大语言模型(LLMs)对共性行为的建议性,通过系统性的人类干预分析这些趋势,对不同任务进行了敏锐的分析。我们的调查表明,LLMs在回答涉及主观意见和声明的询问时,具有共性倾向,而这种主观意见和声明应当根据事实引起相反的反应。相反,在面对数学任务或有客观答案的询问时,它们在不同尺度上不遵循用户的提示,通过显示对得出正确答案的信心。
Article 171
Title@2025-06-24 (2): FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
Title: FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs | FactCheckmate: Vorbeugend Halluzinationen in LMs erkennen und abschwächen | 事实:先发制人地探测和减轻LMM的幻觉 2410.02899v2 |
Authors (4): Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng
Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model’s hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM’s hidden states such that the model will produce more factual outputs. FactCheckmate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both its detection and mitigation models are lightweight, adding little inference overhead; FactCheckmate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckmate over LMs of different scales and model families (including Llama, Mistral, Qwen and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without.
语言模型(LMS) 幻觉。 我们询问 : 我们能否在幻觉发生之前发现并减轻幻觉? 这项工作以正面的方式回答这个研究问题, 显示LMS的内部表现提供了可用于此目的的丰富信号。 我们引入了“ 事实检查” , 通过学习一个分类器先先先先检测幻觉, 该分类器预测LM(LM)是否会产生幻觉, 该分类器在开始解码之前, 预测它是否会产生幻觉。 如果检测到幻觉, 事实检查员然后通过调整LM( 包括Llama、Mistral、Qwen和Gemma) 的隐藏状态进行干预, 这样模型就能产生更多真实的输出结果。 事实检查长提供了新的洞察, 它的检测和缓解模型都是轻量的, 增加了间接的推论; 事实检查结果证明, 与许多热后替代品相比, 减轻幻觉效果的方法更有效率。 我们评估不同规模和模范家庭( 包括Llama、Mistral、Qwen和Gemma) 跨越各种QAAA前的精确度, 从不同领域取得的70- chest- pactal结果。
Article 172
Title@2025-06-24 (2): Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs
Title: Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs | Inferenz-Skalierte GraphRAG: Verbesserung der Multi-Hop-Fragebeantwortung auf Wissensgraphen | 推推缩比例图RAG:改进知识图多位数问题的回答 2506.19967v1 |
Authors (5): Travis Thompson, Seung-Hwan Lim, Paul Liu, Ruoying He, Dongkuan Xu
Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
大型语言模型(LLMS)在语言理解和生成方面取得了令人印象深刻的能力,但是由于获得结构化背景和多光速信息的渠道有限,它们仍然在知识密集型推理任务上表现不佳。 检索原始一代(RAG)通过在检索环境下进行地面生成,部分缓解了这一点,但常规RAG和GreagraG方法往往未能在知识图表中捕捉到跨节点的关联结构。 我们引入了推论Serence-Squald GrapRAG,这是一个通过应用推论-时间计算比例来增强基于LLM的图形推理的新框架。 我们的方法将顺序缩放与深层次的深层深层图解剖和平行缩幅与多数人投票相结合,在跨断式推理-执行循环中对抽样轨迹进行投票。 GRBEBENC基准实验表明,我们的方法大大改进了多光度回答问题的业绩,在传统的 GrapRAG和先前的图式推理学基线上都取得了重大收益。这些结论表明,推判时间的推算是一种实用和结构化的建筑学解决办法。
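The parallel-scaling step described above, majority voting over sampled trajectories, amounts to picking the most frequent final answer. A minimal sketch follows; the tie-breaking rule (earliest-seen answer wins) and the example answers are assumptions, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:  # scan in sampling order for deterministic tie-breaks
        if counts[a] == best:
            return a


# Final answers from five independently sampled graph-traversal trajectories.
sampled = ["Marie Curie", "Pierre Curie", "Marie Curie", "Marie Curie", "Irene Joliot-Curie"]
final_answer = majority_vote(sampled)
```

Sampling more trajectories spends more inference-time compute in exchange for a vote that is less sensitive to any single faulty traversal.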
Article 173
Title@2025-06-24 (2): CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
Title: CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation | CycleDistill: Bootstrapping Maschine Übersetzung mit LLMs mit Cyclical Destillation | 循环蒸馏:用具有周期蒸馏作用的LLMML 制动机械翻译 2506.19952v1 |
Authors (3): Deepon Halder, Thanmay Jayakumar, Raj Dabre
Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high quality machine translation (MT). However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach leveraging LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill involves iteratively generating synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, which is then used to fine-tune the model that was used for generating said data for MT. CycleDistill does not need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments focusing on three Indian languages, by relying solely on monolingual corpora, it can achieve high-quality machine translation, improving upon a few-shot baseline model by over 20-30 chrF points on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
大型语言模型(LLMS)尽管有能力进行几发机器翻译(MT),但往往落后于在平行公司(MT)方面接受过培训的专用MT系统,而这些系统对于高质量机器翻译(MT)至关重要,然而,对低资源语言而言,平行公司往往很少或根本不存在。在本文中,我们建议采用循环蒸馏(CyleDistilling)方法,利用LMS和几发翻译(LLMMS)来获得高质量的MT系统。循环蒸馏(LLMS)涉及通过零发或几发MT,从单语公司中迭接地生成合成平行平行的合成平行公司,然后用于微调用于生成MT所述数据的模型。循环蒸馏(Cocora)不需要超过1至4发光谱实例的平行公司,在我们以三种印度语言为重点的实验中,它只依靠单语类公司,就可以实现高质量的机器翻译,在平均的蒸馏过程中通过20至30分的分法改进了几发基准模型。我们还研究利用软式激活作用的效果,并观察了轻度翻译质量的改进了。
Article 174
Title@2025-06-24 (2): Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation
Title: Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation | Aug2Search: Verbesserung der Facebook-Marktplatzsuche mit LLM-generierter Synthetischer Datenvergrößerung | Oug2Search:利用LLM光化合成数据增强功能,加强Facebook市场搜索 2505.16065v3 |
Authors (7): Ruijie Xi, He Ba, Hao Yuan, Rishu Agrawal, Yuxin Tian, Ruoyan Kong, Arul Prakash
Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic match between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and details needed for effective EBR model training, limiting the models’ ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models, in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzes its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement data or original data (e.g., “Click” and “Listing Interactions”), synthetic data, and a mixture of both engagement and synthetic data to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or a mixture of original and synthetic data.
嵌入式检索( EBR) 是现代搜索引擎的重要技术, 使得搜索查询和相关结果之间能够进行语义匹配。 然而, Facebook Marketplace 等平台上的搜索记录数据缺乏有效的 EBR 模型训练所需的多样性和细节, 限制了模型捕捉细微搜索模式的能力。 为了应对这一挑战, 我们提出 Aug2Search, 一个以 EBR 为基础的框架, 利用 Generative AI (GenAI) 模型生成的合成数据, 以多模态和多任务方式优化查询与产品的相关性。 本文调查 GenAI, 特别是大语言模型( LLM) 生成高质量合成数据的能力, 并分析其对增强 EBR 模型的影响。 我们使用8个 Llama 模型和来自 Facebook Marketplace 日志的1亿个数据点进行了实验。 我们的合成数据生成遵循三种策略: (1) 生成查询, (2) 强化产品列表, 以及 (3) 从强化后的列表生成查询。 我们在三种不同的数据集上训练 EBR 模型: 抽样的互动数据或原始数据( 例如 “Click” 和 “Listing Interactions”)、合成数据, 以及互动数据与合成数据的混合, 以评估它们在不同训练集上的表现。 我们的结果表明, Llama 模型能够稳健地生成连贯性、相关性和多样性较高的合成查询和列表, 同时保持较低的幻觉水平。 使用1亿个合成数据样本时, Aug2Search 在 ROC_AUC 上最多提升 4%, 证明了我们方法的有效性。 此外, 实验显示, 在训练数据量相同的情况下, 仅用合成数据训练的模型往往优于仅用原始数据或原始与合成混合数据训练的模型。
Article 175
Title@2025-06-24 (2): GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models
Title: GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models | GlyphPattern: Ein abstrakter Mustererkennungs-Benchmark für Vision-Language-Modelle | GlyphPattern:视觉语言模型的抽象模式识别基准 2408.05894v2 |
Authors (3): Zixuan Wu, Yoolim Kim, Carolyn Jane Anderson
Vision-Language Models (VLMs) building upon the foundation of powerful large language models have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954-item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles. GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed error analysis reveals challenges at multiple levels, including visual processing, natural language understanding, and pattern generalization.
在强大的大型语言模型基础上建立的视觉-语言模型(VLM)在视觉和文字数据的推理方面取得了迅速的进展。虽然VLM在所训练的视觉任务方面表现良好,但我们的结果凸显了抽象模式识别方面的主要挑战。我们提出了包含954个条目的数据集GlyphPattern,该数据集将来自40个书写系统的318条人工撰写的视觉模式描述与三种视觉呈现风格配对。GlyphPattern评估VLM的抽象模式识别能力,要求模型理解和判断视觉模式的自然语言描述。GlyphPattern的模式来自对人类书写系统的大规模认知科学调查,因此它们具有丰富的空间指称和组合性。我们的实验显示,GlyphPattern对最先进的VLM具有挑战性(GPT-4o的准确率仅为55%),少样本提示只带来边际收益。我们的详细错误分析揭示了多个层面的挑战,包括视觉处理、自然语言理解和模式泛化。
Article 176
Title@2025-06-24 (2): ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Title: ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing | ScaleCap: Inferenzzeitskalierbare Bildunterschriften über Dual-Modality-Debiasing | ScaleCap:通过双模态去偏实现推理时可扩展的图像描述 2506.19848v1 |
Authors (13): Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias results in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption as the inference budget increases. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases the richness and fidelity of its generated captions on two additional tasks: replacing images with captions in a VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
本文提出 ScaleCap, 一种推理时可扩展的图像描述策略, 用于生成全面而详细的图像描述。 高质量图像描述的关键挑战在于 LVLM 的内在偏差: 多模态偏差导致描述粒度不平衡, 对某些元素作详细说明, 而对其他元素一带而过; 语言偏差导致对不存在的物体产生幻觉描述。 为了解决这些问题, 我们提出一种可扩展的去偏描述策略, 随着推理预算的增加不断丰富和校准描述。 具体而言, 我们提出两个新组件: 启发式问答和对比式句子评分。 前者根据图像生成针对具体内容的问题并加以回答, 逐步将相关信息注入描述。 后者采用句子级离线对比解码, 有效识别并消除由语言偏差引起的幻觉。 随着推理成本的增加, ScaleCap 会提出更多启发式问题, 逐步捕捉更多视觉细节, 生成更准确、更平衡、信息更丰富的描述。 大量模态对齐实验证明了 ScaleCap 的有效性。 用 ScaleCap 标注 45 万张图像并将其用于 LVLM 预训练, 在 11 个广泛使用的基准上带来了一致的性能提升。 此外, ScaleCap 还通过两项额外任务展示了所生成描述的丰富性和保真度: 在 VQA 任务中用描述替换图像, 以及根据描述重建图像以评估语义覆盖。 代码见 https://github.com/Cooperx521/ScaleCap
Article 177
Title@2025-06-24 (2): Orthogonal Finetuning Made Scalable
Title: Orthogonal Finetuning Made Scalable | Orthogonale Feinsteuerung aus skalierbarem Material | 可缩放 2506.19847v1 |
Authors (4): Zeju Qiu, Weiyang Liu, Adrian Weller, Bernhard Schölkopf
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
为了克服这一点,我们建议采用高参数效率的调整,同时防止灾难性的遗忘,但高运行时间和内存要求限制实际部署。我们确定KOT中的核心计算瓶颈是其重心执行,它依赖于成本高昂的矩阵-矩阵乘数乘数和立方复杂度。为了克服这一缺陷,我们建议采用以输入为中心的调整,而不是使用矩阵-矢量乘数(即不使用矩阵计算),降低四面形计算成本。我们进一步引入了Cayley-Neumann参数化,这是一个高效的或体向参数化,通过快速的Neumann系列来接近Cayley变换的矩阵。这些修改使得OFTv2能够在不减损性能的情况下完成多达10x的快速培训和3x较低的GPU内存用量。此外,我们扩大OTv2来支持微调四分立基础模型,并显示它在培训稳定性、效率和记忆使用方面超过了流行的 QLORA。
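The idea behind the Cayley-Neumann parameterization can be illustrated with a small, dependency-free sketch. It assumes one common form of the Cayley transform, R = (I + Q)(I - Q)^{-1} with Q skew-symmetric, and replaces the inverse with the truncated Neumann series I + Q + ... + Q^k, which is only a good approximation when the norm of Q is below 1. The helper names are made up for this example; this is not the OFTv2 implementation.

```python
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def neumann_inverse(q: Matrix, terms: int) -> Matrix:
    """Approximate (I - Q)^{-1} by I + Q + Q^2 + ... + Q^terms."""
    total = identity(len(q))
    power = identity(len(q))
    for _ in range(terms):
        power = matmul(power, q)
        total = add(total, power)
    return total


def cayley_neumann(q: Matrix, terms: int) -> Matrix:
    """Approximate the Cayley transform (I + Q)(I - Q)^{-1} without a matrix inverse."""
    return matmul(add(identity(len(q)), q), neumann_inverse(q, terms))


def orthogonality_error(r: Matrix) -> float:
    """Largest entry of |R^T R - I|; zero for an exactly orthogonal R."""
    gram = matmul(transpose(r), r)
    eye = identity(len(r))
    return max(abs(gram[i][j] - eye[i][j])
               for i in range(len(r)) for j in range(len(r)))
```

For a small skew-symmetric matrix such as `[[0.0, 0.1], [-0.1, 0.0]]`, adding more series terms drives `orthogonality_error` toward zero, which is the trade-off between cost and exact orthogonality that a truncated series implies.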
Article 178
Title@2025-06-24 (2): MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration
Title: MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration | MAM: Modulares Multi-Agent Framework für multi-Modale medizinische Diagnose über rollenspezialisierte Zusammenarbeit | MAM:通过作用专业化协作进行多模式医学诊断的模块多机构框架 2506.19835v1 |
Authors (3): Yucheng Zhou, Lingran Song, Jianbing Shen
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.
医学大语言模型(LLM)最近的进展展示了它们强有力的推理和诊断能力。尽管取得了成功,但目前统一的多模态医学大语言模型在知识更新成本、全面性和灵活性方面都面临限制。为了应对这些挑战,我们引入了用于多模态医学诊断的模块化多智能体框架(MAM)。我们的实证结果突出表明了角色分配和诊断辨别在LLM中的好处,受此启发,MAM将医疗诊断过程分解为专门角色:全科医生、专家小组、放射科医生、医疗助理和主任,每个角色都由一个基于LLM的智能体担任。这个模块化的协作框架使得能够高效地更新知识,并利用现有的医学大语言模型和知识库。在一系列可公开获取的多模态医学数据集上进行的广泛实验评估涵盖文本、图像、音频和视频模态,表明MAM始终优于特定模态的LLM。值得注意的是,与基线模型相比,MAM取得了18%到365%的显著性能提升。我们的代码发布在 https://github.com/yczhou001/MAM
Article 179
Title@2025-06-24 (2): How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text?
Title: How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text? | Wie effektiv können BERT-Modelle den Kontext interpretieren und Bengali gemeinschaftlichen gewalttätigen Text erkennen? | BERT模型如何有效地解释背景和检测孟加拉社区暴力文本? 2506.19831v1 |
Authors (5): Abdullah Khondoker, Enam Ahmed Taufik, Md. Iftekhar Islam Tashik, S M Ishtiak Mahmud, Farig Sadeque
The spread of cyber hatred has led to communal violence, fueling aggression and conflicts between various religious, ethnic, and social groups, posing a significant threat to social harmony. Despite its critical importance, the classification of communal violent text remains an underexplored area in existing research. This study aims to enhance the accuracy of detecting text that incites communal violence, focusing specifically on Bengali textual data sourced from social media platforms. We introduce a fine-tuned BanglaBERT model tailored for this task, achieving a macro F1 score of 0.60. To address the issue of data imbalance, our dataset was expanded by adding 1,794 instances, which facilitated the development and evaluation of a fine-tuned ensemble model. This ensemble model demonstrated an improved performance, achieving a macro F1 score of 0.63, thus highlighting its effectiveness in this domain. In addition to quantitative performance metrics, qualitative analysis revealed instances where the models struggled with context understanding, leading to occasional misclassifications, even when predictions were made with high confidence. Through analyzing the cosine similarity between words, we identified certain limitations in the pre-trained BanglaBERT models, particularly in their ability to distinguish between closely related communal and non-communal terms. To further interpret the model’s decisions, we applied LIME, which helped to uncover specific areas where the model struggled in understanding context, contributing to errors in classification. These findings highlight the promise of NLP and interpretability tools in reducing online communal violence. Our work contributes to the growing body of research in communal violence detection and offers a foundation for future studies aiming to refine these techniques for better accuracy and societal impact.
网络仇恨的蔓延导致了社区暴力,助长了各种宗教、族裔和社会群体之间的侵略和冲突,对社会和谐构成了重大威胁。尽管社区暴力文本的分类至关重要,但其分类仍然是现有研究中一个未得到充分探讨的领域。本研究的目的是提高检测煽动社区暴力的文本的准确性,特别侧重于来自社交媒体平台的孟加拉文本数据。我们引入了为这项任务量身定制的经微调的孟加拉语文本数据模型,得出了0.60分的宏观F1分。为了解决数据不平衡问题,我们的数据集增加了1 794个案例,这促进了精心调整的混合模型的发展和评价。这一组合模型展示了一种更好的业绩,实现了0.63分的宏观F1分,从而突出了其在这一领域的有效性。除了定量绩效衡量外,定性分析还揭示了一些实例,这些模型与背景相悖,导致偶尔的分类错误,即使有高度信心地作出预测,也导致偶尔的分类。通过分析各种词汇的相似性,我们发现了在事先经过培训的BangBERTRE模型前阶段存在的某些限制,有助于完善社区理解技术的准确性,特别是在不断加深的实地研究中,从而推理算出我们与社区暴力相关的研究中,从而推算出了在社区研究中,从而更精确地解释了中,我们更精确地解释了。我们更精确地理解了这些研究领域中,我们更有助于了这些研究领域中,从而推了这些研究中,我们更精确地解释了了这些基础。
Article 180
Title@2025-06-24 (2): Scaling Speculative Decoding with Lookahead Reasoning
Title: Scaling Speculative Decoding with Lookahead Reasoning | Spekulative Dekodierung mit Blick auf die Vernunft skalieren | 带有 “ 眼前 “ 理由的 投机替代 2506.19830v1 |
Authors (5): Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
Reasoning models excel by generating long chains of thought, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means that allocating more compute to longer token drafts hits an algorithmic ceiling, making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step by step, and each step needs only to be semantically correct, not an exact token match. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
推理模型通过生成长思维链而表现出色, 但解码由此产生的数千个词元很慢。 词元级推测解码(SD)有所帮助, 但其收益存在上限, 因为整个 $\gamma$ 个词元的猜测全部正确的概率会随着 $\gamma$ 的增大而指数下降。 这意味着为更长的词元草稿分配更多算力会遇到算法上限, 使加速幅度有限且与硬件无关。 我们用 Lookahead Reasoning 提高这一上限, 它利用了第二层、步骤级的并行性。 我们的关键洞见是, 推理模型是逐步生成的, 每一步只需语义正确, 而不必逐词元完全匹配。 在 Lookahead Reasoning 中, 轻量级草稿模型提出若干未来步骤; 目标模型在一次批处理中扩展每个提议, 验证器保留语义正确的步骤, 并让目标模型重新生成未通过的步骤。 词元级 SD 仍在每个推理步骤内运行, 因此两层并行性的收益相乘。 我们从理论和实验两方面表明, Lookahead Reasoning 提高了 SD 的峰值加速比。 在 GSM8K、AIME 等基准上, Lookahead Reasoning 将 SD 的加速比从 1.4 倍提高到 2.1 倍, 同时保持答案质量, 并且其加速比随 GPU 吞吐量的增加扩展得更好。 代码见 https://github.com/hao-ai-lab/LookaheadReasoning
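The exponential decay and the resulting ceiling mentioned in the abstract can be made concrete with a short calculation. The sketch below assumes the simplified model used in standard analyses of speculative decoding, where each drafted token is accepted independently with probability `alpha`; that independence assumption and the function names are illustrative and are not taken from this paper.

```python
def p_full_accept(alpha: float, gamma: int) -> float:
    """Probability that all `gamma` drafted tokens are accepted."""
    return alpha ** gamma


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass: the accepted draft
    prefix plus one token supplied by the target model itself."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

Under this model, with `alpha = 0.8` a 4-token draft is fully accepted only about 41% of the time, and no draft length can push the expected tokens per pass beyond `1 / (1 - alpha) = 5`. That bound is the kind of algorithmic ceiling the abstract refers to, and it is why a second, step-level layer of parallelism is attractive.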
Article 181
Title@2025-06-24 (2): Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models
Title: Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models | Bewertung der Einhaltung der Visualisierungsrichtlinien in Diagrammen für wissenschaftliche Publikationen mit großen Visions-Sprachmodellen | 评价使用大愿景语言模型的科学出版物图表中视觉化准则的遵守情况 2506.19825v1 |
Authors (3): Johannes Rückert, Louise Bloch, Christoph M. Friedrich
Diagrams are widely used to visualize data in publications. The research field of data visualization deals with defining principles and guidelines for the creation and use of these diagrams, which are often not known or adhered to by researchers, leading to misinformation caused by providing inaccurate or incomplete information. In this work, large Vision Language Models (VLMs) are used to analyze diagrams in order to identify potential problems with regard to selected data visualization principles and guidelines. To determine the suitability of VLMs for these tasks, five open source VLMs and five prompting strategies are compared using a set of questions derived from selected data visualization guidelines. The results show that the employed VLMs work well to accurately analyze diagram types (F1-score 82.49 %), 3D effects (F1-score 98.55 %), axis labels (F1-score 76.74 %), lines (RMSE 1.16), colors (RMSE 1.60) and legends (F1-score 96.64 %, RMSE 0.70), while they cannot reliably provide feedback about the image quality (F1-score 0.74 %) and tick marks/labels (F1-score 46.13 %). Among the employed VLMs, Qwen2.5VL performs best, and the summarizing prompting strategy performs best for most of the experimental questions. It is shown that VLMs can be used to automatically identify a number of potential issues in diagrams, such as missing axis labels, missing legends, and unnecessary 3D effects. The approach laid out in this work can be extended to further aspects of data visualization.
图表被广泛用于在出版物中可视化数据。数据可视化研究领域致力于为这些图表的创建和使用确定原则和准则,但研究人员往往不了解或不遵循这些原则和准则,从而因提供不准确或不完整的信息而造成误导。在这项工作中,我们使用大型视觉语言模型(VLM)分析图表,以发现其在选定数据可视化原则和准则方面的潜在问题。为了确定VLM是否适合这些任务,我们使用一组源自选定数据可视化准则的问题,比较了五个开源VLM和五种提示策略。结果表明,所用VLM能够较准确地分析图表类型(F1分数82.49%)、3D效果(F1分数98.55%)、坐标轴标签(F1分数76.74%)、线条(RMSE 1.16)、颜色(RMSE 1.60)和图例(F1分数96.64%,RMSE 0.70),但无法可靠地就图像质量(F1分数0.74%)和刻度线/刻度标签(F1分数46.13%)提供反馈。在所用VLM中,Qwen2.5VL表现最好,而总结式提示策略在大多数实验问题上表现最好。结果表明,VLM可用于自动识别图表中的若干潜在问题,例如缺少坐标轴标签、缺少图例以及不必要的3D效果。这项工作提出的方法可以扩展到数据可视化的其他方面。
Article 182
Title@2025-06-24 (2): Entropy and type-token ratio in gigaword corpora
Title: Entropy and type-token ratio in gigaword corpora | Entropie- und Typ-Token-Verhältnis in Gigaword-Korpora | 千兆字词公司中的 英文比率和类型-吨比 2411.10227v3 |
Authors (3): Pablo Rosillo-Rodes, Maxi San Miguel, David Sanchez
There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf and Heaps laws that agrees with our empirical findings.
衡量复杂系统多样性的方式各有不同,特别是在语言方面,词汇多样性的特点是类型对位率和字母对位率。我们在这里调查英语、西班牙语和土耳其语六种大规模语言数据集的多样性指标,包括书籍、新闻文章和推特。这些千篇一律的词团与具有不同形态特征的语文相对应,在登记册和类型上各有不同,从而构成对语言多样性量化方法的不同检验标准。我们揭示了特定主体和语言文本的书本和类型对位率之间的实证功能关系,这是用自然语言观察的统计法的结果。此外,在长长的文字长度的限制下,我们发现这种关系的分析表达方式依赖于Zipf和Heaps两种与我们的经验调查结果一致的法律。
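The two diversity measures named in the abstract have simple textbook definitions, sketched below for a list of word tokens. Whitespace tokenization and the function names are assumptions made for this example; the paper's own preprocessing of gigaword corpora is not described in the abstract.

```python
import math
from collections import Counter
from typing import List


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct word types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical word-frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For the toy sentence `"the cat saw the dog".split()` there are 4 types among 5 tokens, so the type-token ratio is 0.8. Both quantities depend on text length, which is why the abstract studies their relation in the limit of large texts using Zipf's and Heaps' laws.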
Article 183
Title@2025-06-24 (2): KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Title: KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality | KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität | KnowRL:探索知识强化学习促进事实质量 2506.19807v1 |
Authors (5): Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
大型语言模型(LLM),尤其是慢思考模型,常常表现出严重的幻觉,由于在推理过程中无法准确识别知识边界而输出错误内容。虽然强化学习(RL)可以增强复杂推理能力,但其面向结果的奖励机制往往缺乏对思考过程的事实性监督,从而进一步加剧幻觉问题。为了解决慢思考模型中的高幻觉问题,我们提出知识增强的RL,即KnowRL。KnowRL将基于知识验证的事实性奖励融入RL训练过程,引导模型进行基于事实的慢思考,帮助它们识别自身的知识边界。RL训练期间这种有针对性的事实输入使模型能够学习并内化基于事实的推理策略。通过直接奖励推理步骤中对事实的遵循,KnowRL促成了更可靠的思考过程。在三个幻觉评估数据集和两个推理评估数据集上的实验结果表明,KnowRL有效缓解了慢思考模型的幻觉,同时保持了其原有的强大推理能力。代码见 https://github.com/zjunlp/KnowRL
Article 184
Title@2025-06-24 (2): LLM-Based Social Simulations Require a Boundary
Title: LLM-Based Social Simulations Require a Boundary | LLM-basierte soziale Simulationen erfordern eine Grenze | 以LLM为基础的社会模拟需要边界 2506.19806v1 |
Authors (4): Zengqing Wu, Run Peng, Takayuki Ito, Chuan Xiao
This position paper argues that large language model (LLM)-based social simulations should establish clear boundaries to meaningfully contribute to social science research. While LLMs offer promising capabilities for modeling human-like agents compared to traditional agent-based modeling, they face fundamental limitations that constrain their reliability for social pattern discovery. The core issue lies in LLMs’ tendency towards an “average persona” that lacks sufficient behavioral heterogeneity, a critical requirement for simulating complex social dynamics. We examine three key boundary problems: alignment (simulated behaviors matching real-world patterns), consistency (maintaining coherent agent behavior over time), and robustness (reproducibility under varying conditions). We propose heuristic boundaries for determining when LLM-based simulations can reliably advance social science understanding. We believe that these simulations are more valuable when focusing on (1) collective patterns rather than individual trajectories, (2) agent behaviors aligning with real population averages despite limited variance, and (3) proper validation methods available for testing simulation robustness. We provide a practical checklist to guide researchers in determining the appropriate scope and claims for LLM-based social simulations.
这份立场文件认为,基于大语言模式(LLM)的社会模拟应该为对社会科学研究做出有意义的贡献确定明确的界限。虽然LLM公司提供了与传统代理模型相比,在模拟人种制剂方面有希望的能力,但它们面临着限制其社会模式发现可靠性的基本限制。核心问题在于LLM公司倾向于“平均人种”,这种人种缺乏足够的行为异质性,这是模拟复杂社会动态的关键要求。我们研究了三个关键的边界问题:协调(模拟行为与现实世界模式相匹配)、一致性(长期保持连贯的代理行为)和稳健(在不同条件下可减少)。我们提出了确定基于LLMM的模拟何时能可靠地推进社会科学理解的超自然界限。我们认为,这些模拟在侧重于(1)集体模式而不是个人轨迹时更有价值,(2)尽管差异有限,但与实际人口平均数相一致的代理行为;(3)用于测试模拟稳健性的适当验证方法。我们提供了一个实用的核对清单,指导研究人员确定LM公司基于社会模拟的适当范围和索赔范围。
Article 185
Title@2025-06-24 (2): Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Title: Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study | Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie | 开放源码LLMs为何要与数据分析斗争?系统的经验研究 2506.19794v1 |
Authors (10): Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.
大型语言模型(LLMs)在数据分析任务自动化方面很有希望,然而,开放源代码模型在这类推理密集型假设情景中面临重大限制。在这项工作中,我们调查了提高开放源代码LLMs数据分析能力的战略。我们通过整理一套多样、现实的假设情景的种子数据集,评估了三个方面的模型:数据理解、代码生成和战略规划。我们的分析揭示了三个主要结论:(1)战略规划质量是模型绩效的主要决定因素;(2)互动设计和任务复杂性极大地影响推理能力;(3)数据质量显示在实现最佳绩效方面的影响大于多样性。我们利用这些洞见来开发数据综合方法,展示了开放源代码LLMs分析推理能力的重大改进。
Article 186
Title@2025-06-24 (2): Words as Trigger Points in Social Media Discussions: A Large-Scale Case Study about UK Politics on Reddit
Title: Words as Trigger Points in Social Media Discussions: A Large-Scale Case Study about UK Politics on Reddit | Worte als Auslöser Punkte in Social Media Diskussionen: Eine groß angelegte Fallstudie über britische Politik auf Reddit | 作为社会媒体讨论的触发点的词句:关于联合王国重新适用政治的大规模案例研究 2405.10213v3 |
Authors (5): Dimosthenis Antypas, Christian Arnold, Jose Camacho-Collados, Nedjma Ousidhoum, Carla Perez Almendros
Political debates on social media sometimes flare up. From that moment on, users engage much more with one another; their communication is also more emotional and polarised. While it has been difficult to grasp such moments with computational methods, we suggest that trigger points are a useful concept to understand and ultimately model such behaviour. Established in qualitative focus group interviews to understand political polarisation (Mau, Lux, and Westheuser 2023), trigger points represent moments when individuals feel that their understanding of what is fair, normal, or appropriate in society is questioned. In the original studies, individuals show strong and negative emotional responses when certain triggering words or topics are mentioned. Our paper finds that these trigger points also exist in online debates. We examine online deliberations on Reddit between 2020 and 2022 and collect >100 million comments from subreddits related to a set of words identified as trigger points in UK politics. Analysing the comments, we find that trigger words increase user engagement and animosity, i.e., more negativity, hate speech, and controversial comments. Introducing trigger points to computational studies of online communication, our findings are relevant to researchers interested in affective computing, online deliberation, and how citizens debate politics and society in light of affective polarisation.
有关社交媒体的政治辩论有时会突然爆发。从那时起,用户就更多地相互接触;他们的沟通也更加情绪化和两极化。虽然很难用计算方法来理解这些时刻,但我们认为触发点是一个有用的概念,可以理解并最终模拟这种行为。通过质焦点小组访谈建立起来,以理解政治两极化(Maau, Lux, 和Westheuser 2023),触发点代表了个人感到他们对什么是公平的、正常的或适当的社会的理解受到质疑的时刻。在最初的研究中,当提到某些引发言论或话题时,个人表现出强烈和消极的情感反应。我们的论文发现,这些触发点也存在于在线辩论中。我们在2020至2022年期间对Redddit的在线讨论中,从一组被确定为联合王国政治触发点的词汇(Mau, Lux, Lux, 和Westheuser, 2023)中收集了超过1亿次修改的评论。分析这些评论后,我们发现, 触发语言会增加用户的接触和敌意,比如, 更多的消极性,仇恨言论和争议性评论。在计算在线交流时,我们的发现, 我们的发现,对于研究人员对极权极权的争论和极权的争论有影响。
Article 187
Title@2025-06-24 (2): A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models
Title: A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models | Ein grundlegendes individuelles Mobilitätsvorhersagemodell basierend auf Open-Source großen Sprachmodellen | 基于开放源码大语言模式的基础性个人流动预测模型 2503.16553v2 |
Authors (4): Zhenlin Qin, Leizhen Wang, Francisco Camara Pereira, Zhenliang Ma
Large Language Models (LLMs) are widely applied to domain-specific tasks due to their massive general knowledge and remarkable inference capacities. Current studies on LLMs have shown immense potential in applying LLMs to model individual mobility prediction problems. However, most LLM-based mobility prediction models only train on specific datasets or use single well-designed prompts, leading to difficulty in adapting to different cities and users with diverse contexts. To fill these gaps, this paper proposes a unified fine-tuning framework to train a foundational open source LLM-based mobility prediction model. We conducted extensive experiments on six real-world mobility datasets to validate the proposed model. The results showed that the proposed model achieved the best performance in prediction accuracy and transferability over state-of-the-art models based on deep learning and LLMs.
大型语言模型(LLMS)由于其广泛的一般知识和非凡的推论能力,被广泛应用于具体领域的任务。目前对LLMS的研究显示,在应用LLMS模拟个人流动预测问题方面,LLMS具有巨大的潜力。然而,大多数LLM流动预测模型只对具体数据集进行培训,或使用单一设计良好的提示,导致难以适应不同城市和不同背景的用户。为填补这些空白,本文件提议了一个统一的微调框架,以培训一个基础开放源LLM流动预测模型。我们对六个真实世界流动数据集进行了广泛的实验,以验证拟议的模型。结果显示,拟议的模型在预测精确性和可转让性方面,在基于深层次学习和LMS的先进模型方面,取得了最佳的绩效。
Article 188
Title@2025-06-24 (2): Large language models for automated scholarly paper review: A survey
Title: Large language models for automated scholarly paper review: A survey | Große Sprachmodelle für automatisierte wissenschaftliche Papierrezension: Eine Umfrage | 用于自动学术性纸质审查的大型语言模型:调查 2501.10326v2 |
Authors (5): Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, Jialiang Lin
Large language models (LLMs) have significantly impacted human society, influencing various domains. Among them, academia is not simply a domain affected by LLMs, but it is also the pivotal force in the development of LLMs. In academic publication, this phenomenon is reflected in the incorporation of LLMs into the peer review mechanism for reviewing manuscripts. LLMs hold transformative potential for the full-scale implementation of automated scholarly paper review (ASPR), but they also pose new issues and challenges that need to be addressed. In this survey paper, we aim to provide a holistic view of ASPR in the era of LLMs. We begin with a survey to find out which LLMs are used to conduct ASPR. Then, we review what ASPR-related technological bottlenecks have been solved with the incorporation of LLM technology. After that, we move on to explore new methods, new datasets, new source code, and new online systems that come with LLMs for ASPR. Furthermore, we summarize the performance and issues of LLMs in ASPR, and investigate the attitudes and reactions of publishers and academia to ASPR. Lastly, we discuss the challenges and future directions associated with the development of LLMs for ASPR. This survey serves as a reference for researchers and can promote the progress of ASPR toward its actual implementation.
大型语言模型(LLMS)对人类社会产生了重大影响,影响到各个领域,其中,学术界不仅仅是受LLMS影响的领域,也是发展LLMS的关键力量。在学术出版物中,将LLMS纳入同行审议机制以审查手稿时,就体现了这一现象。LLMS具有全面实施自动学术论文审查的变革潜力,但也带来了需要解决的新问题和挑战。在本调查文件中,我们旨在提供ASPRS在LMS时代的整体观点。我们首先进行调查,找出哪些LMS用于开展ASPR。然后,我们审查与ASPR有关的技术瓶颈随着LMM技术的纳入而得到解决。之后,我们着手探索新方法、新数据集、新源代码和与ASPRLMS有关的新的在线系统。此外,我们总结了ASPR在AS的绩效和问题,并调查出版商和学术界对ASPRS的态度和反应。我们讨论了与ASPR相关的挑战和未来方向,并讨论了与MAPR技术相关的挑战和未来方向,以便推进ASPR的发展。
Article 189
Title@2025-06-24 (2): Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
Title: Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Kling-Foley: Multimodaler Diffusionstransformator für hochwertige Video-zu-Audio-Generation | Kling-Foley:高质量视频到视听一代的多式联运变异器 2506.19774v1 |
Authors (23): Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, to compensate for the incomplete types and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment, and audio quality.
我们提议Kling-Foley,这是一个大型多式联运视频到Audio生成模型,将高质量的音频同步与视频内容合成。在Kling-Foley中,我们引入了多式联运扩散变异器,以模拟视频、音频和文本模式之间的互动,并与视觉语义表达模块和视听同步模块相结合,以提高一致性能力。具体地说,这些模块将视频条件与框架层面的潜在音频元素相匹配,从而改善语义一致性和视听同步性。这一综合方法与文本条件一起,使得能够精确生成视频匹配的音效效应。此外,我们提出了一种通用的隐性视听编码器,可以在声音效应、语音、歌唱和音乐等各种情景下实现高质量的建模。我们采用了一种立体法,将音频与空间存在相结合。同时,为了弥补开放源基准的不完整类型和描述,我们还将一个工业级基准Kling-Audio-Eval开源。我们的实验显示,Kling-Foley在公共模型中经过了培训,与流动目标相匹配的音频质量分配模式相匹配,实现了新音频-视觉术语和视听术语校准。
Article 190
Title@2025-06-24 (2): SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Title: SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning | SRFT: Einstufige Methode mit überwachter und verstärkter Feinsteuerung für die Vernunft | SRFT: 单一标准方法,以监督和加固为理由的罚款 2506.19767v1 |
Authors (10): Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
大型语言模型(LLMs)在推理任务方面取得了显著进展,然而,将监督的精细调试(SFT)和强化学习(RL)的最佳整合仍然是一个根本性的挑战。通过全面分析象征分布、学习动态和基于英特罗比的整合机制,我们揭示了这些范式之间的关键差异:SFT诱发全球对LLM政策分布的粗微变化,而RL则进行细微的选择性优化,以英特普作为关键的培训效果指标。我们基于这些观察,提议了Survised Engement Final-Turning(SRFT),这是一种单阶段方法,既通过读作觉加权机制统一微调模式,又统一了精细调模式。我们的方法同时运用SFT和RL,利用演示和自我开发推出的推出直接优化LMM,而不是通过两阶段的顺序方法。广泛的实验显示,SRFT达到59.1%的平均准确率,比零-RL方法高出9.0%的5项数学推理基准和10.9 %。
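The abstract above describes combining an SFT loss and an RL loss in a single stage through entropy-aware weighting. The following is a minimal sketch of that general idea only. The weighting function `exp(-H)`, the function names, and the choice to leave the RL term unweighted are all hypothetical illustrations and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(token_dists):
    """Average entropy over the per-token distributions of a sampled response."""
    return sum(entropy(d) for d in token_dists) / len(token_dists)


def combined_loss(sft_loss, rl_loss, token_dists):
    """Single-stage objective: an entropy-weighted SFT term plus an RL term.

    Hypothetical weighting (an assumption, not the paper's formula):
    w_sft = exp(-H), so demonstrations pull less strongly when the policy
    is already very uncertain. In a real training loop the weight would be
    treated as a constant (no gradient flows through it).
    """
    w_sft = math.exp(-mean_entropy(token_dists))
    return w_sft * sft_loss + rl_loss, w_sft


# Example: a confident policy versus a maximally uncertain one over 4 tokens.
confident = [[1.0, 0.0, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
loss_confident, w_confident = combined_loss(2.0, 1.0, confident)
loss_uncertain, w_uncertain = combined_loss(2.0, 1.0, uncertain)
```

With this illustrative weighting, the SFT term contributes fully when the policy is confident and is scaled down to a quarter of its value when the policy is uniform over four tokens.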
Article 191
Title@2025-06-24 (2): Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
Title: Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation | Sensitive Inhaltsklassifikation in den sozialen Medien: Eine ganzheitliche Ressource und Bewertung | 社会媒体中的敏感内容分类:综合资源和评价 2411.19832v3 |
Authors (5): Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in detecting other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous focalised research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which underperform by 10-15% overall. This limitation is even more pronounced on popular moderation APIs, which cannot be easily tailored to specific sensitive content categories, among others.
大型数据集中敏感内容的检测对于确保共享和分析的数据不受有害材料的影响至关重要,然而,目前的节制工具,如外部API,在定制、不同敏感类别准确性和隐私关切方面受到限制,而且,现有数据集和开放源模式主要侧重于有毒语言,在检测药物滥用或自我伤害等其他敏感类别方面留下空白。在本文件中,我们提出了一个统一的数据集,专门针对社会媒体中温的六类敏感内容:冲突语言、亵渎、性明确材料、毒品相关内容、自我伤害和垃圾邮件。通过以一致的检索战略和准则收集和说明数据,我们解决了先前重点研究的缺点。我们的分析表明,对这一新数据集的大型语言模型(LLMM)的微调,与LLAMAMA等开放的现成模型相比,甚至自制开放型开放型的OpenAI模型(总体达10-15%)的性能显著改善。这一限制更明显地表现于大众的调控性API,因为后者不易适应特定敏感内容类别。
Article 192
Title@2025-06-24 (2): Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR
Title: Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR | Genau, schnell, günstig: Drei auswählen. Multi-Head-Achtung mit bidirektionaler Wiederholungs-Achtung für langformige ASR ersetzen | 准确、快速、廉价:选择 3 : 选择 3 。 以双向的经常关注取代多头保护 。 2506.19761v1 |
Authors (3): Martin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox
Long-form speech recognition is an application area of increasing research focus. ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. We build on recent work that has investigated linear complexity recurrent attention (RA) layers for ASR. We find that bidirectional RA layers can match the accuracy of MHA for both short- and long-form applications. We present a strong limited-context attention (LCA) baseline, and show that RA layers are just as accurate while being more efficient. We develop a long-form training paradigm which further improves RA performance, leading to better accuracy than LCA with 44% higher throughput. We also present Direction Dropout, a novel regularization method that improves accuracy, provides fine-grained control of the accuracy/throughput trade-off of bidirectional RA, and enables a new alternating directions decoding mode with even higher throughput.
长式语音识别是一个应用领域,其研究重点日益突出。基于多端关注(MHA)的ASR模型不适合长式 ASR模型,因为它们在序列长度上具有二次复杂性。我们以最近对ASR的线性复杂反复关注(RA)层进行调查的工作为基础。我们发现双向RA层可以与MHA的精度匹配短式和长式应用。我们提出了强有力的有限文本关注基线,并表明RA的层次在效率更高的同时也同样准确。我们开发了一个长式培训模式,进一步提高RA的性能,导致比LCA的精度更高的通过量44%。我们还介绍了“Direct Drout”这一新的正规化方法,它提高了精度,为双向RA的精度/吞入交易提供了精细细的控制权,并使得新的交替方向解码模式能够达到更高的吞出量。
Article 193
Title@2025-06-24 (2): Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
Title: Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis | Arabische Dialektklassifikation mit RNNs, Transformern und großen Sprachmodellen: Eine vergleichende Analyse | 使用RNN、变换器和大语言模式的阿拉伯语方言分类:比较分析 2506.19753v1 |
Authors (4): Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa, Wael F. Elsersy
The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users’ dialects, social media monitoring, and greater accessibility for Arabic communities.
阿拉伯语是世界上最受欢迎的语言之一,在22个国家使用各种各样的方言。在本研究报告中,我们探讨了对QADI阿拉伯文推特数据集中的18种阿拉伯语方言进行分类的问题。通过快速工程,创建和测试了RNN模型、变换模型和大型语言模型(LLMs),其中MARBERTv2表现最佳,精度达到65%,F1-score为64%。通过使用先进的预处理技术和最新的NLP模型,本文确定了阿拉伯语方言识别中最重要的语言问题。结果证实了个人化聊天机等应用,这些应用在用户方言、社交媒体监控和阿拉伯语社区更大程度的无障碍性方面有所反应。
Article 194
Title@2025-06-24 (2): “I know myself better, but not really greatly”: How Well Can LLMs Detect and Explain LLM-Generated Texts?
Title: “I know myself better, but not really greatly”: How Well Can LLMs Detect and Explain LLM-Generated Texts? | “Ich kenne mich selbst besser, aber nicht wirklich sehr gut”: Wie gut können LLMs LLM-generierte Texte erkennen und erklären? | “我更了解自己,但并不十分了解”:“LLMs”如何能探测和解释LLM创世纪的文字? 2502.12743v2 |
Authors (9): Jiazhou Ji, Jie Guo, Weidong Qiu, Zheng Huang, Yang Xu, Xinru Lu, Xiaoyu Jiang, Ruizhe Li, Shujun Li
Distinguishing between human- and LLM-generated texts is crucial given the risks associated with misuse of LLMs. This paper investigates detection and explanation capabilities of current LLMs across two settings: binary (human vs. LLM-generated) and ternary classification (including an “undecided” class). We evaluate 6 closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs), though both remain suboptimal. Introducing a ternary classification framework improves both detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses using our human-annotated dataset, we identify key explanation failures, primarily reliance on inaccurate features, hallucinations, and flawed reasoning. Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
鉴于与滥用LLMs有关的风险,区分人与LLM制成的文本至关重要。本文件调查当前LLMs在两种情况下的探测和解释能力:二进制(人与LLM制成)和永恒分类(包括“未定型”类)。我们评估了6个大小不等的近源和开源的LMs,发现自我探测(LLMs确定自己的产出)始终优于交叉探测(确定来自其他LMs的产出),尽管两者都仍然不够理想。引入长期分类框架可以提高所有模型的检测准确性和解释质量。我们通过使用人工附加说明数据集进行全面的定量和定性分析,找出关键的解释失败,主要是依赖不准确的特征、幻觉和错误的推理。我们的调查结果强调了当前LLMs在自我探测和自我勘测中的局限性,并强调需要进一步研究,以解决过分匹配和增强通用性的问题。
Article 195
Title@2025-06-24 (2): NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking
Title: NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking | NEAR$^2$: Ein verschachtelter Einbettungsansatz für effizientes Produkt-Retrieval und Ranking | 2美元:高效产品回收和排序的内嵌嵌入方法 2506.19743v1 |
Authors (7): Shenbin Qian, Diptesh Kanojia, Samarth Agrawal, Hadeel Saadany, Swapnil Bhosale, Constantin Orasan, Zhe Wu
E-commerce information retrieval (IR) systems struggle to simultaneously achieve high accuracy in interpreting complex user queries and maintain efficient processing of vast product catalogs. The dual challenge lies in precisely matching user intent with relevant products while managing the computational demands of real-time search across massive inventories. In this paper, we propose a Nested Embedding Approach to product Retrieval and Ranking, called NEAR$^2$, which can achieve up to $12$ times efficiency in embedding size at inference time while introducing no extra cost in training and improving performance in accuracy for various encoder-based Transformer models. We validate our approach using different loss functions for the retrieval and ranking task, including multiple negative ranking loss and online contrastive loss, on four different test sets with various IR challenges such as short and implicit queries. Our approach achieves an improved performance over a smaller embedding dimension, compared to any existing models.
电子商务信息检索系统力求在解释复杂的用户查询时同时达到高度准确性,并保持对大量产品目录的高效处理。双重挑战在于精确地将用户意向与有关产品相匹配,同时管理大量库存的实时搜索的计算需求。在本文中,我们建议对产品检索和排序采用内嵌嵌式方法,称为NEAR$2美元,在推断时间嵌入规模方面达到最高12美元的两倍效率,同时在培训方面不增加额外费用,提高各种以编码器为基础的变异器模型的准确性。我们验证了我们使用不同损失功能进行检索和排序的方法,包括多重负排位损失和在线对比损失,在四种不同的测试组合中,面临诸如短期和隐含性查询等各种IR挑战。我们的方法在比任何现有模型更小的嵌入层面取得更好的业绩。
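The abstract above describes nested embeddings that stay useful when cut to a smaller size at inference time. A minimal sketch of the inference-side idea follows, assuming the model was trained so that the leading dimensions of each vector carry the most information. The function names and toy vectors are illustrative and not from the paper.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, docs, dim):
    """Return document indices sorted from most to least similar at size `dim`."""
    q = truncate_and_normalize(query, dim)
    scored = [
        (cosine(q, truncate_and_normalize(d, dim)), i) for i, d in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy 4-dimensional embeddings (made-up numbers for illustration only).
query = [1.0, 0.0, 0.2, 0.1]
docs = [
    [0.9, 0.1, 0.0, 0.3],
    [0.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 0.1, 0.1],
]
full_ranking = rank(query, docs, dim=4)
small_ranking = rank(query, docs, dim=2)
```

In this toy case the ranking from the 2-dimensional prefix matches the ranking from the full vector, which is the behaviour nested training aims for. Whether it holds on real data depends on the trained model.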
Article 196
Title@2025-06-24 (2): Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
Title: Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? | Breaking Barriers: Gewinnt die Verstärkung von Posttrainings die Übertragung auf ungesehene Domains? | 突破障碍:加强培训后收益是否转移到未知领域? 2506.19733v1 |
Authors (7): Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit, Jason Benn, Daniel Kang
Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.
强化后培训(RPT)最近在提高大型语言模型(LLMs)的推理能力方面显示出了希望,然而,这些改进在推广到新领域方面仍然有多么出色,因为先前的工作评价了RPT关于用于微调的同一领域数据的模型。为了理解RPT的可概括性,我们进行了两项研究。 (1)观察:我们比较了多种领域、包括其微调数据的可见和看不见领域的各种开放的RPT模型与其相应的基准模型。(2)干预:我们在单一领域与RPT一道微调LMs和RPT,并评价其跨多个领域的绩效。这两项研究都得出相同的结论,即尽管RPT在与微调数据相似的任务上带来大量收益,但所取得的收益却前后不一致,并且可能以不同推理模式消失在领域。
Article 197
Title@2025-06-24 (2): jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Title: jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval | jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit | jina-embeddings-v4:多语种多式联运回收通用嵌入式 2506.18902v2 |
Authors (11): Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
我们引入了jina-embeeddings-v4, 一个38亿元参数多式嵌入模型,通过支持单一矢量和多矢量嵌入晚期互动风格的新结构,统一文本和图像表达形式,该模型包含特定任务低兰克适应(LORA)适应器,以优化不同检索情景的性能,包括查询文件检索、语义文本相似性和代码搜索。 全面评估表明,jina-emeddings-v4在单式和跨式检索任务上都取得了最先进的性能,在处理表、图表、图表和混合媒体格式等视觉丰富内容方面特别有力量。为了便于评估这一能力,我们还引入了Jina-VDR,这是专门为视觉丰富图像检索设计的新基准。
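The abstract above mentions multi-vector embeddings "in the late interaction style". A common form of late interaction scoring is the MaxSim rule popularized by ColBERT: each query vector is matched to its best document vector and the matches are summed. The sketch below shows that rule under the assumption that it applies here. The abstract does not spell out the exact scoring function, and the names and toy vectors are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction score: for each query vector take its best match
    among the document vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy example: two query token vectors, two candidate documents.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query vectors
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query vector
score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
```

A single-vector embedding would instead score each document with one dot product. The multi-vector form costs more storage and compute but lets individual query tokens find their own best match.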
Article 198
Title@2025-06-24 (2): Detecting Machine-Generated Texts: Not Just “AI vs Humans” and Explainability is Complicated
Title: Detecting Machine-Generated Texts: Not Just “AI vs Humans” and Explainability is Complicated | Maschinengenerierte Texte erkennen: Nicht nur “AI vs Menschen” und Erklärbarkeit ist kompliziert | 检测机器生成的文字: 不只是“ AI 和 人类 ” , 解释复杂 2406.18259v2 |
Authors (9): Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, Xinru Lu
As LLMs rapidly advance, concerns are growing about the actual authorship of texts we see online and in the real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of treating LLM-generated text detection as a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an “undecided” category for texts that could be attributed to either source, and we show that this new category is crucial to understanding how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing the need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprising texts from various LLMs and human authors. Based on the new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in the new ternary classification context. Our results highlight why the “undecided” category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of the explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
随着LLMS的快速进步,人们对实际编写我们在线和现实世界所见文本的风险日益感到担忧。区分LLM所编写的文本的任务由于机器和人类的细微和重叠行为而变得复杂。在本文中,我们质疑目前的做法,即考虑LLM产生的文本检测二进分类任务,将人类与AI区分开来。相反,我们引入了一个新颖的永恒文本分类办法,增加一个可以归责于任一来源的文本的“未确定”类别,我们表明,这一新类别对于理解如何使检测结果更便于向非专业用户解释。这一研究将模式从仅仅分类到解释机器产生的文本,强调需要探测器向用户提供清楚和可理解的解释。我们的研究涉及创建四个新的数据集,由各种LLMS和人类作者的文本组成。根据新的数据集,我们进行了二进制系统最有效的SOTA检测方法,并确定了能够产生更清晰到更精确的解析的文本。我们从两个顶级的LMSODMs和人类的高级作者们提出了新的解释。我们所使用的最新数据,我们用三个高级的解算方法来了一个新的解的“最精确的、最精确的解析的解算结果”
Article 199
Title@2025-06-24 (2): Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Title: Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving | Lokale Look-Ahead-Anleitung über Verifier-in-the-Loop für automatisierte Theorem-Proving | 通过自动理论验证人在线验证人进行自动理论验证,指导当地目视中心 2503.09730v2 |
Authors (4): Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi
The most promising recent methods for AI reasoning require applying variants of reinforcement learning (RL) either on rolled out trajectories from the LLMs, even for the step-wise rewards, or large quantities of human-annotated trajectory data. The reliance on the rolled-out trajectory renders the compute cost and time prohibitively high. In particular, the correctness of a reasoning trajectory can typically only be judged at its completion, leading to sparse rewards in RL or requiring expensive synthetic data generation in expert iteration-like methods. In this work, we focus on the Automatic Theorem Proving (ATP) task and propose a novel verifier-in-the-loop design, which, unlike existing approaches that leverage feedback on the entire reasoning trajectory, employs an automated verifier to give intermediate feedback at each step of the reasoning process. Using Lean as the verifier, we empirically show that the step-by-step local verification produces a global improvement in the model’s reasoning accuracy and efficiency.
最有希望的大赦国际最新推理方法要求应用各种强化学习(RL)的变体,要么在从LLMS推出的轨迹上应用(RL),即使是为了逐步的奖励,要么是为了大量附带人文的轨迹数据。对滚出轨迹的依赖使得计算成本和时间高得令人望而却步。特别是,推理轨的正确性通常只能在其完成时才能判断,导致RL的奖赏微乎其微,或者要求以专家迭代方法生成昂贵的合成数据。在这项工作中,我们侧重于自动理论验证(ATP)的任务,并提出新的“滚动验证器”设计,这与现有利用整个推理轨反馈的方法不同,使用自动验证器在推理过程的每一步都提供中间反馈。我们用Lean作为验证器,经验显示,逐步的地方核查使模型推理准确性和效率得到全球改进。
Article 200
Title@2025-06-24 (2): Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Title: Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models | Ausreißer-sicheres Pre-Training für robuste 4-Bit Quantisierung großer Sprachmodelle | 大语言模式强力四比四比四的量化培训前培训 2506.19697v1 |
Authors (5): Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
大型语言模型(LLMS)的极端激活异常值严重地降低了量化绩效,阻碍了高效的配置。尽管频道操作和适应性梯度缩放规模是公认的原因,但实际的缓解仍然具有挑战性。我们引入了外部安全预培训(OSP)这一实用指南,积极主动地防止超值形成,而不是依赖热后缓解。OSP结合了三项关键创新:(1) Muon优化,消除特权基础,同时保持培训效率;(2) 单一规模RMSNARMNorm,防止频道错位扩展;(3) 可学习的嵌入投影,重新分配嵌入矩阵产生的启动量。我们通过培训1.4B参数模型来验证OSSP,这是第一个在没有这种外端的情况下培训的大规模LMM。在4位平方位前,我们的OSP模型在10个基准中平均得35.7分(而亚当培训模式为26.5分),只有2%的培训起点。值得注意的是,OSP模型在嵌入矩阵中展示接近零超端的启动量级启动量。04/404,而我们内部的IMS-ral-real-restration 战略(1856)比我们基本地展示了O-ral-ral-ral-ral-reval-ral-ral-ral-ral-ral-ral-ral-restration)的标准结果。
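The abstract above reports excess kurtosis as its outlier measure (0.04 versus 1818.56). As a reminder of what that statistic captures, here is a small sketch of the standard population excess kurtosis, the fourth standardized moment minus 3. How the paper aggregates activations before computing it is not stated in the abstract, so the flat-list input here is an assumption.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3.

    Close to 0 for Gaussian-like data and very large when a few
    extreme outliers dominate the fourth moment.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


# Toy activations: a flat two-valued signal versus one extreme outlier.
flat = [-1.0, 1.0] * 50
with_outlier = [0.0] * 99 + [100.0]
k_flat = excess_kurtosis(flat)
k_outlier = excess_kurtosis(with_outlier)
```

A single extreme value among 100 activations already pushes the statistic far above zero, which is why heavy-tailed activations are hard to cover with the few levels available in 4-bit quantization.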
Article 201
Title@2025-06-24 (2): Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
Title: Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation | Recurrent Visual Feature Extraction und Stereo-Warnungen für CT-Report-Generierung | 为编写CT报告提供经常性的《CT报告》经常性《视觉特征、采掘和立体关注》 2506.19665v1 |
Authors (3): Yuanhe Tian, Lei Mao, Yan Song
Generating reports for computed tomography (CT) images is a challenging task that, while similar to existing work on medical image report generation, has unique characteristics such as the spatial encoding of multiple images and the alignment between image volume and text. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume: they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, in this paper we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.
计算断层成像(CT)图像的生成报告是一项艰巨的任务,虽然与医学成像报告生成的现有研究相似,但具有其独特性,例如多图像的空间编码、图像量和文本的校正等。 现有解决方案通常使用一般 2D 或 3D 图像处理技术从CT 音量中提取特征,首先压缩音量,然后将压缩的CT 切片分割成可视编码的补丁。这些方法没有明确计算CT 切片的变异,也没有有效地整合多级图像特征,特别是含有特定器官变异的图像,以指导CT 报告生成(CTRG)。在考虑CT 扫描中连续切片之间的紧密关联性时,在本文件中,我们建议基于 CTRG 方法的大型语言模型(LLM) , 其反复的视觉特征提取和立体注意用于等级特征建模。 具体地说,我们使用视觉变形器对每个切片的经常过程进行计算,并且从不同角度对编码的切片片段进行一套关注,有选择地获得重要的直观信息资料,并将它们与文本基值调整为基准分析。 因此,我们用实验模型显示了磁模模型,从而更精确地展示了C- C- C- C- C- 显示基准结果。
Article 202
Title@2025-06-24 (2): Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
Title: Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager | Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager | 超出LLLM 的定制对话:基于 RL 的对话管理器 2506.19652v1 |
Authors (3): Lucie Galland, Catherine Pelachaud, Florian Pecune
In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to enhance adaptability across diverse user profiles, our approach improves adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.
在这项工作中,我们提出了一个新框架,将大型语言模式(LLMs)与基于RL的对话管理者融合在一起,以便进行开放式对话,并有一个具体目标。 通过利用等级强化学习,模拟分阶段对话,并利用元学习,提高不同用户的适应性,我们的方法提高了适应性和效率,使系统能够从有限的数据中学习,在对话阶段之间流畅地过渡,以及针对不同病人的需要作出个性化反应。 我们将我们的框架应用于动机性访谈,目的是促进行为变化,并表明拟议的对话管理者在奖励方面超过了最先进的LLM基线,显示了有条件LMs在创建具有具体目标的开放式对话系统方面的潜在好处。
Article 203
Title@2025-06-24 (2): Language Model Re-rankers are Fooled by Lexical Similarities
Title: Language Model Re-rankers are Fooled by Lexical Similarities | Sprachmodell-Reranker werden durch Lexikalische Ähnlichkeiten ausgeblendet | 语言模式重新排名者被古典相似性所愚弄 2502.17036v2 |
Authors (6): Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
语言模型(LM) 重新排序器用于改进检索增强的一代(RAG) 的检索结果。 它们比BM25这样的词汇匹配方法更昂贵,但被认为可以更好地处理语义信息以及查询和检索到的答案之间的关系。 要了解LM重新排序器是否总是符合这一假设,我们就对NQ、LitQA2和DRUID数据集上的6个不同的LM重新排序器进行了评估。 我们的结果表明,LM重新排序器很难在DRUID上超过一个简单的BM25基线。 利用基于BM25分的新的分离指标,我们解释并找出来自不同词汇的重新排序错误。 我们还调查了不同的方法来改进LM重新排序器的性能,并发现这些方法主要对NQ有用。 一起,我们的工作查明并解释了LM重新排序器的弱点,并指出需要更多对抗性和现实的数据来进行评估。
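The abstract above uses BM25 both as the baseline and as the basis of its separation metric. For readers unfamiliar with it, here is a small sketch of the standard Okapi BM25 scoring formula with the non-negative IDF variant. The whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the toy documents are assumptions for illustration. The paper's separation metric itself is not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Uses naive lowercase whitespace tokenization (an assumption made
    to keep the sketch short).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "a feline rested on the rug",
    "stock prices fell sharply",
]
scores = bm25_scores("cat mat", docs)
```

The second document paraphrases the first but shares no query terms, so BM25 gives it a score of zero. This is the kind of lexical dissimilarity the abstract refers to when it compares re-ranker behaviour against BM25 scores.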
Article 204
Title@2025-06-24 (2): Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning
Title: Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning | Recht ist nicht genug: Die Pitfalls von Outcome Supervision in der Ausbildung LLMs for Math Reasoning | 权利还是不够的:数学原因培训优等生培训成果监督的空洞 2506.06877v2 |
Authors (7): Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, Tongshan Xu, Lun Du, Da Zheng, Zengfeng Huang
Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs’ answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path towards evaluating and training LLMs with genuine mathematical reasoning.
成果优异的大型语言模型(LLMS)在数学解决问题方面取得了显著成功。然而,这一成功往往掩盖了一个关键问题:模型往往通过根本不健全的推理过程(一种显示黑客奖励现象的现象)获得正确答案。我们引入了MathOlympiadEval,这是一个带有细微批注的新数据集,揭示了LLMS的答案正确性与其低流程正确性之间的巨大差距。LLM-as-a-judge等现有自动化方法在可靠地发现这些推理缺陷方面挣扎着。为了解决这个问题,我们建议ParaStepVeroration(ParaStepVeroration),这是对数学解决方案进行细致、逐步核查的新方法。ParaStepVergier确定了错误的推理步骤。经验显示,ParaStepVeration与基线相比,特别是在复杂、多步问题方面,大大改进了找出有缺陷的解决办法的准确性。这为以真正的数学推理,评估和训练LMs提供了更加有力的途径。
Article 205
Title@2025-06-24 (2): Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
Title: Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge | Halluzinationen in News-Zusammenfassungen korrigieren: Erforschung von selbstkorrekten LLM-Methoden mit externem Wissen | 新闻摘要:探索利用外部知识自行校正的LLM方法 2506.19607v1 |
Authors (3): Juraj Vladika, Ihsan Soydemir, Florian Matthes
While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from hallucinations, that is, factually inaccurate statements. Among the numerous approaches to tackling hallucinations, self-correcting methods are especially promising. They leverage the multi-turn nature of LLMs to iteratively generate verification questions that seek additional evidence, answer them with internal or external knowledge, and use the answers to refine the original response. These methods have been explored for encyclopedic generation, but less so for domains like news summarization. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries using evidence from three search engines. We analyze the results and provide insights into the systems’ performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as high alignment between G-Eval and human evaluation.
尽管大型语言模型(LLMS)表现出了生成一致文本的非凡能力,但它们却遭遇了幻觉问题 – – 事实上不准确的陈述。在解决幻觉的众多方法中,最有希望的方法是自我纠正方法。它们利用LLMs的多转性质来反复生成核查问题,以获取更多证据,以内部和外部知识回答这些问题,并利用新校正来完善最初的应对方法。这些方法是为百科全书的生成而探索的,但对于诸如新闻总结这样的领域则不那么有用。在这项工作中,我们用三个搜索引擎的证据来应用两个最先进的自我纠正系统来纠正幻觉摘要。我们分析了结果,并提供了对系统性能的洞察,揭示了搜索引擎断块和几发提示的有趣实际发现,以及G-Eval和人文评估高度一致。
Article 206
Title@2025-06-24 (2): Social Hatred: Efficient Multimodal Detection of Hatemongers
Title: Social Hatred: Efficient Multimodal Detection of Hatemongers | Sozialer Hass: Effiziente multimodale Erkennung von Hatemongern | 社会仇恨:以多种方式高效率地以多种方式探测仇恨者 2506.19603v1 |
Authors (3): Tom Marzea, Abraham Israeli, Oren Tsur
Automatic detection of online hate speech serves as a crucial step in the detoxification of online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is equally important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets, X (Twitter), Gab, and Parler, we show that processing a user’s texts in her social context significantly improves the detection of hate-mongers compared to previously used text- and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings, as well as a qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gaslighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.
网上仇恨言论的自动检测是解毒的关键一步。此外,准确的分类可以促进更好地理解仇恨作为一种社会现象的扩散。虽然大多数先前的工作侧重于发现仇恨言论,但我们认为,侧重于用户层面同样重要,尽管具有挑战性。在本文中,我们考虑到潜在的仇恨文本、用户活动和用户网络,认为采用多式分类方法来检测仇恨分子。在三个独特的数据集X(Twitter)、Gab和Parler上评估我们的方法,我们表明,处理用户在她社会背景下的文本可大大改进对仇恨词的检测,而以往使用的文本和图表方法则不同。我们提供了在不同实验环境中取得的一整套结果以及对示例案例的定性分析。我们的方法可以用来改进编码信息的分类、警犬式和种族性燃气活动,以及通报干预措施。此外,我们还表明,我们的多式方法在非常不同的内容平台和大型数据集和网络上运作。
Article 207
Title@2025-06-24 (2): PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics
Title: PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics | PATCH! Psychometrics-AssisTed BenCHmarking großer Sprachmodelle gegenüber menschlichen Populationen: Eine Fallstudie zur Kompetenz in der Mathematik der 8. Klasse | PATCH!面向人类群体的大语言模型心理测量学辅助基准评测:八年级数学能力案例研究 2404.01799v3 |
Authors (3): Qixiang Fang, Daniel L. Oberski, Dong Nguyen
Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs’ academic proficiency, often also with an interest in comparing model performance with that of human test takers. While such benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to in a reliable way?), lack of quality assessment on the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose integrating knowledge from psychometrics – a field dedicated to the measurement of latent variables like academic proficiency – into LLM benchmarking. We make four primary contributions. First, we reflect on current LLM benchmark developments and contrast them with psychometrics-based test development. Second, we introduce PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH enables valid comparison between LLMs and human populations. Third, we demonstrate PATCH by measuring several LLMs’ proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on current benchmarking practices. Fourth, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science with human populations.
大型(多式)语言模型(LLMS)的许多现有基准侧重于衡量LLM的学术熟练程度,往往也有兴趣将模型业绩与人类测试者进行比较。虽然这些基准已证明是LLM发展LM的关键,但它们受到若干限制,包括测量质量有问题(例如,它们是否以可靠的方式衡量它们应当达到的? ),项目水平缺乏质量评估(例如,有些项目比其他项目更重要或困难?),以及人口参考不明确(例如,模型可以与谁比较? )。 为了应对这些挑战,我们建议利用心理计量学的知识 – – 即专门衡量学术熟练程度等潜在变量的领域 – – 来制定LLMM基准。我们作出四项主要贡献。首先,我们思考目前LMM基准的发展,将其与基于心理计量的测试发展进行比较。第二,我们引入PATCHT:关于{偏差-基于第四级基准的{A}A}标准的新框架 {TH}{BN}}} 标注LMs。 PRTCH针对上述局限性。具体说,PTCH能够将目前水平数据与我们采用的若干人/LMSLMs之间的数据进行比较。
Article 208
Title@2025-06-24 (2): Large Language Models as Span Annotators
Title: Large Language Models as Span Annotators | Große Sprachmodelle als Span-Annotatoren | 大语言模型作为 Span 标注器 2504.08697v2 |
Authors (10): Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
Span annotation is the task of localizing and classifying text spans according to custom guidelines. Annotated spans can be used to analyze and evaluate high-quality texts for which single-score metrics fail to provide actionable feedback. Until recently, span annotation was limited to human annotators or fine-tuned models. In this study, we show that large language models (LLMs) can serve as flexible and cost-effective span annotation backbones. To demonstrate their utility, we compare LLMs to skilled human annotators on three diverse span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA) comparable to human annotators at a fraction of the cost per output annotation. We also manually analyze model outputs, finding that LLMs make errors at a similar rate to human annotators. We release the dataset of more than 40k model and human annotations for further research.
批注是按自定义指南对文本进行本地化和分类的任务。可使用附加说明的篇幅分析和评价高质量文本,而单数指标无法提供可操作的反馈。直到最近,批注范围仅限于人类批注员或微调模型。在本研究中,我们显示大型语言模型(LLMS)可以作为灵活和具有成本效益的批注主干线。为了展示其效用,我们将LLMS与熟练的人类批注员进行了比较,这三种不同的批注任务是:评估数据对文本的生成,识别翻译错误,以及探测宣传技术。我们证明LMS实现了与人类批注员的类比协议(IAAA),其成本与每份产出的批次相当。我们还手动分析模型输出,发现LMSMs以与人类批注员相似的速度做出错误。我们公布了40多千个模型和人文说明的数据集,供进一步研究。
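To make the inter-annotator agreement comparison above more concrete, here is a minimal sketch of a character-level overlap score between two annotators' spans. It is an illustration only and not necessarily the agreement measure used in the paper: the `(start, end, label)` span format, the helper names and the example spans are all assumptions introduced here.

```python
from typing import List, Set, Tuple

# Hypothetical span format: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def labelled_chars(spans: List[Span]) -> Set[Tuple[int, str]]:
    """Expand spans into a set of (character index, label) pairs."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}


def span_f1(a: List[Span], b: List[Span]) -> float:
    """Character-level F1 between two annotators; symmetric in a and b."""
    chars_a, chars_b = labelled_chars(a), labelled_chars(b)
    if not chars_a and not chars_b:
        return 1.0  # both annotators marked nothing: treat as full agreement
    overlap = len(chars_a & chars_b)
    return 2 * overlap / (len(chars_a) + len(chars_b))


# Toy annotations of the same text (invented for illustration).
human = [(0, 10, "error"), (20, 25, "propaganda")]
model = [(5, 10, "error"), (20, 25, "propaganda")]
print(span_f1(human, model))
```

Scoring a human against a second human with the same function gives a reference level against which a model-versus-human score can be read.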
Article 209
Title@2025-06-24 (2): ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model
Title: ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model | ECCOT: Ein Rahmen für die Verbesserung der effektiven Kognition durch Gedankenkette im großen Sprachmodell | ECCT:通过高语言思维链加强有效认知的框架 2506.19599v1 |
Authors (5): Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang
In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.
在大规模人工智能时代,大语言模型(LLM)在自然语言处理方面取得了长足进步。然而,它们往往缺乏透明度,并会产生不可靠的输出,引发了人们对其可解释性的担忧。为此,思维链(CoT)提示方法将推理组织为逐步推导。但并非所有推理链都是有效的,其中的错误可能导致不可靠的结论。我们提出ECCoT,一个端到端的认知思维链验证框架,用于评估和改进LLM中的推理链。ECCoT整合了用于主题感知CoT生成的马尔可夫随机场嵌入主题模型(MRF-ETM)和用于因果推理对齐的因果Sentence-BERT(CSBert)。通过利用结构化排序统计量过滤无效推理链,ECCoT提高了可解释性,减少了偏差,并增强了基于LLM的决策的可信度。主要贡献包括提出ECCoT、用于主题驱动CoT生成的MRF-ETM,以及用于增强因果推理的CSBert。代码发布于:https://github.com/erwinmsmith/ECCoT.git。
Article 210
Title@2025-06-24 (2): ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
Title: ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation | ConciseHint: Effiziente Reasonierung durch kontinuierliche ConciseHints während der Generation | 简明提示: 生成期间通过连续压缩提示促进高效理性 2506.18810v2 |
Authors (4): Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
Recent large reasoning models (LRMs) such as DeepSeek-R1 and the OpenAI o1 series have achieved notable performance gains on complex reasoning tasks by scaling up the generation length with Chain-of-Thought (CoT). However, an emerging issue is their tendency to produce excessively verbose reasoning, which makes them inefficient. Existing literature on improving efficiency mainly adheres to before-reasoning paradigms such as prompting and reasoning or fine-tuning and reasoning, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of reasoning. To fill this gap, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting a textual hint (manually designed or trained on concise data) during token generation of the reasoning process. In addition, ConciseHint adapts to the complexity of the query by adjusting the hint intensity, which ensures that it does not undermine model performance. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning while maintaining performance. For instance, we reduce the reasoning length by 65% on the GSM8K benchmark with Qwen-3 4B with almost no accuracy loss.
最近,DeepSeek-R1 和 OpenAI O1 系列等大型推理模型(LRMs)的进步显著提高了复杂推理任务的业绩。然而,一个新出现的问题是,它们倾向于产生过度粗略推理过程,从而导致效率低下问题。关于提高效率的现有文献主要遵循先入为主的范式,如迅速和推理或微调和推理等,但忽视直接鼓励模型通过在推理产生过程中进行干预来简明发言这一有希望的方向。为了填补空白,我们提议了一个称为ConciseHint的框架,这个框架不断鼓励推理模型通过在推理过程象征性生成过程中输入文字提示(用简单设计或训练的简明数据)来简明发言。此外,ConciseHint通过适应性调整提示强度或微调和推理来适应查询的复杂性,从而确保它不会损害模型的性能。关于最先进的LRMs的实验,包括DeepS-R1 和Q-3 QR8 的精确度,这个框架不断鼓励推理学以文字提示来简明推理我们降低成本的方法。
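The idea of intervening during generation can be sketched as a decoding loop that periodically splices a hint into the running output. The sketch below is a toy under stated assumptions and not the authors' implementation: the hint text, the `next_token` stub, and the schedule (an injection interval that widens linearly with the number of generated tokens, so that long reasoning receives sparser hints) are all hypothetical choices made for illustration.

```python
from typing import Callable, List

HINT = "<hint: be concise>"  # hypothetical hint text


def generate_with_hints(next_token: Callable[[List[str]], str],
                        max_tokens: int,
                        base_interval: int = 4,
                        growth: float = 0.5) -> List[str]:
    """Generate tokens, injecting HINT at intervals that widen as output grows."""
    out: List[str] = []
    generated = 0    # model-produced tokens only, hints excluded
    since_hint = 0   # model-produced tokens since the last hint
    while generated < max_tokens:
        # Assumed schedule: the interval grows with the current length.
        interval = base_interval + int(growth * generated)
        if since_hint >= interval:
            out.append(HINT)
            since_hint = 0
            continue
        tok = next_token(out)  # stand-in for one decoding step of a model
        if tok == "<eos>":
            break
        out.append(tok)
        generated += 1
        since_hint += 1
    return out


# Dummy "model" that always emits the same token.
tokens = generate_with_hints(lambda context: "t", max_tokens=12)
print(tokens)
```

With a real model, `next_token` would condition on the context including the injected hints, which is what lets the hint steer later tokens.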
Article 211
Title@2025-06-24 (2): KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation
Title: KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation | KAG-Thinker: Interactive Thinking und Deep Reasoning in LLMs über wissensbasierte Generation | KAG- Thinker: 通过知识型一代在LLMs中互动思考和深智 2506.17728v2 |
Authors (17): Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn interactive thinking and deep reasoning framework powered by a dedicated parameter-light large language model (LLM). Our approach constructs a structured thinking process for solving complex problems, enhancing the logical coherence and contextual consistency of the reasoning process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. Following the Logical Form guided retrieval and reasoning technology route of KAG, this framework first decomposes complex questions into independently solvable sub-problems (also referred to as logical forms) through breadth decomposition. Each such logical form is represented in two equivalent forms, natural language and logical function, and subsequently classified as either a Knowledge Retrieval or a Reasoning Analysis task. Dependencies and parameter passing between these tasks are explicitly modeled via logical function interfaces. In the solving process, the Retrieval function performs retrieval tasks, retrieving one-hop structured and unstructured information about a specified knowledge unit, while the Math and Deduce functions perform reasoning analysis tasks. Secondly, it is worth noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the knowledge boundary module to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the depth solving module to enhance the comprehensiveness of knowledge acquisition…
在此文件中, 我们引入 KAG- Thinker , KAG- Thinker 将 KAG 升级为多方向互动思维和深层次推理框架, 由专门的参数光大语言模型( LLM ) 驱动。 我们的方法构建了一个结构化的思维过程, 以解决复杂问题, 提高解答过程的逻辑一致性和背景一致性, 提高解答过程的逻辑性( A) , 提高LLMS 中特定领域知识基础( KB) 的逻辑性( KB) 逻辑性( KB) 任务。 在 KAG 的逻辑性功能界面中, 引导的检索和推理技术路径, 这个框架首先将复杂的问题分解为独立可溶的子问题( 也被称为逻辑形式 ) 。 通过\ textb{ breadd decommission , 我们用一种对等值的、 结构化和不精确性( ) 和不精确性( ) 定义性( ) 分析, 我们用数学和排序的解算算的外部( ) 任务) 工具, 和排序( 学习( ) 理解( ) 分析) 和排序( ) 分析) 分析) , 我们用数学和排序( 分析) 分析) 使用这种算算算算算算法性( 和排序( 算算) 进行 算) 分析) 等值( 等等值( 等值) 算) 分析) 。
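The decomposition into logical forms with explicit dependencies can be pictured as a small dependency graph that is solved in order. The following sketch is an assumption-laden illustration, not KAG-Thinker's actual interface: the `LogicalForm` class, its fields, the `solve` helper and the toy values are hypothetical; only the task kinds (Retrieval, Math, Deduce) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LogicalForm:
    name: str
    kind: str      # "Retrieval", "Math" or "Deduce"
    question: str  # natural-language form of the sub-problem
    run: Callable[[Dict[str, object]], object]  # stand-in for the logical function
    depends_on: List[str] = field(default_factory=list)


def solve(forms: List[LogicalForm]) -> Dict[str, object]:
    """Run sub-problems in dependency order, passing earlier results forward."""
    results: Dict[str, object] = {}
    pending = list(forms)
    while pending:
        ready = [f for f in pending if all(d in results for d in f.depends_on)]
        if not ready:
            raise ValueError("cyclic or missing dependency")
        for form in ready:
            results[form.name] = form.run(results)
            pending.remove(form)
    return results


# Toy question with invented values: retrieval stubs feed a Math step.
forms = [
    LogicalForm("total", "Math", "What is the combined population?",
                lambda r: r["pop_a"] + r["pop_b"], ["pop_a", "pop_b"]),
    LogicalForm("pop_a", "Retrieval", "What is the population of city A?",
                lambda r: 120),
    LogicalForm("pop_b", "Retrieval", "What is the population of city B?",
                lambda r: 80),
]
print(solve(forms)["total"])
```

Keeping the dependencies explicit means the two retrieval steps are independent and could, in principle, be issued in parallel before the Math step runs.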
Article 212
Title@2025-06-24 (2): Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects
Title: Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Fake oder Real, Können Roboter erzählen? Evaluieren von körpereigenen Vision-Sprachenmodellen auf realen und 3D-gedruckten Objekten | 假的还是假的,机器人能告诉吗?评价关于真物和3D实用物的内嵌视觉语言模型 2506.19579v1 |
Authors (3): Federico Tavella, Kathryn Mearns, Angelo Cangelosi
Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, like BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.
机器人场景理解日益依赖于视觉语言模型( VLMs) 来生成环境的自然语言描述。 在这项工作中, 我们展示了对配有 RGB 相机的机器人臂所捕捉的桌面场景标题战略的比较研究。 机器人从多个角度收集物体图像, 我们评估了产生场景描述的几种模型。 我们比较了各种字幕模型的性能, 如 BLIP 和 VLMs 。 我们的实验考察了单视图和多视角字幕之间的取舍, 以及识别真实世界和3D 打印对象之间的差异。 我们量化地评估了生成的字幕的物体识别准确性、 完整性和自然性。 结果显示, VLMS 可以在需要识别共同对象的机器人环境中使用, 但却无法概括新的描述。 我们的发现为在真实世界环境中应用集成剂的基础模型提供了实际的洞察力。
Article 213
Title@2025-06-24 (2): Benchmarking the Pedagogical Knowledge of Large Language Models
Title: Benchmarking the Pedagogical Knowledge of Large Language Models | Benchmarking der pädagogischen Kenntnisse großer Sprachmodelle | 确定大语言模式教学知识基准 2506.18710v2 |
Authors (10): Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI’s knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models’ understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models’ capacities to understand pedagogical concepts, respond appropriately to learners’ needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.
大量多任务语言理解(MMLU)等基准在评估大赦国际在不同领域的知识和能力方面发挥了关键作用。然而,现有基准主要侧重于内容知识,在评估模型对教学方法和做法(教学方法和做法)的理解方面留下了重大差距。本文介绍了“教学基准”,这是一套新颖的数据集,旨在评价大语言模式的跨多任务教学知识(CDPK)和特殊教育需要和残疾(SEND)教学知识。这些基准是建立在由教师专业发展考试(包括教学战略和评估方法等一系列教学有效次级内容)精心整理的一组问题之上的。我们在这里概述了这些基准的方法和发展情况。我们报告97个模型的结果,在教学知识问题方面从28%到89%不等。我们考虑了成本和准确性之间的关系,并描绘了基于不同价值前沿的模型。我们在https://rebrand.ly/peagogy 上提供了在线领导板,它们以新的模型更新了有效的教学次级内容支持,并且根据不同成本和成本学习主题,将交互式探索和过滤能力决定。
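The cost-versus-accuracy analysis mentioned above rests on the notion of a Pareto frontier: the models for which no cheaper-or-equal model is more accurate. A minimal sketch follows; the tuple layout, model names and numbers are invented toy values, not leaderboard results.

```python
from typing import List, Tuple

# Hypothetical record: (name, cost per million tokens, accuracy).
Model = Tuple[str, float, float]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return models not dominated by a cheaper-or-equal, more accurate model."""
    frontier: List[Model] = []
    best_accuracy = float("-inf")
    # Sort by ascending cost, breaking ties by descending accuracy.
    for model in sorted(models, key=lambda m: (m[1], -m[2])):
        if model[2] > best_accuracy:
            frontier.append(model)
            best_accuracy = model[2]
    return frontier


models = [
    ("model-a", 0.5, 0.62),
    ("model-b", 2.0, 0.60),   # dominated: costs more than model-a, less accurate
    ("model-c", 3.0, 0.81),
    ("model-d", 10.0, 0.89),
]
print([name for name, _, _ in pareto_frontier(models)])
```

Recomputing the frontier on the set of models available at each release date is one way to chart how it moves over time.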
Article 214
Title@2025-06-24 (2): Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress
Title: Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress | Hat die maschinelle Übersetzungsbewertung die menschliche Parität erreicht? Die menschliche Referenz und die Grenzen des Fortschritts | 机器翻译评价是否实现了人类平等?人类参考和进展的极限 2506.19571v1 |
Authors (3): Lorenzo Proietti, Stefano Perrella, Roberto Navigli
In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics’ capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.
在机器翻译(MT)评价中,根据与人类判断的一致,评估衡量业绩的标准。近年来,自动衡量显示与人类的一致程度越来越高。为了更清楚地了解衡量业绩并确立一个上限,我们把人类基线纳入MT元评价,即对MT计量能力的评估。我们的结果显示,人类评分者并不一贯优于自动衡量标准,最先进的衡量标准往往与人类基线相当或更高。尽管这些调查结果表明人类对等,但我们讨论了若干值得谨慎的理由。最后,我们探讨了我们的结果对研究领域产生的更广泛影响,我们问:我们能否继续可靠地衡量MT评价的改进?在这项工作中,我们的目标是阐明我们衡量实地进展的能力的局限性,促进讨论我们认为对整个MT评价界至关重要的一个问题。
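Meta-evaluation with a human baseline can be sketched as scoring an automatic metric and a held-out human annotator against the same reference judgments. The example below uses a simple pairwise ranking agreement as the measure; this is an illustrative choice, not necessarily the statistic used in the paper, and all scores are made up.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(scores: Sequence[float], reference: Sequence[float]) -> float:
    """Fraction of item pairs ranked in the same order as the reference.

    Pairs tied in the reference are skipped; a tie in `scores` counts as a miss.
    """
    agree = total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        total += 1
        if (scores[i] - scores[j]) * ref_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


reference_human = [90, 70, 50, 30]  # annotator treated as ground truth
second_human = [80, 85, 40, 35]     # held-out annotator acting as a baseline
metric = [0.9, 0.7, 0.6, 0.2]       # automatic metric scores

print("metric:", pairwise_agreement(metric, reference_human))
print("human baseline:", pairwise_agreement(second_human, reference_human))
```

When the metric's agreement matches or exceeds the human baseline, as in this toy data, further metric improvements become hard to distinguish from noise in the reference itself, which is the measurement limit the abstract raises.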
Article 215
Title@2025-06-24 (2): GeistBERT: Breathing Life into German NLP
Title: GeistBERT: Breathing Life into German NLP | GeistBERT: Das Leben in die deutsche NLP einatmen | 呼吸生命化为德国NLP 2506.11903v3 |
Authors (2): Raphael Scheible-Schmitt, Johann Frei
Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. It was pre-trained using fairseq with standard hyperparameters, initialized from GottBERT weights, and trained on a large-scale German corpus using Whole Word Masking (WWM). Based on the pre-trained model, we derived extended-input variants using Nyströmformer and Longformer architectures with support for sequences up to 8k tokens. While these long-context models were not evaluated on dedicated long-context benchmarks, they are included in our release. We assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification (GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The GeistBERT models achieved strong performance, leading all tasks among the base models and setting a new state-of-the-art (SOTA). Notably, the base models outperformed larger models in several tasks. To support the German NLP research community, we are releasing GeistBERT under the MIT license.
基于Transformer的语言模型的进展凸显了在高质量语料库上进行特定语言预训练的益处。在此背景下,德语NLP将受益于更新的架构以及针对德语语言特点定制的现代数据集。GeistBERT旨在通过在多样化语料库上进行增量训练并优化模型在各类NLP任务上的表现来改进德语处理。该模型使用fairseq和标准超参数进行预训练,以GottBERT权重初始化,并采用全词掩码(WWM)在大规模德语语料库上训练。基于该预训练模型,我们利用Nyströmformer和Longformer架构派生出扩展输入的变体,支持最长8k个词元的序列。这些长上下文模型虽未在专门的长上下文基准上评估,但也包含在我们的发布中。我们使用$F_1$分数和准确率,在NER(CoNLL 2003、GermEval 2014)和文本分类(GermEval 2018细粒度/粗粒度、10kGNAD)任务上评估了所有模型。GeistBERT模型表现强劲,在基础模型中所有任务均领先,并创造了新的最先进水平(SOTA)。值得注意的是,基础模型在若干任务上优于更大的模型。为支持德语NLP研究社区,我们以MIT许可证发布GeistBERT。
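Whole Word Masking differs from plain token masking in that all subword pieces of a selected word are masked together. The sketch below illustrates the idea under assumptions: it uses a WordPiece-style `##` continuation marker and a `[MASK]` token for readability, which is not how fairseq or GottBERT's byte-level BPE marks word boundaries, and the helper names are hypothetical.

```python
import random
from typing import List


def group_words(tokens: List[str]) -> List[List[int]]:
    """Group token indices into words; '##' marks a continuation piece."""
    words: List[List[int]] = []
    for idx, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(idx)
        else:
            words.append([idx])
    return words


def whole_word_mask(tokens: List[str], mask_prob: float, rng: random.Random,
                    mask_token: str = "[MASK]") -> List[str]:
    """Mask every piece of a sampled word, never only part of it."""
    masked = list(tokens)
    for word in group_words(tokens):
        if rng.random() < mask_prob:  # one draw per word, not per piece
            for idx in word:
                masked[idx] = mask_token
    return masked


tokens = ["Das", "Sprach", "##modell", "lernt", "Deutsch"]
print(whole_word_mask(tokens, mask_prob=0.3, rng=random.Random(0)))
```

Because the sampling decision is made once per word, "Sprach" and "##modell" are always masked or kept together, so the model cannot recover a masked piece from its visible neighbour within the same word.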
Article 216
Title@2025-06-24 (2): ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Title: ChatSR: Multimodal Large Language Models for Scientific Formula Discovery | ChatSR: Multimodale große Sprachmodelle für die wissenschaftliche Formel-Discovery | ChatSR: 科学公式发现多模式大语言模型 2406.05410v2 |
Authors (8): Yanjie Li, Lina Yu, Weijun Li, Min Wu, Jingyi Liu, Wenqiang Li, Shu Wei, Yusong Deng
Formulas are the language of communication between humans and nature. The discovery of formulas to describe natural laws from observational data is the purpose of scientific research. It is also an important research topic in artificial intelligence, where it is called the symbolic regression problem. Most existing symbolic regression methods generate expressions directly from observed data. Although some methods can inject prior knowledge into the model by adding constraints or introducing special character hints, they can only introduce a limited amount of prior knowledge specified in advance, let alone understand natural language instructions. In this article, building on the powerful knowledge reserve and language understanding ability of multi-modal large language models, we present ChatSR, which acts like a knowledgeable human scientist: we can tell it any prior knowledge through natural language to guide formula generation. In tests on 13 datasets, ChatSR not only shows state-of-the-art performance on traditional symbolic regression tasks but, more notably, also understands the prior knowledge contained in natural language prompts well and uses it to improve the quality of generated expressions. In addition, ChatSR shows good zero-shot capability to understand prior knowledge that is not present in the training data.
人类与自然之间的交流语言。发现用于描述观察数据自然法的公式是科学研究的目的。这也是人工智能中的一个重要研究课题,它被称为象征性回归问题。现有的多数象征性回归方法直接从观察到的数据中产生表达方式。虽然在某些方法中,我们可以通过添加限制或引入某些特殊字符提示的方式,将一些先前的知识注入模型。然而,这些方法只能引入有限的事先指定的知识;更不用说理解自然语言指令。在本条中,我们介绍像知识丰富的人类科学家一样具有强大知识储备和语言理解能力的ChatSR,我们可以通过自然语言告诉它任何先前的知识来指导公式的生成。通过测试13个数据集,ChatSR不仅显示传统象征性回归任务的最新表现,而且特别值得一提的是,ChatSR能够很好地理解以自然语言提示提供的先前知识,并提高生成的表达质量。此外,ChatSR具有很好的零射门能力来理解先前知识,而培训中却没有提供的数据。
Article 217
Title@2025-06-24 (2): DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Title: DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs | DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs | DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v2 |
Authors (4): Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
最近,大型语言模型(LLMS)已扩展到视频领域,使复杂的视频理解成为可能,然而,现有的视频LLMS往往在细微的时间推理方面表现出局限性,限制了它们精确地将反应归因于特定视频时刻的能力,特别是在受限制的监督下。我们引入了数据高效的视频LLM(DaMO),这是一个数据高效的视频LM(LLM),其设计明确是为了准确的时间推理和多式联运理解。就其核心而言,拟议的Temal-aware Fusefrent采用一个等级分级的双流结构,在每种模式中逐渐捕捉时间动态,有效地结合了相近的视觉和音频信息。为了进一步提高计算效率,DaMO整合了一种全球剩余,减少空间冗余,同时保存基本的语义细节。我们通过一个结构化的四阶段渐进式培训模式对DAMO进行培训,逐步为该模式配备了多式联运、语义定位和时间推理能力。这项工作还有助于从现有数据中增加多种数据集,GPT生成的以时间为基础的QA配对需要时间监督的任务。关于时间定位和视频QA基准的全面实验表明DMO一贯超过以前的模型,特别是在要求精确时间调整和推理算。
Article 218
Title@2025-06-24 (2): RCStat: A Statistical Framework for using Relative Contextualization in Transformers
Title: RCStat: A Statistical Framework for using Relative Contextualization in Transformers | RCStat: Statistischer Rahmen für die Verwendung der relativen Kontextualisierung in Transformern | RCStat: 在变异器中使用相对环境化的统计框架 2506.19549v1 |
Authors (4): Debabrata Mahapatra, Shubham Agarwal, Apoorv Saxena, Subrata Mitra
Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
在自动递减变压器中,先前关于投入式重要性的工作依赖于软体调整式的注意重量,这模糊了较富的软体前查询键记录结构。我们引入了RCStat,这是一个统计框架,通过相对背景化(RC)来利用原始关注记录,这是一个随机变量,用来测量代号段之间的背景一致性,并为RC带来有效的上限。我们展示了两种应用:(一) 关键价值压缩,基于驻地协调员的门槛驱动着适应性关键价值驱逐,以便大量减少缓冲,同时尽量减少质量损失;(二) 属性,即RC的产值高于软体后代号、句式和块级解释。在回答问题、总和归因基准方面,RCStat取得了重大的经验收益,在没有任何示范再培训的情况下交付了最先进的压缩和归因业绩。
Article 219
Title@2025-06-24 (2): Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
Title: Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection | Health Sentinel: Eine KI-Pipeline zur Erkennung von Krankheiten in Echtzeit | 健康哨兵:AI 实时疾病爆发检测管道 2506.19548v1 |
Authors (16): Devesh Pant, Rishi Raj Grandhe, Vipin Samaria, Mukul Paul, Sudhir Kumar, Saransh Khanna, Jatin Agrawal, Jushaan Singh Kalra, Akhil VSSG, Satish V Khalikar, Vipin Garg, Himanshu Chauhan, Pranay Verma, Neha Khandelwal, Soma S Dhavala, Minesh Mathew
Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel. It is a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by the public health experts at NCDC as potential outbreaks.
早期发现疾病爆发对于确保卫生当局及时干预至关重要。由于传统基于指标的监测方面的挑战,监测网上媒体等非正式来源越来越受欢迎。然而,由于每天刊登的在线文章数量很多,人工筛选文章是不切实际的。为此,我们提议健康哨兵。这是一个多阶段的信息提取管道,使用多阶段的ML和非ML方法,从网上文章中提取有关疾病爆发或其他异常健康事件的结构化信息。提取的活动提供给德里国家疾病控制中心的媒体扫描和核查小组(MSVC),以便分析、解释和进一步向地方机构传播,以便及时干预。从2022年4月至今,健康哨兵已经处理了3亿多篇新闻文章,并查明了印度各地95 000多起独特的健康事件,其中3 500多起事件被全国疾病控制中心公共卫生专家列为可能爆发的短清单。
Article 220
Title@2025-06-24 (2): KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs
Title: KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs | KnowMap: Effiziente, wissensgetriebene Aufgabenanpassung für LLMs | KnowMap: 高效知识驱动任务适应LLMs 2506.19527v1 |
Authors (2): Kelin Fu, Kaigui Bian
While Large Language Models (LLMs) possess significant capabilities in open-world agent tasks, they also face challenges in rapidly adapting to new, specialized tasks due to their reliance on static pre-trained knowledge. Traditional methods such as fine-tuning are often costly and data-intensive, and may lead to “catastrophic forgetting.” Therefore, we present KnowMap, a novel approach that dynamically constructs a knowledge base from environmental and experiential data. KnowMap fine-tunes a small knowledge-embedding model to equip a larger LLM with valuable task-specific knowledge. Our experiments on the ScienceWorld benchmark demonstrate a 17.71% improvement in the performance of the gpt-4-turbo model. KnowMap not only provides an efficient and effective means for LLM task adaptation, but also highlights how integrating environmental and experiential knowledge can enhance LLMs’ reasoning capabilities.
虽然大语言模型(LLMS)在开放世界的代理任务方面拥有巨大的能力,但它们也面临着迅速适应新的专门任务的挑战,因为它们依赖静态的预先培训知识。 微调等传统方法往往成本高,数据密集,可能导致“灾难性的遗忘 ” 。 因此,我们介绍KnowMap,这是一个创新方法,它能动态地从环境和实验数据中构建知识库。KnowMap微调一个小型的知识组合模型,用来为大型LLMS配置宝贵的具体任务知识。我们在科学世界基准方面的实验显示,Gpt-4-turbo模型的性能提高了17.71%。 KnowMap不仅为LM任务适应提供了高效和有效的手段,而且还突出了环境和实验知识的整合如何提高LLMS的推理能力。
Article 221
Title@2025-06-24 (2): Automatic Posology Structuration : What role for LLMs?
Title: Automatic Posology Structuration : What role for LLMs? | Automatische Posologie Structuration : Welche Rolle spielen LLMs? | 自动人口结构结构:LLMS发挥什么作用? 2506.19525v1 |
Authors (6): Natalia Bobkova, Laura Zanella-Calzada, Anyes Tafoughalt, Raphaël Teboul, François Plesse, Félix Gaschi
Automatically structuring posology instructions is essential for improving medication safety and enabling clinical decision support. In French prescriptions, these instructions are often ambiguous, irregular, or colloquial, limiting the effectiveness of classic ML pipelines. We explore the use of Large Language Models (LLMs) to convert free-text posologies into structured formats, comparing prompt-based methods and fine-tuning against a “pre-LLM” system based on Named Entity Recognition and Linking (NERL). Our results show that while prompting improves performance, only fine-tuned LLMs match the accuracy of the baseline. Through error analysis, we observe complementary strengths: NERL offers structural precision, while LLMs better handle semantic nuances. Based on this, we propose a hybrid pipeline that routes low-confidence cases from NERL (<0.8) to the LLM, selecting outputs based on confidence scores. This strategy achieves 91% structuration accuracy while minimizing latency and compute. Our results show that this hybrid approach improves structuration accuracy while limiting computational cost, offering a scalable solution for real-world clinical use.
自动对用药剂量说明进行结构化,对于提高用药安全和支持临床决策至关重要。在法语处方中,这些说明往往含糊、不规范或口语化,限制了传统机器学习流水线的效果。我们探索使用大语言模型(LLM)将自由文本的剂量说明转换为结构化格式,并将基于提示的方法和微调方法与一个基于命名实体识别与链接(NERL)的“前LLM”系统进行比较。结果表明,提示方法虽能提升性能,但只有经过微调的LLM才能达到基线的准确率。通过错误分析,我们观察到二者优势互补:NERL在结构上更精确,而LLM更善于处理语义细微差别。据此,我们提出一种混合流水线,将NERL的低置信度样本(<0.8)交由LLM处理,并根据置信度分数选择输出。该策略在尽量降低延迟和计算量的同时,达到91%的结构化准确率。结果表明,这种混合方法在限制计算成本的同时提高了结构化准确率,为真实临床应用提供了可扩展的方案。
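The confidence-based routing described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Structured` record, the `hybrid_structure` function, and the two toy systems are hypothetical names introduced here, and the rule "keep whichever output reports the higher confidence" is one assumed reading of "selecting outputs based on confidence scores". Only the 0.8 threshold comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Structured:
    """Hypothetical container for one structured posology."""
    fields: Dict[str, str]  # e.g. {"dose": "1", "unit": "tablet"}
    confidence: float       # score in [0, 1] reported by the system
    source: str             # which system produced it: "NERL" or "LLM"


def hybrid_structure(
    text: str,
    nerl: Callable[[str], Structured],
    llm: Callable[[str], Structured],
    threshold: float = 0.8,
) -> Structured:
    """Run the cheap NERL system first; call the LLM only when NERL is unsure."""
    first = nerl(text)
    if first.confidence >= threshold:
        return first  # confident case: no LLM call, so no extra latency
    second = llm(text)
    # Assumed selection rule: keep the output with the higher confidence.
    return second if second.confidence > first.confidence else first


# Toy stand-ins for the two systems (purely illustrative behaviour).
def toy_nerl(text: str) -> Structured:
    confidence = 0.95 if "mg" in text else 0.40
    return Structured({"raw": text}, confidence, "NERL")


def toy_llm(text: str) -> Structured:
    return Structured({"raw": text}, 0.70, "LLM")
```

With these stand-ins, a regular instruction such as "500 mg matin et soir" never reaches the LLM, while a colloquial one falls below the threshold and is routed onward. That is how such a design limits compute to the hard cases.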
Article 222
Title@2025-06-24 (2): heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation
Title: heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation | heiDS bei ArchEHR-QA 2025: Von Fixed-k zu Query-dependent-k für Retrieval Augmented Generation | heiDS在ArchEHR-QA 2025:检索增强生成中从固定k到依赖查询的k 2506.19512v1 |
Authors (2): Ashish Chouhan, Michael Gertz
This paper presents the approach of our team, heiDS, for the ArchEHR-QA 2025 shared task. A pipeline using a retrieval augmented generation (RAG) framework is designed to generate answers that are attributed to clinical evidence from the electronic health records (EHRs) of patients in response to patient-specific questions. We explored various components of a RAG framework, focusing on ranked list truncation (RLT) retrieval strategies and attribution approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a query-dependent-k retrieval strategy, including the existing surprise and autocut methods and two new methods proposed in this work, autocut* and elbow. The experimental results show the benefits of our strategy in producing factual and relevant answers when compared to a fixed k.
本文介绍了我们的团队 heiDS 在 ArchEHR-QA 2025 共享任务中采用的方法。我们设计了一条基于检索增强生成(RAG)框架的流水线,针对患者提出的具体问题,生成以患者电子健康记录(EHR)中的临床证据为依据的答案。我们探讨了RAG框架的各个组成部分,重点关注排序列表截断(RLT)检索策略和归因方法。我们没有使用固定的 top-k RLT 检索策略,而是采用依赖查询的 k 检索策略,包括现有的 surprise 和 autocut 方法,以及本文提出的两种新方法 autocut* 和 elbow。实验结果表明,与固定 k 相比,我们的策略在生成符合事实且相关的答案方面具有优势。
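To make the idea of a query-dependent k concrete, here is a small sketch of two generic ranked-list truncation heuristics. The abstract does not define surprise, autocut, autocut* or elbow, so the functions below are not those methods. `largest_gap_k` and `elbow_k` are hypothetical names for two textbook cut-off rules, shown only to illustrate how k can vary with the score distribution of each query.

```python
from typing import Optional, Sequence


def largest_gap_k(scores: Sequence[float], max_k: Optional[int] = None) -> int:
    """Cut the ranked list just before the largest drop between neighbours."""
    ranked = sorted(scores, reverse=True)
    if max_k is not None:
        ranked = ranked[:max_k]
    if len(ranked) < 2:
        return len(ranked)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the biggest drop
    return cut + 1  # keep everything above the drop


def elbow_k(scores: Sequence[float]) -> int:
    """Keep results up to the point farthest from the first-to-last chord."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    if n < 3:
        return n
    dx = n - 1
    dy = ranked[-1] - ranked[0]

    def distance(i: int) -> float:
        # Unnormalised point-to-line distance; the shared denominator
        # does not change which index is the maximum.
        return abs(dy * i - dx * (ranked[i] - ranked[0]))

    return max(range(n), key=distance) + 1
```

For a query whose retrieval scores are `[0.92, 0.90, 0.88, 0.41, 0.39, 0.10]`, both heuristics keep three passages. A query with a single dominant hit would get a smaller k, which a fixed top-k cannot do.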
Article 223
Title@2025-06-24 (2): AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
Title: AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models | AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization für KV Cache in großen Sprachmodellen | AnTKV: 大语言模型中 KV 缓存的 Anchor Token- Aware 子Bit 矢量定量 2506.19505v1 |
Authors (9): Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu
Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models (LLMs). Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. We observe that quantizing the KV cache of different tokens has varying impacts on the quality of attention outputs. To systematically investigate this phenomenon, we perform forward error propagation analysis on attention and propose the Anchor Score (AnS) that quantifies the sensitivity of each token’s KV cache to quantization-induced error. Our analysis reveals significant disparities in AnS across tokens, suggesting that preserving a small subset of high-AnS tokens in full precision (FP16) can greatly mitigate accuracy loss in aggressive quantization scenarios. Based on this insight, we introduce AnTKV, a novel framework that leverages Anchor Token-aware Vector Quantization to compress the KV cache. Furthermore, to support efficient deployment, we design and develop a Triton kernel that is fully compatible with FlashAttention, enabling fast online Anchor Token selection. AnTKV enables LLaMA-3-8B to handle context lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x higher decoding throughput compared to the FP16 baseline. Our experimental results demonstrate that AnTKV matches or outperforms prior works such as KIVI, SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves significantly lower perplexity under ultra-low-bit quantization on Mistral-7B, with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of 4.73.
量化已成为减少大语言模型(LLM)中KV缓存内存占用的一种有效且轻量的方案。然而,如何尽量降低超低比特KV缓存量化带来的性能下降仍是一项重大挑战。我们观察到,对不同词元的KV缓存进行量化,对注意力输出质量的影响各不相同。为系统研究这一现象,我们对注意力进行前向误差传播分析,并提出锚点分数(Anchor Score,AnS),用于量化每个词元的KV缓存对量化误差的敏感程度。分析显示,不同词元的AnS差异显著,这表明以全精度(FP16)保留一小部分高AnS词元,可以在激进量化场景下大幅减轻精度损失。基于这一洞察,我们提出AnTKV,这是一个利用锚点词元感知向量量化来压缩KV缓存的新框架。此外,为支持高效部署,我们设计并开发了与FlashAttention完全兼容的Triton内核,实现快速的在线锚点词元选择。AnTKV使LLaMA-3-8B能够在单张80GB A100 GPU上处理最长840K词元的上下文,同时解码吞吐量相比FP16基线最高提升至3.5倍。实验结果表明,在4比特设置下,AnTKV与KIVI、SKVQ、KVQuant和CQ等已有工作相当或更优。更重要的是,在Mistral-7B上,AnTKV在超低比特量化下的困惑度显著更低:1比特时仅为6.32,0.375比特时为8.87,而FP16基线为4.73。
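The core idea in this abstract, keeping a few sensitive tokens in full precision and quantizing the rest aggressively, can be illustrated with a toy sketch. Several things here are assumptions rather than the paper's method: the per-token `scores` are simply given as input (the Anchor Score formula is not stated in the abstract), plain per-vector uniform scalar quantization stands in for the paper's vector quantization, and `quantize_uniform` and `compress_cache` are hypothetical names.

```python
from typing import List, Sequence, Set, Tuple


def quantize_uniform(vec: Sequence[float], bits: int) -> List[float]:
    """Snap each value to one of 2**bits evenly spaced levels over the vector's range."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]


def compress_cache(
    cache: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.01,
    bits: int = 2,
) -> Tuple[List[List[float]], Set[int]]:
    """Keep the highest-scoring tokens untouched and quantize all the others."""
    n = len(cache)
    n_keep = max(1, int(n * keep_ratio)) if n else 0
    by_score = sorted(range(n), key=lambda i: scores[i], reverse=True)
    anchors = set(by_score[:n_keep])  # indices left in full precision
    compressed = [
        list(vec) if i in anchors else quantize_uniform(vec, bits)
        for i, vec in enumerate(cache)
    ]
    return compressed, anchors
```

The sketch returns dequantized values so the error is easy to inspect. A real cache would store the integer codes plus the range of each vector, and that is where the memory saving comes from.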
Article 224
Title@2025-06-24 (2): NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling
Title: NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling | NaviAgent: Bilevel-Planung auf Werkzeugabhängigkeitsgraphen für Funktionsaufruf | NaviAgent: 功能调用工具依赖图双层规划 2506.19500v1 |
Authors (5): Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, TianLong Li
LLMs’ reliance on static knowledge and fragile tool invocation severely hinders the orchestration of complex, heterogeneous toolchains, particularly at large scales. Existing methods typically use rigid single-path execution, resulting in poor error recovery and exponentially growing search spaces. We introduce NaviAgent, a graph-navigated bilevel planning architecture for robust function calling, comprising a Multi-Path Decider and Graph-Encoded Navigator. As an LLM-powered agent, the Multi-Path Decider defines a four-dimensional decision space and continuously perceives environmental states, dynamically selecting the optimal action to fully cover all tool invocation scenarios. The Graph-Encoded Navigator constructs a Tool Dependency Heterogeneous Graph (TDHG), where node embeddings explicitly fuse API schema structure with historical invocation behavior. It also integrates a novel heuristic search strategy that guides the Decider toward efficient and highly successful toolchains, even for unseen tool combinations. Experiments show that NaviAgent consistently achieves the highest task success rate (TSR) across all foundation models and task complexities, outperforming the average baselines (ReAct, ToolLLM, α-UMI) by 13.5%, 16.4%, and 19.0% on Qwen2.5-14B, Qwen2.5-32B, and Deepseek-V3, respectively. Its execution steps are typically within one step of the most efficient baseline, ensuring a strong balance between quality and efficiency. Notably, a fine-tuned Qwen2.5-14B model achieves a TSR of 49.5%, surpassing the much larger 32B model (44.9%) under our architecture. Incorporating the Graph-Encoded Navigator further boosts TSR by an average of 2.4 points, with gains of over 9 points on complex tasks for larger models (Deepseek-V3 and GPT-4o), highlighting its essential role in toolchain orchestration.
LLM对静态知识的依赖以及脆弱的工具调用,严重妨碍了复杂、异构工具链的编排,在大规模场景下尤其如此。现有方法通常采用僵化的单路径执行,导致错误恢复能力差、搜索空间呈指数级增长。我们提出NaviAgent,这是一种面向稳健函数调用的图导航双层规划架构,由多路径决策器(Multi-Path Decider)和图编码导航器(Graph-Encoded Navigator)组成。多路径决策器是由LLM驱动的智能体,它定义了一个四维决策空间并持续感知环境状态,动态选择最优动作,以完整覆盖所有工具调用场景。图编码导航器构建工具依赖异构图(TDHG),其节点嵌入将API模式结构与历史调用行为显式融合;它还集成了一种新的启发式搜索策略,引导决策器找到高效且成功率高的工具链,即使面对未见过的工具组合也是如此。实验表明,NaviAgent在所有基础模型和任务复杂度上始终取得最高的任务成功率(TSR),在Qwen2.5-14B、Qwen2.5-32B和Deepseek-V3上分别比基线(ReAct、ToolLLM、α-UMI)的平均水平高出13.5%、16.4%和19.0%。其执行步数通常与最高效的基线相差不超过一步,在质量与效率之间取得了良好平衡。值得注意的是,在我们的架构下,经过微调的Qwen2.5-14B模型的TSR达到49.5%,超过了规模大得多的32B模型(44.9%)。引入图编码导航器可使TSR平均再提高2.4个百分点,在较大模型(Deepseek-V3和GPT-4o)的复杂任务上提升超过9个百分点,凸显了其在工具链编排中的关键作用。
Article 225
Title@2025-06-24 (2): Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs
Title: Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs | Ist Long-to-Short ein kostenloses Mittagessen? Untersuchung von Inkonsistenz und Reasoning-Effizienz in LRMs | 长到短是免费午餐吗?调查LRM中的不一致性和推理效率 2506.19492v1 |
Authors (6): Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce ICBENCH, a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). Applying ICBENCH to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread “scheming” behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.
大型推理模型(LRM)通过在给出最终答案之前进行长时间推理,在复杂任务上取得了显著成绩,但这一优势也带来了过度思考的风险,即对简单任务也会生成过多词元。近期的高效推理研究力求在保持准确性的同时缩短推理长度,但这种优化是否真的是免费午餐仍不清楚。基于压缩推理可能降低模型回答的稳健性并导致模型遗漏关键推理步骤这一直觉,我们研究高效推理策略是否会引入行为上的不一致。为系统评估这一点,我们提出ICBENCH,这是一个从三个维度衡量LRM不一致性的基准:跨任务设置的不一致(ITS)、训练目标与习得行为之间的不一致(TR-LB),以及内部推理与自我解释之间的不一致(IR-SE)。将ICBENCH应用于一系列开源LRM后,我们发现,虽然较大的模型通常比较小的模型更为一致,但它们都普遍表现出“图谋”(scheming)行为,包括自相矛盾、事后合理化以及隐瞒推理线索。关键的是,我们的结果表明,No-Thinking和Simple Token-Budget等高效推理策略会一致地加剧上述三类不一致。这些发现表明,尽管高效推理提高了词元层面的效率,但仍有必要进一步研究它是否同时带来模型逃避有效监督的风险。
Article 226
Title@2025-06-24 (2): Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning
Title: Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning | Dialogische Pädagogik für große Sprachmodelle: Konversationale KI mit bewährten Theorien des Lernens ausrichten | 大语言模型对话教学法:使对话式AI与成熟的学习理论相一致 2506.19484v1 |
Authors (1): Russell Beale
Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky’s sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard’s conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models’ tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.
大语言模型(LLM)能够提供丰富的对话式学习体验,正在迅速改变教育。本文全面综述了基于LLM的对话智能体在高等教育中的应用,并延伸至中等教育和终身学习情境。我们综合了有关教育领域LLM的现有文献以及对话式教学理论,包括维果茨基的社会文化学习理论(支架式教学与最近发展区)、苏格拉底式方法和Laurillard的对话框架,并考察提示策略和检索增强生成(RAG)如何使LLM的行为与这些教学理论相一致,以及它们如何支持个性化、自适应的学习。我们将教育理论与LLM的能力相对应,指出LLM驱动的对话在哪些方面支持既有的学习原则,又在哪些方面挑战或未能满足传统教学假设。我们还指出了将既有理论应用于LLM时的明显缺口,例如模型倾向于直接给出答案而非促进知识的共同建构,以及需要考虑LLM导师随时可用、知识广博但并非人类专家的特点。对此,我们提出了使LLM交互更符合良好教学法的实用策略,例如设计鼓励苏格拉底式提问、支架式引导和学生反思的提示,并整合检索机制以确保准确性和情境相关性。我们的目标是弥合教育理论与新兴的AI驱动对话式学习实践之间的差距,为使基于LLM的对话更具教育成效、更符合理论提供见解和工具。
Article 227
Title@2025-06-24 (2): Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models
Title: Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models | Commonsense-Erstellung und -Evaluierung für Dialogsysteme mit großen Sprachmodellen | 使用大语言模式生成和评估对话系统 2506.19483v1 |
Authors (4): Marcos Estecha-Garitagoitia, Chen Zhang, Mario Rodríguez-Cantelar, Luis Fernando D’Haro
This paper provides preliminary results on exploring the task of performing turn-level data augmentation for dialogue systems based on different types of commonsense relationships, and the automatic evaluation of the generated synthetic turns. The proposed methodology takes advantage of the extended knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs) to follow instructions, understand contextual information, and their commonsense reasoning capabilities. The approach draws inspiration from methodologies like Chain-of-Thought (CoT), applied more explicitly to the task of prompt-based generation for dialogue-based data augmentation conditioned on commonsense attributes, and the automatic evaluation of the generated dialogues. To assess the effectiveness of the proposed approach, we first extract 200 randomly selected partial dialogues from 5 different well-known dialogue datasets and generate alternative responses conditioned on different event commonsense attributes. This novel dataset allows us to measure the proficiency of LLMs in generating contextually relevant commonsense knowledge, particularly for up to 12 specific ATOMIC [10] database relations. Secondly, we propose an evaluation framework to automatically detect the quality of the generated dataset, inspired by the ACCENT [26] metric, which offers a nuanced approach to assessing event commonsense. However, our method does not follow ACCENT’s complex event-relation tuple extraction process. Instead, we propose an instruction-based prompt for each commonsense attribute and use state-of-the-art LLMs to automatically detect the original attributes used when creating each augmented turn in the previous step. Preliminary results suggest that our approach effectively harnesses LLMs’ capabilities for commonsense reasoning and evaluation in dialogue systems.
本文给出了一项初步研究的结果:基于不同类型的常识关系,为对话系统进行话轮级数据增强,并对生成的合成话轮进行自动评估。所提出的方法利用了预训练大语言模型(LLM)的广博知识和零样本能力,包括遵循指令、理解上下文信息以及常识推理的能力。该方法借鉴了思维链(CoT)等方法,并将其更明确地应用于以常识属性为条件、基于提示的对话数据增强生成任务,以及对生成对话的自动评估。为评估该方法的有效性,我们首先从5个知名对话数据集中随机抽取200段部分对话,并生成以不同事件常识属性为条件的替代回复。这一新数据集使我们能够衡量LLM生成与上下文相关的常识知识的能力,涵盖ATOMIC [10] 数据库中多达12种具体关系。其次,我们受ACCENT [26] 指标启发,提出一个自动检测生成数据集质量的评估框架,该指标为评估事件常识提供了细致的方法。不过,我们的方法并未沿用ACCENT复杂的事件关系元组抽取流程,而是为每种常识属性设计基于指令的提示,并使用最先进的LLM自动检测上一步生成每个增强话轮时所用的原始属性。初步结果表明,我们的方法有效利用了LLM在对话系统中进行常识推理和评估的能力。
Article 228
Title@2025-06-24 (2): LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages
Title: LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages | LEVOS: Leveraging Vocabulary Overlap mit Sanskrit zur Generierung von technischen Lexikons in indischen Sprachen | LEVOS:利用梵文的词汇重叠来生成印度语言的技术词汇 2407.06331v2 |
Authors (4): Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi
Translating technical terms into lexically similar, low-resource Indian languages remains a challenge due to limited parallel data and the complexity of linguistic structures. We propose a novel use-case of Sanskrit-based segments for linguistically informed translation of such terms, leveraging subword-level similarity and morphological alignment across related languages. Our approach uses character-level segmentation to identify meaningful subword units, facilitating more accurate and context-aware translation. To enable this, we utilize a Character-level Transformer model for Sanskrit Word Segmentation (CharSS), which addresses the complexities of sandhi and morpho-phonemic changes during segmentation. We observe consistent improvements in two experimental settings for technical term translation using Sanskrit-derived segments, averaging 8.46 and 6.79 chrF++ scores, respectively. Further, we conduct a post hoc human evaluation to verify the quality assessment of the translated technical terms using automated metrics. This work has important implications for the education field, especially in creating accessible, high-quality learning materials in Indian languages. By supporting the accurate and linguistically rooted translation of technical content, our approach facilitates inclusivity and aids in bridging the resource gap for learners in low-resource language communities.
由于数据有限,语言结构复杂,因此,将技术术语转换为语言相似、资源较少的印度语言仍是一项挑战。我们提议对基于梵语的语系进行语言知情翻译,利用亚字级相似性和相关语言的形态调整。我们的方法是使用品格分解来确定有意义的次词单位,促进更准确和符合背景的翻译。为此,我们使用梵语分解(CharSS)的品格级变异器模型(CharSS),该模型处理分解期间沙沙希和轻声语变化的复杂性。我们观察到,在两个实验环境中,使用桑斯里特语分解的语系技术术语翻译技术术语在两个方面不断改进,分别平均为8.46和6.79chrF++分数。此外,我们开展了一项后期人类评价,以核实使用自动化指标对翻译技术术语进行的质量评估。这项工作对教育领域有重要影响,特别是在以印度语言制作无障碍、高质量的学习材料方面。我们的方法有助于准确和语言化地翻译技术内容。我们的方法有助于弥合低语言学习社区的资源差距。
Article 229
Title@2025-06-24 (2): MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
Title: MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages | MuBench: Bewertung der Mehrsprachigkeit von großen Sprachmodellen in 61 Sprachen | MuBench:评估大语言模型在61种语言上的多语言能力 2506.19468v1 |
Authors (10): Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, Yin Zheng
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench’s alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.
多语文大型语言模式(LLMS)正在迅速发展,新的模式往往要求支持越来越多的语言,但是,现有的评价数据集有限,缺乏跨语文的一致,因此对多种语文能力的评估在语言和技能覆盖面方面都支离破碎,为此,我们采用Mubench,这是一个涵盖61种语言并评估广泛能力的基准,我们评价了若干先进的多语文大型语言模式,发现声称的语言覆盖面与实际语言覆盖面之间存在明显差距,特别是英语与低资源语言之间持续的绩效差距。我们建议多语文一致性(MLC)作为分析工作瓶颈和指导模式改进的精确度的补充指标。最后,我们预先准备一套1.2B参数的中英方和中文模型,配有500B符号、不同语言比例和平行数据比例,以调查跨语文的传输动态。
Article 230
Title@2025-06-24 (2): Can Large Language Models Capture Human Annotator Disagreements?
Title: Can Large Language Models Capture Human Annotator Disagreements? | Können große Sprachmodelle menschliche Annotatoren-Disagreements erfassen? | 大语言模型能捕捉人类代言人分歧吗? 2506.19467v1 |
Authors (9): Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted “ground truth” labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs’ ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (Reinforcement learning with verifiable rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://github.com/EdisonNi-hku/Disagreement_Prediction.
人工标注差异(即标注分歧)在NLP中很常见,往往反映了任务主观性和样本歧义等重要信息。大语言模型(LLM)越来越多地被用于自动标注以减少人工投入,但对它们的评估通常侧重于预测多数投票得到的“真值”标签。然而,这些模型是否也能捕捉到有信息量的人工标注差异,目前仍不清楚。我们的工作针对这一空白,在无法获得重复人工标签的情况下,广泛评估了LLM预测标注分歧的能力。结果表明,LLM难以对分歧进行建模,而基于多数标签的评估可能忽视这一点。值得注意的是,RLVR式(基于可验证奖励的强化学习)推理通常能提升LLM的性能,却会降低其在分歧预测上的表现。我们的发现凸显了在分歧建模方面评估和改进LLM标注器的迫切需要。代码和数据见 https://github.com/EdisonNi-hku/Disagreement_Prediction。
Article 231
Title@2025-06-24 (2): Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights
Title: Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights | Mehrsprachige Tokenisierung durch die Linse indischer Sprachen: Herausforderungen und Einsichten | 从印度语言的视角看多语种分词:挑战与洞察 2506.17789v2 |
Authors (5): N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar
Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building more fair, efficient, and linguistically informed tokenizers for multilingual NLP.
分词在多语言NLP中起着关键作用。然而,现有分词器往往偏向高资源语言,限制了它们在语言多样、形态丰富的语言(如印度次大陆的语言)上的效果。本文对17种印度语言的分词策略进行了全面的内在评估。我们量化了自下而上与自上而下的分词算法(BPE和Unigram LM)之间的权衡以及词表大小的影响,并比较了联合训练和基于聚类的训练等多语言词表构建策略。我们还表明,极低资源语言可以受益于在相关高资源语言上训练的分词器。我们的研究为构建更公平、更高效、更具语言学依据的多语言NLP分词器提供了实用见解。
Article 232
Title@2025-06-24 (2): TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Title: TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems | TTSDS2: Ressourcen und Benchmark für die Bewertung von Human-Quality Text to Speech Systems | TTSDS2:评估达到人类质量的文本转语音系统的资源和基准 2506.19441v1 |
Authors (3): Christoph Minixhofer, Ondrej Klejch, Peter Bell
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
语言对语言系统的评价具有挑战性,而且需要大量资源。平均意见评分(MOS)等主观指标在各种作品之间不易比较。客观衡量标准经常使用,但很少对照主观衡量标准加以验证。两种衡量标准都受到最近的TTS系统的挑战,这些系统能够产生合成语言与真实评分无法区分的合成语言。在这项工作中,我们引入了语言评分2(TTSDS2)的文本,这是更加健全和改良的TTSDS版本。在一系列领域和语言中,16种比较指标中只有一种与Spearman在每一个领域和主观评分上高于0.50的相对关系有关。我们还释放了一系列资源,用于评价接近真实评分的合成语言:11 000多份主观评分的数据集;为不断重新创建多语言测试数据集以避免数据泄漏的管道;以及持续更新14种语言的TTS基准。
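The validation criterion in this abstract is a Spearman correlation between an objective metric and subjective scores. As a reminder of what that statistic computes, here is a small pure-Python sketch. The function names are introduced here for illustration and this is not TTSDS2 code. Spearman's rho is the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

As a worked example with made-up numbers, metric scores `[0.61, 0.72, 0.55, 0.80]` for four systems and listener ratings `[3.1, 3.9, 3.4, 4.5]` have ranks `[2, 3, 1, 4]` and `[1, 3, 2, 4]`, which gives rho = 0.8. Because only ranks matter, the metric and the ratings do not need to share a scale.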
Article 233
Title@2025-06-24 (2): Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Title: Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System | Mem4Nav: Förderung der Vision-und-Sprache-Navigation in städtischen Umgebungen mit einem Hierarchischen Raum-Kognition Long-Short-Memory-System | Mem4Nav:利用一个等级空间-定位长短记忆系统促进城市环境中的视觉和语言导航 2506.19433v1 |
Authors (6): Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
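大规模城市环境中的视觉-语言导航(VLN)要求具身智能体在复杂场景中理解并落实语言指令,并在较长的时间跨度内回忆相关经历。以往的模块化流水线具有可解释性,但缺乏统一的记忆;端到端的(M)LLM智能体擅长融合视觉与语言,却仍受限于固定的上下文窗口和隐式的空间推理。我们提出Mem4Nav,这是一种分层的空间认知长短期记忆系统,可增强任何VLN骨干模型。Mem4Nav将用于细粒度体素索引的稀疏八叉树与用于高层地标连通性的语义拓扑图相融合,并将二者存储在通过可逆Transformer嵌入的可训练记忆词元中。长期记忆(LTM)在八叉树节点和图节点上压缩并保留历史观测,短期记忆(STM)则以相对坐标缓存最近的多模态条目,用于实时避障和局部规划。在每一步,STM检索都会大幅裁剪动态上下文;当需要更久远的历史时,LTM词元会被无损解码以重建过去的嵌入。在Touchdown和Map2Seq上,针对三种骨干模型(模块化模型、采用基于提示的LLM的最先进VLN模型,以及采用跨步注意力MLLM的最先进VLN模型)进行评估,Mem4Nav使任务完成率提高7至13个百分点,SPD充分降低,nDTW提高超过10个百分点。消融实验证实分层地图和双重记忆模块都不可或缺。我们的代码已通过 https://github.com/tsinghua-fib-lab/Mem4Nav 开源。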
Article 234
Title@2025-06-24 (2): Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study
Title: Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study | Lernen, mit Sprach-VAEs latente Regeln zu entwirren: Eine systematische Studie | 学习解开语言 VAE 的边端理由解释规则:系统研究 2506.19418v1 |
Authors (4): Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas
Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within the LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiment illustrates the following findings: (1) Disentangled reasoning: under explicit signal supervision, reasoning rules - viewed as functional mappings - can be disentangled within the encoder’s parametric space. This separation results in distinct clustering of rules in the output feature space. (2) Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key. This approach offers a simple method for integrating prior knowledge into decoder-only language models. (3) Performance bottleneck: in mathematical reasoning tasks using Qwen2.5(0.5B), increasing the sample count doesn’t improve performance beyond a point. Moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model’s parameters.
在语言模型(LM)的潜在空间中引入显式推理规则,为增强泛化能力、可解释性和可控性提供了一条有前景的途径。当前基于Transformer的语言模型虽然在自然语言推理(NLI)任务上表现出色,但往往依赖记忆而非基于规则的推理。本文研究如何通过语言变分自编码器(VAE)将推理规则显式地嵌入并记忆在语言模型中。我们提出了一条在基于Transformer的语言VAE中学习推理规则的完整流水线,其中包括三个基于规则的推理任务、一个支撑性的理论框架以及一个实用的端到端架构。实验得到以下发现:(1)解耦推理:在显式信号监督下,被视为函数映射的推理规则可以在编码器的参数空间中被解耦,这种分离使规则在输出特征空间中形成明显的聚类。(2)先验知识注入:将推理信息注入Query,可使模型更有效地依据Key从记忆中检索所存储的Value,这为把先验知识整合进仅解码器语言模型提供了一种简单方法。(3)性能瓶颈:在使用Qwen2.5(0.5B)的数学推理任务中,样本数量增加到一定程度后不再带来性能提升;此外,在保持模型参数中推理规则的分离方面,FFN层优于注意力层。
Article 235
Title@2025-06-24 (2): Automated Detection of Pre-training Text in Black-box LLMs
Title: Automated Detection of Pre-training Text in Black-box LLMs | Automatisierte Erkennung von Vorschulungstexten in Black-Box LLMs | 黑箱LLM预训练文本的自动检测 2506.19399v1 |
Authors (6): Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang
Detecting whether a given text is a member of the pre-training data of Large Language Models (LLMs) is crucial for ensuring data privacy and copyright protection. Most existing methods rely on the LLM’s hidden information (e.g., model parameters or token probabilities), making them ineffective in the black-box setting, where only input and output texts are accessible. Although some methods have been proposed for the black-box setting, they rely on massive manual efforts such as designing complicated questions or instructions. To address these issues, we propose VeilProbe, the first framework for automatically detecting LLMs’ pre-training texts in a black-box setting without human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to infer the latent mapping feature between the input text and the corresponding output suffix generated by the LLM. It then performs key token perturbations to obtain more distinguishable membership features. Additionally, considering real-world scenarios where the ground-truth training text samples are limited, a prototype-based membership classifier is introduced to alleviate the overfitting issue. Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting.
检测给定文本是否属于大型语言模型(LLM)的预训练数据,对于确保数据隐私和版权保护至关重要。大多数现有方法依赖LLM的内部信息(例如模型参数或词元概率),因此在只能访问输入和输出文本的黑盒设置下无效。虽然已有一些面向黑盒设置的方法,但它们依赖大量人工工作,例如设计复杂的问题或指令。为了解决这些问题,我们提出VeilProbe,这是首个无需人工干预、在黑盒设置下自动检测LLM预训练文本的框架。VeilProbe利用序列到序列的映射模型来推断输入文本与LLM生成的相应输出后缀之间的潜在映射特征,然后进行关键词元扰动,以获得更具区分度的成员特征。此外,考虑到现实场景中真实训练文本样本有限,我们引入了基于原型的成员分类器来缓解过拟合问题。在三个广泛使用的数据集上的大量评估表明,我们的框架在黑盒设置下有效且更优。
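The prototype-based membership classifier mentioned in the VeilProbe abstract can be pictured with a small Python sketch. The abstract does not describe the classifier's internals, so this assumes the simplest variant, a nearest-class-mean rule; the two-dimensional feature vectors, the labels, and the function names are all hypothetical.

```python
from math import dist


def prototype(vectors):
    """Average the feature vectors of one class into a single prototype."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(features, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: dist(features, prototypes[label]))


# Hypothetical membership features for a handful of labelled texts.
members = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]
non_members = [[0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]

prototypes = {
    "member": prototype(members),
    "non-member": prototype(non_members),
}

print(classify([0.85, 0.75], prototypes))
print(classify([0.15, 0.25], prototypes))
```

Because each class is summarized by a single averaged vector, a rule of this kind has very few free parameters, which is one plausible reason a prototype approach suits the limited-sample setting the abstract describes.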
Article 236
Title@2025-06-24 (2): Statistical Multicriteria Evaluation of LLM-Generated Text
Title: Statistical Multicriteria Evaluation of LLM-Generated Text | Statistische multikriterielle Bewertung von LLM-generiertem Text | LLM生成文本的统计多准则评估 2506.18082v2 |
Authors (5): Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Aßenmacher, Christoph Jansen
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
评估LLM生成文本的质量仍然是自然语言处理中的一项根本挑战。目前的评价方法往往依赖孤立的指标或简单的汇总方式,无法反映连贯性、多样性、流畅性以及其他相关文本质量指标之间的细微权衡。在这项工作中,我们采用了最近提出的基于广义随机占优(Generalized Stochastic Dominance,GSD)的统计推断框架,该框架解决了现有基准评测方法的三个关键局限:单一指标评价的不足、基数型自动指标与序数型人工判断之间的不兼容,以及缺乏推断性统计保证。GSD-front方法能够在尊重不同测量尺度的同时,对多个质量维度进行同步评价,它建立在解码策略的偏序之上,从而避免对所涉指标进行任意加权。通过应用这一框架将常见解码策略与人类撰写的文本进行比较评价,我们证明它能够识别具有统计显著性的性能差异,同时考虑到抽样设计可能偏离i.i.d.假设的情况。
Article 237
Title@2025-06-24 (2): Measuring and Guiding Monosemanticity
Title: Measuring and Guiding Monosemanticity | Messen und Steuern von Monosemantizität | 衡量与引导单义性 2506.19382v1 |
Authors (7): Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
人们越来越关注利用机制可解释性和可控性来更好地理解和影响大型语言模型(LLM)的内部动态。然而,现有方法在可靠地定位和操控特征表示方面面临根本性挑战。稀疏自编码器(SAE)近来成为大规模特征提取的一个有前景的方向,但它们同样受限于特征隔离不完全和单义性不可靠。为了系统地量化这些局限,我们提出特征单义性得分(FMS),这是一种量化潜在表示中特征单义性的新指标。基于这些认识,我们提出引导式稀疏自编码器(G-SAE),该方法在训练过程中以带标签的概念作为条件来约束潜在表示。我们证明,在潜在空间中对目标概念进行可靠的定位和解耦,可以提升可解释性、行为检测和控制能力。具体而言,我们在毒性检测、写作风格识别和隐私属性识别上的评估表明,G-SAE不仅增强了单义性,还能以更小的质量损失实现更有效、更细粒度的引导。我们的研究结果为衡量和推进LLM的机制可解释性与控制提供了可操作的指南。
Article 238
Title@2025-06-24 (2): ReDit: Reward Dithering for Improved LLM Policy Optimization
Title: ReDit: Reward Dithering for Improved LLM Policy Optimization | ReDit: Reward Dithering für verbesserte LLM-Policy-Optimierung | ReDit:面向改进LLM策略优化的奖励抖动 2506.18631v2 |
Authors (6): Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it is a “perfect” reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps and, when trained for a similar duration, still exhibits a 4% performance improvement over vanilla GRPO. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
DeepSeek-R1通过其基于规则的奖励系统,成功提升了大型语言模型(LLM)的推理能力。虽然这是一个能有效缓解奖励作弊(reward hacking)的“完美”奖励系统,但这类奖励函数往往是离散的。我们的实验观察表明,离散奖励可能导致梯度异常、优化不稳定和收敛缓慢。为了解决这个问题,我们提出ReDit(Reward Dithering,奖励抖动),该方法通过添加简单的随机噪声对离散奖励信号进行抖动。借助这种受扰动的奖励,整个学习过程中可以持续获得探索性梯度,从而使梯度更新更平滑并加速收敛。注入的噪声还为平坦的奖励区域引入了随机性,鼓励模型探索新的策略并跳出局部最优。在多种任务上的实验证明了ReDit的有效性和效率。平均而言,ReDit仅用约10%的训练步数即可达到与原始GRPO相当的性能;在训练时长相近时,仍比原始GRPO高出4%。可视化结果证实ReDit显著缓解了梯度问题。此外,我们还提供了理论分析以进一步验证这些优势。
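The effect of dithering on a flat reward region, as described in the ReDit abstract, can be illustrated with a minimal Python sketch. It assumes zero-mean Gaussian noise and a GRPO-style group-standardized advantage; the function names, the noise scale `sigma`, and the example rewards are illustrative assumptions, not the authors' implementation.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dither(rewards, sigma=0.05, seed=0):
    """Add zero-mean Gaussian noise to each discrete reward.

    Gaussian noise and sigma=0.05 are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    return [r + rng.gauss(0.0, sigma) for r in rewards]


# A group in which every completion received the same rule-based reward.
flat_group = [0.0, 0.0, 0.0, 0.0]

plain = group_advantages(flat_group)             # identical rewards: no signal
dithered = group_advantages(dither(flat_group))  # noise breaks the tie

print(plain)
print(dithered)
```

With identical discrete rewards, every standardized advantage is zero and the group contributes no gradient. After dithering the advantages are non-zero but still centered, which is one way to picture the "exploratory gradients" the abstract refers to.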
Article 239
Title@2025-06-24 (2): SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
Title: SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents | SpokenWOZ: Ein großformatiger Sprach-Text-Benchmark für gesprochene Task-Orientierte Dialog-Agenten | SpokenWOZ:面向任务型口语对话智能体的大规模语音-文本基准 2305.13040v7 |
Authors (10): Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li
Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges in spoken conversation. To tackle these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that current models still have substantial room for improvement in spoken conversation: the most advanced dialogue state tracker achieves only 25.65% joint goal accuracy, and the SOTA end-to-end model correctly completes the user request in only 52.1% of dialogues. The dataset, code, and leaderboard are available at https://spokenwoz.github.io/.
近年来,任务型对话(TOD)模型取得了显著进展。然而,以往的研究主要关注由标注者书写的数据集,这导致学术研究与真实口语对话场景之间存在差距。虽然已有若干小规模口语TOD数据集被提出,用于应对ASR错误等鲁棒性问题,但它们忽视了口语对话中特有的挑战。为克服这些局限,我们提出SpokenWOZ,一个面向口语TOD的大规模语音-文本数据集,包含8个领域、203k个话轮、5.7k段对话以及249小时的人与人口语对话音频。SpokenWOZ还纳入了逐词处理和口语推理等常见的口语特征。基于这些特征,我们提出跨轮槽位检测和推理槽位检测这两项新挑战。我们在多种基线上进行了实验,包括文本模态模型、新提出的双模态模型以及LLM(如ChatGPT)。结果表明,现有模型在口语对话上仍有很大的提升空间:最先进的对话状态跟踪器的联合目标准确率仅为25.65%,而SOTA端到端模型仅在52.1%的对话中正确完成了用户请求。数据集、代码和排行榜见:https://spokenwoz.github.io/。
Article 240
Title@2025-06-24 (2): Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
Title: Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation | Out-of-Character-Verhalten erkennen: Bewertung der Persona-Treue auf atomarer Ebene in der offenen Textgenerierung | 识别角色外行为:开放式生成中角色保真度的原子级评估 2506.19352v1 |
Authors (5): Jisu Shin, Juhyun Oh, Eunsu Kim, Hoyun Song, Alice Oh
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
大型语言模型(LLMs)的忠诚性对于保持一致性和使人与AI互动至关重要,然而,LLMs往往展示出远离Character(OOC)的行为,产生的反应与指定的人不同,导致影响模型可靠性的不一致。现有的评价方法通常为整个反应分配单分,努力捕捉微妙的人与人之间的不和,特别是在长式文本生成过程中。为解决这一限制,我们提议了一个原子级评价框架,在细微颗粒上量化人与人之间的忠诚性。我们的三个关键指标衡量人与世世代代之间的协调一致程度。我们的方法通过查明实际使用者会遇到的微妙的偏差,使得能够更准确、更现实地评估人与人之间的忠诚性。我们通过实验,证明我们的框架有效检测了先前方法所忽略的人与人之间的不相容性。我们通过分析不同任务和个性类型之间的忠诚性,我们揭示任务结构和人情可取性如何影响模型的适应性,突出在保持一致性表达方面的挑战。
Article 241
Title@2025-06-24 (2): In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
Title: In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly | In-Context Occam’s Razor: Wie Transformer spontan einfachere Hypothesen bevorzugen | 上下文中的奥卡姆剃刀:Transformer如何即时偏好更简单的假设 2506.19351v1 |
Authors (4): Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical language models encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters, even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam’s razor by balancing model fit against complexity penalties. We further ablate the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam’s razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as a case study, suggesting it may be inherent to transformers trained on diverse task distributions.
上下文学习(ICL)使Transformer无需更新参数即可通过上下文示例适应新任务。现有研究通常在复杂度固定的环境中研究ICL,而实际的语言模型会遇到复杂度各异的任务。本文研究Transformer如何处理层级化的任务结构,其中较高复杂度的类别能够完美表示较简单类别所产生的任何模式。我们基于马尔可夫链和线性回归设计了控制良好的测试平台,结果表明,Transformer不仅能识别每项任务的合适复杂度,还能准确推断相应的参数,即使上下文示例与多种复杂度假设都相容。值得注意的是,当面对由较简单过程生成的数据时,Transformer始终偏好复杂度最低的充分解释。我们通过贝叶斯框架从理论上解释了这一行为,表明Transformer通过在模型拟合与复杂度惩罚之间取得平衡,有效地实现了上下文中的贝叶斯奥卡姆剃刀。我们还对模型规模、训练混合分布、推理上下文长度和架构的作用进行了消融实验。最后,我们以布尔函数任务为案例,在预训练的GPT-4模型上验证了这种类似奥卡姆剃刀的归纳偏置,这表明它可能是在多样化任务分布上训练的Transformer所固有的。
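The Bayesian Occam's razor that the abstract invokes can be shown on a toy problem in Python. This is only an analogy under stated assumptions: instead of the paper's Markov-chain and regression testbeds, it compares a parameter-free fair-coin hypothesis with a biased-coin hypothesis under a uniform prior, so the complex hypothesis can represent anything the simple one generates. All names and data are illustrative.

```python
from math import comb, log


def log_evidence_fair(seq):
    """Simple hypothesis: a fair coin with no free parameters."""
    return len(seq) * log(0.5)


def log_evidence_biased(seq):
    """Complex hypothesis: unknown bias p with a uniform prior on [0, 1].

    Integrating p**k * (1 - p)**(n - k) over p gives 1 / ((n + 1) * C(n, k)).
    """
    n, k = len(seq), sum(seq)
    return -log((n + 1) * comb(n, k))


def preferred(seq):
    """Pick the hypothesis with the larger marginal likelihood."""
    if log_evidence_fair(seq) > log_evidence_biased(seq):
        return "fair"
    return "biased"


balanced = [0, 1] * 10        # data a fair coin explains well
skewed = [1] * 18 + [0] * 2   # data that needs the extra parameter

print(preferred(balanced), preferred(skewed))
```

The marginal likelihood penalizes the biased-coin hypothesis for spreading its prior mass over many possible biases. It therefore wins only when the data are skewed enough to justify the extra parameter, which is the fit-versus-complexity balance the abstract describes.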
Article 242
Title@2025-06-24 (2): Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Title: Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations | Analyse der sprachübergreifenden Wahrnehmung von Wissensgrenzen in LLMs durch die Linse interner Repräsentationen | 通过内部表征的视角分析LLM跨语言的知识边界认知 2504.13816v3 |
Authors (7): Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
虽然了解LLMS的知识界限对于防止幻觉至关重要,但关于LLMS知识界限的研究主要侧重于英语,在这项工作中,我们提出第一份研究报告,分析LLMS如何在以多种语言处理已知和未知问题时进行内部陈述,从而认识不同语言的知识界限。我们的实证研究揭示了三个主要结论:(1) LLMs对知识界限的看法是在不同语言的中上层和中上层进行编码的。(2) 知识边界概念的语文差别遵循线性结构,这促使我们提议采用一种不培训的统一方法,在各种语言之间有效转让知识边界感知能力,从而帮助减少低资源语言中的幻觉风险;(3) 双语问题对口翻译的微调进一步提高LLMs对不同语言知识界限的认识。鉴于没有跨语言知识边界分析的标准测试台,我们建立了一个由三种具有代表性的知识边界数据组成的多语种评价套件。我们的代码和数据集可公开查阅https://github.com/DAM-NLP-SG/LLLM-M-Multiloadledge-kledgege-Boundarys。
Article 243
Title@2025-06-24 (2): RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
Title: RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning | RAG+: Verbesserung der Retrieval-Augmented Generation durch anwendungsbewusstes Schlussfolgern | RAG+:以应用感知推理增强检索增强生成 2506.11555v2 |
Authors (9): Yu Wang, Shiwan Zhao, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ming Fan, Ting Liu
The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
通过检索增强的一代人(RAG)整合外部知识已成为加强大型语言模型(LLMs)用于知识密集型任务的基础,然而,现有的RAG模式往往忽视应用知识的认知步骤,从而在检索到的事实和具体任务推理之间留下差距。在这项工作中,我们引入了RAG+,这是一个原则性和模块扩展,明确将应用程序认知推理纳入RAG管道。RAG+构建了一个由知识和统一应用实例组成的双轨组合,由人工生成或自动生成,并在推断过程中共同检索。这一设计使LLAMs不仅能够获取相关信息,而且能够在结构化、面向目标的推理过程中应用这些信息。在多个模型上进行的跨数学、法律和医学领域的实验表明,RAG+始终超越了标准RAG变体,实现了平均3-5 %的改进,在复杂情景下最高收益达7.5%。通过将检索与可操作的应用连接起来,RAG+推进一个更基于认知的知识整合框架,这代表了向更可解释性和更能的LMSMs的一步。
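The dual-corpus idea in the RAG+ abstract, where each knowledge item is paired with an aligned application example and both are retrieved together, can be sketched in Python. The token-overlap retriever, the corpus entries, and the prompt layout below are assumptions made for illustration; the abstract does not specify the retriever or the prompt format.

```python
def tokens(text):
    """Lower-cased whitespace tokens; a stand-in for a real retriever's features."""
    return set(text.lower().split())


# Hypothetical dual corpus: every knowledge item has an aligned application example.
DUAL_CORPUS = [
    {
        "knowledge": "The area of a circle is pi times the radius squared.",
        "application": "Radius 2 gives area pi * 2**2 = 4*pi.",
    },
    {
        "knowledge": "The perimeter of a rectangle is twice the sum of its side lengths.",
        "application": "Sides 3 and 5 give perimeter 2 * (3 + 5) = 16.",
    },
]


def retrieve(query, corpus):
    """Return the entry whose knowledge text shares the most tokens with the query."""
    query_tokens = tokens(query)
    return max(corpus, key=lambda entry: len(query_tokens & tokens(entry["knowledge"])))


def build_prompt(query):
    """Place the knowledge and its paired application example before the question."""
    hit = retrieve(query, DUAL_CORPUS)
    return (
        f"Knowledge: {hit['knowledge']}\n"
        f"Application example: {hit['application']}\n"
        f"Question: {query}"
    )


prompt = build_prompt("What is the area of a circle with radius 3?")
print(prompt)
```

Retrieving the pair as one unit means the model sees how the fact is used as well as the fact itself, which is the gap between retrieved facts and task-specific reasoning that the abstract targets.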
Article 244
Title@2025-06-24 (2): FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Title: FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression | FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression | FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换 2505.23966v2 |
Authors (6): Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast, accurate, and training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ an importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, and its calibration can be completed within a few minutes. Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
大型语言模型(LLM)推动了自然语言处理的显著进展,但其高昂的计算和内存需求给资源受限环境中的部署带来了挑战。尽管近期的低秩分解方法为结构化压缩提供了一条有前景的路径,但它们往往存在精度下降、校准过程昂贵的问题,并会产生低效的模型架构,阻碍实际推理加速。本文提出FLAT-LLM,一种快速、准确且无需训练的结构化压缩方法,其基础是激活空间中的细粒度低秩变换。具体而言,我们利用按注意力头进行主成分分析(PCA)得到的截断特征向量对权重进行变换,从而降低隐藏维度,并采用基于重要性的度量在各解码器之间自适应地分配秩。FLAT-LLM无需恢复性微调即可实现高效且有效的权重压缩,校准可在几分钟内完成。在4个模型和11个数据集上的评估表明,FLAT-LLM在泛化能力和下游性能上优于结构化剪枝基线,同时相比基于分解的方法实现了推理加速。
Article 245
Title@2025-06-24 (2): JCAPT: A Joint Modeling Approach for CAPT
Title: JCAPT: A Joint Modeling Approach for CAPT | JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT | JCAPT:CAPT的联合建模方法 2506.19315v1 |
Authors (3): Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen
Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.
有效的发音反馈在第二语言(L2)学习中至关重要,为此,计算机辅助发音训练(CAPT)系统通常包含两项关键任务:自动发音评估(APA)以及发音错误检测与诊断(MDD)。近期工作表明,对这两项任务进行联合建模可以相互促进。我们的统一框架利用选择性状态空间模型(SSM)Mamba,同时融合音系特征和思考词元(think token)策略,以共同增强APA和MDD中的可解释性和细粒度时序推理。据我们所知,这是首个在CAPT中结合音系归因、基于SSM的建模和提示的研究。在speechocean762基准上进行的一系列实验表明,我们的模型始终优于已有方法,在MDD任务上尤为明显。
Article 246
Title@2025-06-24 (2): Long-Context Generalization with Sparse Attention
Title: Long-Context Generalization with Sparse Attention | Langkontext-Generalisierung mit Sparse Attention | 基于稀疏注意力的长上下文泛化 2506.16640v2 |
Authors (3): Pavlo Vasylenko, Marcos Treviso, André F. T. Martins
Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved through a careful design of position encodings, which impacts both dense and sparse attention methods. By integrating ASEntmax into standard transformer layers alongside proper positional encodings, we show that our models greatly outperform softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines on long-context generalization.
基于Transformer的架构传统上使用softmax来计算注意力权重,这会在序列的所有词元上产生稠密分布。这种稠密性虽然在许多场景下有效,但已被证明不利于需要精确聚焦于固定大小模式的任务:随着序列长度增加,无信息量的词元会累积注意力概率质量,导致注意力分散和表征坍缩。本文表明,使用$\alpha$-entmax的稀疏注意力机制能够为无关词元分配精确的零权重,从而避免这些问题。此外,我们提出自适应可扩展Entmax(ASEntmax),为$\alpha$-entmax赋予可学习的温度参数,使注意力分布能够在稀疏(聚焦模式)与稠密(类softmax)两种状态之间插值。最后,我们表明,通过精心设计位置编码,可以进一步提升定位和泛化固定大小模式的能力,这对稠密和稀疏注意力方法都有影响。通过将ASEntmax与合适的位置编码一起集成到标准Transformer层中,我们的模型在长上下文泛化上大幅优于softmax、可扩展softmax以及固定温度的$\alpha$-entmax基线。
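The contrast between dense softmax attention and exact-zero sparse attention can be made concrete with a short Python sketch. It uses sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax, as a stand-in; the learnable temperature of ASEntmax is not modelled, and the scores and function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Dense attention weights: every token gets strictly positive mass."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection onto the simplex (alpha-entmax with alpha = 2)."""
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        if 1 + k * value > running_sum:  # the k-th largest score is in the support
            tau = (running_sum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


def irrelevant_mass(attention, n_irrelevant):
    """Attention mass on low-scoring tokens as the sequence grows."""
    scores = [2.0, 1.5] + [-1.0] * n_irrelevant
    return sum(attention(scores)[2:])


for n in (2, 20, 200):
    print(n, round(irrelevant_mass(softmax, n), 3), irrelevant_mass(sparsemax, n))
```

As more low-scoring tokens are appended, the softmax mass on them keeps growing, while sparsemax keeps them at exactly zero. That is the dispersion effect and its remedy as described in the abstract.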
Article 247
Title@2025-06-24 (2): Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
Title: Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs | Skywork-SWE: Enthüllen von Datenskalierungsgesetzen für Software-Engineering in LLMs | Skywork-SWE:揭示LLM软件工程中的数据缩放规律 2506.19290v1 |
Authors (11): Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model’s performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
软件工程(SWE)近来已成为下一代LLM智能体的关键测试平台,要求其在两个关键维度上具备内在能力:持续的迭代式问题求解(例如超过50轮交互)和长上下文依赖解析(例如超过32k词元)。然而,SWE中的数据整理过程耗时之长是出了名的,因为它严重依赖人工标注来筛选代码文件,并需要搭建专用的运行时环境来执行和验证单元测试。因此,现有的大多数数据集仅限于几千个源自GitHub的实例。为此,我们提出一个增量式的自动化数据整理流水线,系统性地扩大SWE数据集的规模和多样性。我们的数据集包含来自2,531个不同GitHub仓库的10,169个真实Python任务实例,每个实例都附有以自然语言描述的任务,以及用于自动化单元测试验证的专用运行时环境镜像。我们从所提出的SWE数据集中精心整理了8,000多条通过运行时验证的训练轨迹。在这些轨迹上微调Skywork-SWE模型时,我们发现了一个显著的数据缩放现象:随着数据规模的增加,训练后模型的软件工程能力持续提升,没有出现饱和迹象。值得注意的是,我们的Skywork-SWE模型在不使用验证器或多次采样的情况下,在SWE-bench Verified基准上达到了38.0%的pass@1准确率,在基于OpenHands智能体框架构建的Qwen2.5-Coder-32B系LLM中创下了新的最先进水平(SOTA)。此外,结合测试时缩放技术后,性能进一步提升至47.0%的准确率,超过了此前参数量低于32B的模型的SOTA结果。我们发布了Skywork-SWE-32B模型检查点,以加速未来的研究。
Article 248
Title@2025-06-24 (2): Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks
Title: Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks | Bewertung transparenten Schlussfolgerns in großen Sprachmodellen für rechenschaftspflichtige kritische Aufgaben | 面向可问责关键任务的大型语言模型透明推理评估 2408.01933v5 |
Authors (4): Junhao Chen, Bowen Wang, Jiuyang Chang, Yuta Nakashima
This paper introduces REACT, a benchmark designed to rigorously evaluate the reasoning capabilities of large language models (LLMs) within accountable, high-stakes decision-making tasks in medical and legal domains. Unlike traditional benchmarks primarily focused on prediction accuracy, REACT emphasizes transparent and interpretable reasoning, requiring models to align their logic closely with expert-derived procedures. To assess whether LLM reasoning aligns closely with human experts, we annotated 511 clinical cases from the medical domain and 86 legal cases from the legal domain, each enriched with detailed expert-extracted rationales and evidence supporting each step of the reasoning process. These annotations were guided by carefully constructed reasoning graphs, which explicitly encode domain-specific inference structures and decision criteria derived by domain experts. These reasoning graphs serve not only as standards for expert annotation but also as structured guidelines enabling models to reason transparently and step-by-step. To address the scalability challenges of manual annotation, we further developed a semi-automatic annotation pipeline leveraging expert-defined reasoning graph templates to efficiently generate new graphs, exploring the potential to extend our approach into additional critical domains. Experimental results demonstrate that reasoning graphs substantially enhance the interpretability and accuracy of LLM reasoning compared to traditional baselines, although significant gaps remain relative to expert-level reasoning performance.
本文提出REACT,一个旨在严格评估大型语言模型(LLM)在医疗和法律领域可问责的高风险决策任务中推理能力的基准。与主要关注预测准确率的传统基准不同,REACT强调透明且可解释的推理,要求模型的逻辑与专家制定的流程紧密一致。为了评估LLM的推理是否与人类专家紧密一致,我们标注了511个医疗领域的临床病例和86个法律领域的案例,每个案例都附有专家提取的详细理由和支持推理过程每一步的证据。这些标注以精心构建的推理图为指导,推理图明确编码了由领域专家给出的领域特定推理结构和决策标准。这些推理图不仅是专家标注的标准,也是使模型能够透明地逐步推理的结构化指南。为应对人工标注的可扩展性挑战,我们进一步开发了一个半自动标注流水线,利用专家定义的推理图模板高效生成新的推理图,并探索将我们的方法扩展到更多关键领域的潜力。实验结果表明,与传统基线相比,推理图显著提升了LLM推理的可解释性和准确性,但与专家级的推理表现相比仍有明显差距。
Article 249
Title@2025-06-24 (2): Disentangling Reasoning and Knowledge in Medical Large Language Models
Title: Disentangling Reasoning and Knowledge in Medical Large Language Models | Entwirren von Vernunft und Wissen in medizinischen großen Sprachmodellen | 在医疗大语言模式中分离理由和知识 2505.11462v2 |
Authors (14): Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Medical reasoning in large language models (LLMs) aims to emulate clinicians’ diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1 scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
大型语言模型(LLM)中的医学推理旨在模拟临床医生的诊断思维,但MedQA-USMLE、MedMCQA和PubMedQA等现有基准往往将推理与事实回忆混在一起。为此,我们使用一个准确率达81%(与人类表现相当)的PubMedBERT分类器,将11个生物医学问答基准划分为侧重推理和侧重知识的子集。分析表明,只有32.8%的问题需要复杂推理。我们评估了生物医学模型(HuatuoGPT-o1、MedReason、m1)和通用领域模型(DeepSeek-R1、o4-mini、Qwen3),发现知识表现与推理表现之间存在一致的差距。例如,HuatuoGPT-o1在知识上得分56.9,而在推理上仅为44.8。在用错误的初始推理误导模型的对抗性测试中,生物医学模型的表现急剧下降,而规模更大或经过RL训练的通用模型则更为稳健。为解决这一问题,我们在推理密集的样本上通过微调和强化学习训练了BioMed-R1,它在同等规模的模型中取得了最强的表现。引入临床病例报告并在对抗和回溯场景下进行训练,可能带来进一步的提升。
Article 250
Title@2025-06-24 (2): EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition
Title: EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition | EmoStage: Ein Rahmen für eine präzise Empathetic Response Generation über Perspective-Taking und Phasenerkennung | EmoStage:通过换位思考和阶段识别生成准确共情回应的框架 2506.19279v1 |
Authors (4): Zhiyang Qi, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
The rising demand for mental health care has fueled interest in AI-driven counseling systems. While large language models (LLMs) offer significant potential, current approaches face challenges, including limited understanding of clients’ psychological states and counseling stages, reliance on high-quality training data, and privacy concerns associated with commercial deployment. To address these issues, we propose EmoStage, a framework that enhances empathetic response generation by leveraging the inference capabilities of open-source LLMs without additional training data. Our framework introduces perspective-taking to infer clients’ psychological states and support needs, enabling the generation of emotionally resonant responses. In addition, phase recognition is incorporated to ensure alignment with the counseling process and to prevent contextually inappropriate or inopportune responses. Experiments conducted in both Japanese and Chinese counseling settings demonstrate that EmoStage improves the quality of responses generated by base models and performs competitively with data-driven methods.
虽然大型语言模式(LLMs)提供了巨大的潜力,但目前的方法也面临挑战,包括对客户心理状态和咨询阶段的了解有限,依赖高质量的培训数据,以及与商业部署有关的隐私问题。为了解决这些问题,我们提议EmoStage,这是一个通过利用开放源LMs的推论能力来增加同情反应的框架,而没有额外的培训数据。我们的框架引入了对推断客户心理状态和支助需要的视角,从而能够产生情感共振反应。此外,还纳入了阶段识别,以确保与咨询进程保持一致,并防止因地制宜或不合适的反应。在日本和中国咨询环境中进行的实验表明,EmoStage提高了基础模型所产生反应的质量,并以数据驱动的方法进行竞争。
Article 251
Title@2025-06-24 (2): Process Reward Models That Think
Title: Process Reward Models That Think | Prozess Belohnung Modelle, die denken | 思考的流程奖励模式 2504.16828v3 |
Authors (8): Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Step-by-step verifiers – also known as process reward models (PRMs) – are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers – using only 1% of the process labels in PRM800K – across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ‘24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
逐步验证器(也称为过程奖励模型,PRM)是测试时扩展的一个关键要素。PRM需要步骤级监督,因此训练成本高昂。这项工作的目的是将数据高效的PRM构建为言语化的逐步奖励模型,通过生成验证思维链(CoT)来核查解答中的每一个步骤。我们提出ThinkPRM,这是一个长CoT验证器,其微调所用的过程标签比判别式PRM所需的少几个数量级。我们的方法利用了长CoT模型固有的推理能力,仅使用PRM800K中1%的过程标签,就在多个具有挑战性的基准上超过了LLM-as-a-Judge和判别式验证器。具体地说,在best-of-N选择和奖励引导搜索下,ThinkPRM在ProcessBench、MATH-500和AIME '24上超越了基线。
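The best-of-N selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `verify_step` callable that returns a correctness score in [0, 1] for one step, and it aggregates step scores with the minimum; neither the verifier interface nor the aggregation rule is taken from the paper.

```python
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; here the weakest step decides (an assumption)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates: List[List[str]],
              verify_step: Callable[[List[str], int], float]) -> int:
    """Return the index of the candidate solution with the highest aggregate score."""
    scores = []
    for steps in candidates:
        per_step = [verify_step(steps, i) for i in range(len(steps))]
        scores.append(solution_score(per_step))
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_verifier(steps: List[str], i: int) -> float:
    """Hypothetical stand-in for a PRM: flags any step containing '??'."""
    return 0.1 if "??" in steps[i] else 0.9


candidates = [
    ["2 + 2 = 4", "4 * 3 = ?? 13"],  # second step is flagged
    ["2 + 2 = 4", "4 * 3 = 12"],     # every step passes
]
best = best_of_n(candidates, toy_verifier)
```

In a real setting the toy verifier would be replaced by a model call that scores each step; the selection logic stays the same.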
Article 252
Title@2025-06-24 (2): Personality Prediction from Life Stories using Language Models
Title: Personality Prediction from Life Stories using Language Models | Persönlichkeitsvorhersage aus Lebensgeschichten mit Sprachmodellen | 使用语言模型对生活故事的个性预测 2506.19258v1 |
Authors (5): Rasiq Hussain, Jerry Ma, Rithik Khandelwal, Joshua Oltmanns, Mehak Gupta
Natural Language Processing (NLP) offers new avenues for personality assessment by leveraging rich, open-ended text, moving beyond traditional questionnaires. In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability. This hybrid method combines the strengths of pretrained transformers and sequence modeling to handle long-context data. Through ablation studies and comparisons with state-of-the-art long-context models such as LLaMA and Longformer, we demonstrate improvements in prediction accuracy, efficiency, and interpretability. Our results highlight the potential of combining language-based features with long-context modeling to advance personality assessment from life narratives.
自然语言处理(NLP)通过利用丰富的、开放的文本,超越传统的问卷,为个性评估提供了新的途径。在本研究中,我们探讨了在每次超过2000个符号的长话访谈中进行长话访谈的模型化挑战,以便预测五因素模型(FFM)个性特征。我们提出了一个两步方法:首先,我们利用预先培训的语言模型的滑动-窗口微调提取背景嵌入;然后,我们运用经常性神经网络(RNN),并采用关注机制,整合远程依赖性,提高可解释性。这种混合方法有效地连接了预先训练的变异器和序列模型化的长话数据处理的优势。通过对比研究和比较,我们展示了预测准确性、效率和可解释性等最新长话模型的改进。我们的成果突出了将语言特征与长文字模型相结合,从生命描述中提前进行个性评估的潜力。
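The sliding-window step can be sketched as plain chunking of a long token sequence into overlapping windows. The window size of 512 and stride of 256 are illustrative assumptions, not values reported in the abstract.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int = 512,
                    stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows that cover every token."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += stride
    return chunks


# A stand-in for one tokenized interview of 2000 tokens.
interview = list(range(2000))
chunks = sliding_windows(interview)
```

Each window would then be embedded by the pretrained model, and the resulting sequence of window embeddings passed to the recurrent layer.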
Article 253
Title@2025-06-24 (2): MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Title: MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models | MSR-Align: Policy-Grounded Multimodal Alignment für sicherheitsbewusste Begründung in Vision-Language-Modellen | MSR-对称:在愿景-语言模型中安全意识理由的政策-全方位多式联运协调 2506.19257v1 |
Authors (6): Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
视觉-语言模型(VLMS)通过强化思维链能力,在多式联运推理任务方面取得了显著进展;然而,这一进展也带来了新的安全风险,因为这些模型越来越容易受到有害的多式联运催化行为的影响,从而引发不道德或不安全的行为; 现有的安全调整方法,主要为单一方式语言模型设计,在应对多式联运投入带来的复杂和细化威胁方面不足; 此外,目前的安全数据集缺乏强有力地调整具有理性能力的VLMS所需的精细、政策依据的推理。 在这项工作中,我们引入了{MSR-Allign},一个高质量的多模式安全解释数据集,以弥补这一差距。 MSR-Allign支持关于标准化安全政策的精细、细化的思考推理,涵盖各种愿景和文本模式; 我们的数据生成管道强调多式联运的多样性、基于政策的推理,以及利用强大的多式联运法官严格地过滤质量。 广泛的实验表明,微调的VLMMM-ALMS-ALMS-All 大幅改进了应对文本和愿景-日历攻击的稳健健性,同时维护或加强公众推理学基础。
Article 254
Title@2025-06-24 (2): Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
Title: Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track | Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen | 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v1 |
Authors (15): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity’s understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated “Refutations and Critiques” (R & C) Track. This R & C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
在机器学习(ML)研究中,快速进展导致出版物爆炸,但也导致误导、不正确、有缺陷或甚至可能是欺诈性的研究在ML会议上被接受,有时由于同行审评的失败而被强调。虽然这些错误是可以理解的,但ML会议并没有提供强有力的进程,帮助实地在出现这种错误时系统地纠正这些错误。本立场文件认为ML会议应该建立一个专门的“反驳和批评(R & C)轨道 ” 。这一R & C轨道将提供一个高知名度的、声誉良好的平台,支持对先前研究提出严峻挑战的重要研究,从而形成动态自我纠正研究生态系统。我们讨论了关键考虑因素,包括跟踪设计、审查原则、潜在陷阱,并就最近的ICLR 2025口头发言提供了说明性实例。我们的结论是,ML会议应该建立正式的、声誉良好的机制,以帮助ML研究自我纠正。
Article 255
Title@2025-06-24 (2): Augmenting Multi-Agent Communication with State Delta Trajectory
Title: Augmenting Multi-Agent Communication with State Delta Trajectory | Augmenting Multi-Agent Kommunikation mit State Delta Trajektorie | 增强与国家三角三角洲轨迹的多代理通信 2506.19209v1 |
Authors (7): Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai
Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must downsample its continuous state vectors to discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token-wise state transition trajectories from one agent to another. In particular, we find that the sequence of state changes in an LLM after generating each token reflects the information hidden behind the inference process better than the actual state values do, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems.
多剂技术,例如角色扮演或多转式辩论,已证明在提高大型语言模型(LLMs)在下游任务方面的效绩方面是有效的。尽管现有LLM多剂系统在工作流程上存在差异,但现有LLM多剂系统大多使用自然语言来进行代理通信。这需要简单易懂,但也带来了不可避免的信息损失,因为一个模型必须先从连续状态矢量取样到具体标记,然后再将其转移到另一个模型。当传输的信息不是简单事实,而是逻辑或抽象思维时,这种损失就特别大。为了解决这一问题,我们提议一项新的通信协议,将自然语言符号和象征性状态转换轨迹从一个代理器转移到另一个代理器。特别是,与实际的状态价值相比,我们发现LLMs在产生每个符号之后的状态变化顺序能够更好地反映推断过程背后隐藏的信息,因此我们建议一个国家三角洲电解(SDE)方法来代表国家过渡轨迹。实验结果表明,与基于SDETA的通信协议相比,实现SOTA性工作与其他通信协议的绩效,特别是在涉及复杂推理学的任务中。这显示了SBRALM的通信系统的潜在。
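The idea of encoding state changes rather than state values can be illustrated with a minimal sketch: given one hidden-state vector per generation step, keep the differences between consecutive steps. The function name and the toy vectors are assumptions for illustration; the abstract does not specify which layers or states SDE actually uses, or how the deltas are consumed by the receiving agent.

```python
from typing import List

Vector = List[float]


def state_deltas(states: List[Vector]) -> List[Vector]:
    """Return the element-wise change between each pair of consecutive states."""
    return [
        [cur_x - prev_x for prev_x, cur_x in zip(prev, cur)]
        for prev, cur in zip(states, states[1:])
    ]


# Toy trajectory: the state before generation and after each of two tokens.
states = [[0.0, 1.0], [0.5, 1.0], [0.5, 3.0]]
deltas = state_deltas(states)
```

A trajectory of n states yields n - 1 deltas, one per generated token in this toy setup.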
Article 256
Title@2025-06-23 (1): Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition
Title: Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition | Bayesische evolutionäre Schwarmarchitektur: Ein formales epistemisches System, das im wahrheitsbasierten Wettbewerb verankert ist | Bayesian 进化型蜂轮结构:以基于真相的竞争为基础的正式流行系统 2506.19191v1 |
Authors (1): Craig Steven Wright
We introduce a mathematically rigorous framework for an artificial intelligence system composed of probabilistic agents evolving through structured competition and belief revision. The architecture, grounded in Bayesian inference, measure theory, and population dynamics, defines agent fitness as a function of alignment with a fixed external oracle representing ground truth. Agents compete in a discrete-time environment, adjusting posterior beliefs through observed outcomes, with higher-rated agents reproducing and lower-rated agents undergoing extinction. Ratings are updated via pairwise truth-aligned utility comparisons, and belief updates preserve measurable consistency and stochastic convergence. We introduce hash-based cryptographic identity commitments to ensure traceability, alongside causal inference operators using do-calculus. Formal theorems on convergence, robustness, and evolutionary stability are provided. The system establishes truth as an evolutionary attractor, demonstrating that verifiable knowledge arises from adversarial epistemic pressure within a computable, self-regulating swarm.
我们引入了一个数学严谨的人工情报系统框架,由通过结构性竞争和信仰修正而演变的概率因素组成。建筑基于巴伊西亚推论、测量理论和人口动态,将代理人的适合性定义为与代表地面真理的固定外部神器相匹配的功能。代理在离散的时环境中竞争,通过观察到的结果调整后视信仰,由高等级的代理人再生和低等级的代理人濒临灭绝。评级通过双向的符合事实的效用比较进行更新,信仰更新保持了可衡量的一致性和切合性。我们引入了基于散列的加密身份承诺,以确保可追踪性,与使用计算方法的因果推断操作者一起。提供了关于趋同、稳健和进化稳定性的正式论。该系统将真理确立为进化吸引者,表明可核实的知识产生于可比较、自我调节的温和的反向感应力压力。
Article 257
Title@2025-06-23 (1): Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages
Title: Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages | Prompt, Übersetzen, Fine-Tune, Re-Initialize, oder Instruction-Tune? Anpassung von LLMs für das In-Context-Lernen in Low-Resource-Sprachen | 迅速、翻译、微调、微调、重新启动或指示-图? 2506.19187v1 |
Authors (2): Christopher Toukmaji, Jeffrey Flanigan
LLMs are typically trained in high-resource languages, and tasks in lower-resourced languages tend to underperform the higher-resource language counterparts for in-context learning. Despite the large body of work on prompting settings, it is still unclear how LLMs should be adapted cross-lingually specifically for in-context learning in low-resource target languages. We perform a comprehensive study spanning five diverse target languages, three base LLMs, and seven downstream tasks, totaling over 4,100 GPU training hours (9,900+ TFLOPs), across various adaptation techniques: few-shot prompting, translate-test, fine-tuning, embedding re-initialization, and instruction fine-tuning. Our results show that the few-shot prompting and translate-test settings tend to heavily outperform the gradient-based adaptation methods. To better understand this discrepancy, we design a novel metric, Valid Output Recall (VOR), and analyze model outputs to empirically attribute the degradation of these trained models to catastrophic forgetting. To the best of our knowledge, this is the largest study done on in-context learning for low-resource languages with respect to training compute and the number of adaptation techniques considered. We make all our datasets and trained models available for public use.
LLMS通常是以高资源语言培训的,而低资源语言的任务往往低于高资源语言对应方的成绩,以便进行内流学习。尽管在快速设置方面做了大量工作,但仍然不清楚LLMS应如何进行跨语言的适应,具体地说,用于低资源目标语言的内流学习。我们开展了一项涵盖五种不同目标语言、三个基础LMS和七个下游任务的全面研究,涵盖范围超过4,100个GPU培训小时(9,900+TFLOPs),涉及各种适应技术:鲜有的提示性、翻译性测试、微调、嵌入重新初始化和教学微调。我们的成果显示,少数的提示性和翻译性测试环境往往大大超出基于梯度的适应方法。为了更好地理解这一差异,我们设计了一套新的标准、有效的输出回调(VOR),并分析模型产出,将这些经过训练的模式的退化以造成灾难性的遗忘。就我们的知识范围而言,这是对低资源语言的文字学习进行的最大研究,以培训现有和适应技术的数量。我们考虑过所有数据。
Article 258
Title@2025-06-23 (1): Transferring Features Across Language Models With Model Stitching
Title: Transferring Features Across Language Models With Model Stitching | Übertragung von Funktionen über Sprachmodelle mit Modellstich | 使用模型裁剪的跨语言模型传输功能 2506.06609v2 |
Authors (4): Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
In this work, we demonstrate that affine mappings between residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a savings in FLOPs. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
在这项工作中,我们证明,语言模型剩余流之间的拼图是有效转移模型之间特征的一种廉价方法。我们运用这一技术在不同大小模型之间转移Sprasse Autoencoders(SAEs)的重量,以比较其代表性。我们发现,小型和大型模型学习类似的代表空间,这鼓励对像SAEs这样昂贵的组件进行小型模型培训,并在FLOPs节省的资金中转换到更大的模型。特别是,在对大型模型进行SAE培训时,使用小到大转让的SAE可以带来50%的更便宜的培训。接下来,我们证明转让的探测器和向导能够有效地恢复地面真实性能。最后,我们深入地探索地发现,在特定的功能类别具有真实性的情况下,语义性和结构特征的转移明显不同,而特定的功能类别的作用也得到了忠实的映射。总体而言,我们的调查结果说明了小型和大型模型线性代表空间的相似性和差异,并展示了提高SAEs培训效率的方法。
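One way to read the affine mapping between residual streams is as a least-squares fit. The notation below is an assumption made for illustration and is not taken from the paper: $h^{A}_t$ and $h^{B}_t$ denote the residual-stream activations of the small and large model at the same token position $t$, with dimensions $d_A$ and $d_B$.

```latex
\hat{h}^{B}_t = W h^{A}_t + b,
\qquad W \in \mathbb{R}^{d_B \times d_A},\; b \in \mathbb{R}^{d_B},
\qquad
(W^{*}, b^{*}) = \arg\min_{W,\, b} \sum_{t} \bigl\| h^{B}_t - \bigl(W h^{A}_t + b\bigr) \bigr\|_2^{2}
```

Under this reading, a component trained on model A's activations could be applied to model B by composing it with the fitted map; how the paper actually fits the map and transfers SAE weights is not stated in the abstract.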
Article 259
Title@2025-06-23 (1): Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
Title: Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data | Verbesserter Hybrid-Transducer und Aufmerksamkeits-Encoder-Decoder mit Textdaten | 带有文本数据的增强混合转换器和注意编码器解码器 2506.19159v1 |
Authors (3): Yun Tang, Eesung Kim, Vijendra Raj Apsingekar
A joint speech and text optimization method is proposed for hybrid transducer and attention-based encoder decoder (TAED) modeling to leverage large text corpora and enhance ASR accuracy. The joint TAED (J-TAED) is trained with both speech and text input modalities together, while it takes only speech data as input during inference. The trained model can unify the internal representations from different modalities and be further extended to text-based domain adaptation. It can effectively alleviate data scarcity for mismatched-domain tasks, since no speech data is required. Our experiments show that J-TAED successfully integrates speech and linguistic information into one model and reduces the WER by 5.8% to 12.8% on the Librispeech dataset. The model is also evaluated on two out-of-domain datasets: one is finance-focused and the other is named-entity-focused. The text-based domain adaptation brings 15.3% and 17.8% WER reductions on those two datasets, respectively.
为混合传感器和关注编码器(TAED)建模提出了一种联合语音和文字优化方法,以利用大量文字资料,提高ASR的准确性。联合TAED(J-TAED)同时接受语音和文字输入模式的培训,但仅将语音数据作为推论期间的输入。经过培训的模型可以将不同方式的内部表述统一起来,并进一步扩大至基于文字的域适应。它可以有效地减轻不匹配域任务的数据稀缺,因为不需要语言数据。我们的实验显示J-TAED成功地将语言信息和语言信息整合到一个模型中,并在Librispeech数据集中将WER减少5.8~12.8 %。模型还用两个主域外数据集进行评估:一个是财务,另一个是指定实体为重点。基于文字的域适应分别使这两个数据集减少15.3%和17.8%的WER。
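For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below computes it, plus a relative reduction between two WER values; the abstract does not say whether its reported reductions are relative or absolute, so the relative form here is an assumption, and the example numbers are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_reduction(10.0, 8.5)
```

Here the hypothesis has one substitution and one deletion against six reference words.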
Article 260
Title@2025-06-23 (1): ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs
Title: ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs | ProxSparse: Regularisiertes Lernen von halbstrukturierten Sparsity Masken für vortrainierte LLMs | ProxSparse:为预先培训的LMM 定期学习半结构化半结构化的顶罩 2502.00258v2 |
Authors (8): Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, George Karypis
Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
大型语言模型(LLMS)在自然语言处理任务中表现出了非凡的性能,但是其庞大的规模使得其效率低、成本高。 半结构化的剪裁作为模型加速的有效方法已经出现,但现有方法并不理想,因为它们侧重于使用超常规则的局部、多层次优化,未能利用全球反馈。我们提出了ProxSparse,这是一个学习基础框架,用于通过正规化优化进行遮罩选择。ProxSparse将僵硬、非差别化的遮罩选择程序转化为一种更平稳的优化程序,允许以灵活的方式逐步遮盖探索。ProxSparse在确定遮罩后不涉及额外的重量更新。我们对7个广泛使用的模型的广泛评价表明,ProxSparse持续地超越了先前提议的半结构化遮罩选择方法,并大大改进了我们所学习的半结构化剪裁方法的有效性。
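To make "semi-structured mask" concrete, the sketch below builds a 2:4 mask (keep 2 of every 4 consecutive weights) by magnitude. Both choices are assumptions for illustration: the abstract does not name the sparsity pattern, and magnitude ranking is the kind of local heuristic the abstract contrasts with ProxSparse's learned selection, not the method itself.

```python
from typing import List


def two_four_mask(weights: List[float]) -> List[int]:
    """Keep the 2 largest-magnitude weights in every group of 4 (1 = keep)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    mask: List[int] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


row = [0.1, -0.9, 0.3, 0.05, 2.0, -0.2, 0.0, 1.5]
mask = two_four_mask(row)
pruned = [w * m for w, m in zip(row, mask)]
```

A learned approach would replace the magnitude ranking with mask variables optimized against a global objective, while producing a mask of the same shape.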
Article 261
Title@2025-06-23 (1): Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
Title: Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series | Zeit-IMM: Ein Datensatz und Benchmark für irreguläre multimodale Multivariate Zeitreihen | 时间-IMM:非正常多式联运多变时间序列的数据集和基准 2506.10412v2 |
Authors (7): Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang
Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
在现实世界应用中的时间序列数据,如医疗保健、气候模型和融资,往往不规则、多模式和混乱,抽样率不同,不统一的方式和普遍缺失;然而,现有基准通常假定清洁、定期抽样、单一方式的数据,造成研究和现实世界部署之间的巨大差距;我们引入了时间-IMM,这是一个数据集,专门用来记录多式多变时间序列中由原因驱动的不规则性;时间-IMM是9种不同的时间序列,分为触发型、约束型和工艺型机制;补充数据集,我们引入IMM-TSF,这是一个用于预测不规则的多式联运时间序列的基准图书馆,能够实现非同步整合和现实评估;IMM-TSM-TSF包括专门的聚合模块,包括一个时间戳至文字融合模块和一个基于多式多变时间序列的多式联运模块;时间-观测平均和关注型整合型战略。
Article 262
Title@2025-06-23 (1): TRAIL: Trace Reasoning and Agentic Issue Localization
Title: TRAIL: Trace Reasoning and Agentic Issue Localization | TRAIL: Nachvollziehen von Vernunft und Agentik Lokalisierung | TRAIL:追踪理由和制剂问题地方化 2505.08638v3 |
Authors (6): Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.
在不同领域越来越多地采用代理工作流程带来了一个至关重要的需要,即必须对这些系统产生的复杂痕迹进行精确和系统的评估。目前的评价方法取决于对长工作流程痕迹进行人工、具体领域的人力分析 – – 这种方法与代理产出的日益复杂和数量不相称。这些环境中的错误分析由于外部工具产出和语言模型推理的相互作用而变得更加复杂,使得这种分析比传统的软件调试更具挑战性。在这项工作中,我们(1) 明确了对代理工作流程痕迹进行强有力和动态评价方法的必要性,(2) 对在代理系统中遇到的错误类型进行正式的分类,(3) 提出一套148种大型人类附加说明的痕迹(TRAIL),这是利用这一分类法建立起来的,并以既定的代理基准为基础。为了确保生态有效性,我们从单一和多试剂系统收集的痕迹,我们侧重于软件工程和开放世界信息检索等现实世界应用程序。我们的评价表明,现代长背景的LIMS在追踪调试方面表现不佳,最佳的Gemini-2.5-pro模型在TRAIL上只评出11%。我们的数据设置和代码是公开的,用以支持和加速未来研究的代理商工作流程。
Article 263
Title@2025-06-23 (1): Human-Aligned Faithfulness in Toxicity Explanations of LLMs
Title: Human-Aligned Faithfulness in Toxicity Explanations of LLMs | Menschlich ausgerichtete Treue in der Toxizität Erklärungen von LLMs | 人与人之间和谐的对LLMM 的毒性解释的信念 2506.19113v1 |
Authors (4): Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs’ reasoning about toxicity – from their explanations that justify a stance – to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs’ free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs’ toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.
围绕NLP的毒性和LLMML的论述主要围绕检测任务。这项工作将重点转向评估LLMS关于毒性的推理 – – 从说明其立场的理由转向评估LLMS关于毒性的推理 – – 以提高其在下游任务中的可信任性。尽管对解释性进行了广泛的研究,但采用现有方法评估自由形式的毒性解释并不简单,因为过度依赖输入文本扰动等等挑战。为此,我们提出了一个新颖的、理论上的多维标准,即人与众不同的信仰(HAF),衡量LLMS的自由形式毒性解释与理想条件下理性人的毒性解释一致的程度。我们根据不确定性的量化制定六项指标,在没有人类参与的情况下全面评估LLMS的毒性解释,并突出“非理想”的解释方式。我们就三种Llama模型(尺寸高达70B)和关于五种不同毒性数据集的8BMLAF模式进行了几次实验。我们的结果表明,LMMS在简单解释的准确解释中,关于毒性解释的推理断断时,在导致其毒性/结果不连贯的单个解释时,我们不连贯的法/结果的法解释。
Article 264
Title@2025-06-23 (1): ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Title: ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | ADVLLM: Iterative Selbst-Tuning LLMs für verbesserte Jailbreaking-Fähigkeiten | ADVLLM: 强化破室能力自动自动自调LMs 2410.18469v4 |
Authors (8): Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
最近的研究显示,大型语言模型(LLMs)很容易受到自动越狱攻击,在这种攻击中,由有害查询所附带的算法所设计的对抗性后缀可以绕过安全调整并引发意外反应;目前制作这些后缀的方法计算成本昂贵,攻击率低,特别是针对Llama2和Llama3等非常接近的模式。为了克服这些限制,我们引入ADV-LLM(ADV-LLM),这是一个迭接自调程序,它能以强化越狱能力来制造对抗性后缀,我们的框架大大减少了产生对抗性后缀的计算成本,同时使各种开放源LMs达到近100 ARS。此外,它显示出向封闭源模式转移攻击性很强,在GPT-3.5和GPT-4上达到99ASR(GPT-3.5和49°SR),尽管只是利用Llama来优化。除了提高越狱能力外,ADV-LM(LM)通过生成大型LM安全研究LM数据集的能力,为未来安全协调研究提供宝贵的见解。
Article 265
Title@2025-06-23 (1): Small Language Models in the Real World: Insights from Industrial Text Classification
Title: Small Language Models in the Real World: Insights from Industrial Text Classification | Kleine Sprachmodelle in der realen Welt: Einblicke aus der industriellen Textklassifikation | 《现实世界中的小语言模式:对工业文本分类的洞察》 2505.16078v3 |
Authors (5): Lujun Li, Lama Sleem, Niccolo’ Gentile, Geoffrey Nichil, Radu State
With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency on inference due to token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.
随着ChatGPT的出现,Transformer模型显著推进了文本分类及相关任务。Llama等仅解码器模型表现出很强的性能和灵活性,但由于逐词元生成,其推理效率较低。
Article 266
Title@2025-06-23 (1): Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
Title: Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting | Sprachmodelle verstehen dich vielleicht nicht: Theorie des Geistes über Story Prompting bewerten | 语言模型可能无法理解你:通过故事提示评估心理理论 2506.19089v1 |
Authors (2): Nathaniel Getachew, Abulhair Saparov
We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
我们提出 $\texttt{StorySim}$,一个用于合成生成故事的可编程框架,用以评估大型语言模型(LLM)的心智理论(ToM)和世界建模(WM)能力。与以往可能受预训练数据污染影响的基准不同,$\texttt{StorySim}$ 生成新颖的、组合式的故事提示,并以高度可控的 $\texttt{Storyboard}$ 为基础,从而能够精确操控角色视角和事件。我们利用该框架设计了一阶和二阶 ToM 任务,以及用于对照跟踪和建模心理状态能力的 WM 任务。我们在一组最先进的 LLM 上进行的实验表明,大多数模型在 WM 任务上的表现优于 ToM 任务,并且模型在针对人类进行推理时往往比针对无生命物体时表现更好。此外,我们的框架使我们发现了启发式行为的证据,例如近因偏差以及对故事中较早事件的过度依赖。所有用于生成数据和评估的代码均可免费获取。
Article 267
Title@2025-06-23 (1): MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation
Title: MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation | MFTCXplain: Ein Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärung | MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集 2506.19073v1 |
Authors (9): Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Flor Plaza-del-Arco, Yalda Daryanai, Farzan Karimi-Malekabadi
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
由于这些系统被用于社会敏感的任务,确保大语言模型的道德推理能力日益成为日益令人关切的问题,然而,目前的评价基准存在两个主要缺陷:缺乏说明,说明道德分类的理由,限制了透明度和可解释性;主要侧重于英语,限制了对不同文化背景的道德推理的评估;在本文件中,我们引入了MFTCXplain,这是一个多语种的基准数据集,用于利用道德基金会理论(MFT)通过仇恨言论多语种解释评价LLM的道德推理;数据集包括葡萄牙语、意大利语、波斯语和英语的3 000个推文,附有二元仇恨言论标签、道德类别和跨层次的文字理由说明;经验性结果突出表明了在道德推理任务中LLM产出与人说明之间的不协调;虽然LMs在仇恨言论探测方面表现良好(F1至0.836),但其预测道德情绪的能力明显薄弱(F1 < 0.35),但理由调整仍然主要限于代表性不足的语言。
Article 268
Title@2025-06-23 (1): HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
Title: HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models | HAWAII: Hierarchischer Wissenstransfer für effiziente Vision-Sprachenmodelle | HAWAII:通过高层次视觉知识转让促进高效视觉语言模型 2506.19072v1 |
Authors (4): Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.
提升视觉语言模型(VLM)的视觉理解能力对于提高其在各类任务上的性能至关重要。使用多个预训练视觉专家虽然展现出巨大潜力,但往往会在训练和推理过程中带来高昂的计算成本。为应对这一挑战,我们提出了 HAWAII,这是一个新框架,它将多个视觉专家的知识蒸馏到单个视觉编码器中,使其能够以极小的计算开销继承多个专家的互补优势。为了缓解不同教师之间的冲突并在不同教师特定的知识之间切换,我们不为多个教师使用一组固定的适配器,而是提出使用教师特定的低秩适配(LoRA)适配器及相应的路由器。每个适配器与一个特定教师对齐,从而避免蒸馏过程中的噪声指导。为了实现高效的知识蒸馏,我们提出了细粒度和粗粒度蒸馏。在细粒度层面,利用词元重要性得分自适应地强调来自每位教师的信息量最大的词元。在粗粒度层面,我们汇总来自多个教师的知识,并通过一组带路由器的通用知识 LoRA 适配器将其传递给学生。在多种视觉语言任务上的大量实验表明,HAWAII 优于流行的开源 VLM。
Article 269
Title@2025-06-23 (1): NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching
Title: NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching | NLPnorth @ TalentCLEF 2025: Vergleich diskriminativer, kontrastiver und prompt-basierter Methoden für Jobtitel und Skill Matching | NLPnortth @ TalentCLEF 2025:比较有歧视、有抵触和基于迅速的方法的职称和技能匹配方法 2506.19058v1 |
Authors (2): Mike Zhang, Rob van der Goot
Matching job titles is a highly relevant task in the computational job market domain, as it improves, e.g., automatic candidate matching, career path prediction, and job market analysis. Furthermore, aligning job titles to job skills can be considered an extension to this task, with similar relevance for the same downstream tasks. In this report, we outline NLPnorth’s submission to TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title Matching, and Job Title-Based Skill Prediction. For both tasks we compare (fine-tuned) classification-based, (fine-tuned) contrastive-based, and prompting methods. We observe that for Task A, our prompting approach performs best with an average of 0.492 mean average precision (MAP) on test data, averaged over English, Spanish, and German. For Task B, we obtain an MAP of 0.290 on test data with our fine-tuned classification-based approach. Additionally, we made use of extra data by pulling all the language-specific titles and corresponding descriptions from ESCO for each job and skill. Overall, we find that the largest multilingual language models perform best for both tasks. Per the provisional results and only counting the unique teams, the ranking on Task A is 5th out of 20 and for Task B 3rd out of 14.
职位名称匹配是计算就业市场领域中一项高度相关的任务,因为它可以改进自动候选人匹配、职业路径预测和就业市场分析等。此外,将职位名称与职业技能对齐可视为该任务的延伸,对相同的下游任务具有类似的意义。在本报告中,我们概述了 NLPnorth 向 TalentCLEF 2025 提交的系统,涵盖以下两项任务:多语言职位名称匹配,以及基于职位名称的技能预测。对于这两项任务,我们比较了(微调的)基于分类的方法、(微调的)基于对比的方法以及提示方法。我们观察到,在任务 A 上,我们的提示方法表现最佳,在测试数据上的平均精度均值(MAP)在英语、西班牙语和德语上平均为 0.492。在任务 B 上,我们基于微调分类的方法在测试数据上取得了 0.290 的 MAP。此外,我们还利用了额外数据,从 ESCO 中提取每个职位和技能的所有特定语言名称及相应描述。总体而言,我们发现最大的多语言模型在两项任务上表现最佳。根据临时结果并仅统计不重复的队伍,我们在任务 A 上排名第 5(共 20 支队伍),在任务 B 上排名第 3(共 14 支队伍)。
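The abstract above reports results as mean average precision (MAP). As a reminder of how that metric is computed, here is a minimal sketch. It assumes binary relevance labels and that every relevant item appears somewhere in each ranked list; the function names and the toy rankings are illustrative and are not taken from the TalentCLEF evaluation code.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    `ranked_relevance` is a list of 0/1 flags in ranked order
    (1 = the item at that rank is relevant).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Toy example: two queries with hypothetical rankings.
toy_queries = [[1, 0, 1], [0, 1]]
print(round(mean_average_precision(toy_queries), 4))
```

For the first toy query the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; the second has its only relevant item at rank 2, giving 1/2.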
Article 270
Title@2025-06-23 (1): Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
Title: Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages | Auswirkungen des visuellen Kontextes auf geräuschreiche multimodale NMT: Eine empirische Studie für Englisch-Indisch-Sprachen | 视觉背景对噪音、多式NMT:英语对印度语经验研究的影响 2308.16075v2 |
Authors (5): Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Chandranath Adak, Asif Ekbal
Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study’s experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. Our code is publicly available at https://github.com/babangain/indicMMT.
使用大规模文本数据,神经机器翻译(NMT)取得了显著的进展,但是在高资源环境下,将多式联运投入,特别是视觉信息纳入多式投入的潜力仍未得到充分探讨。虽然先前的研究侧重于在低资源情景中使用多式数据,但本研究审查了在大规模、经过事先培训的单式NMT系统中添加图像特征对翻译的影响。令人惊讶的是,研究发现图像在此背景下可能是多余的。此外,研究还引入合成噪音,以评估图像是否有助于模型处理文本噪音。多式模型在噪音环境中,即使使用随机图像,也略高于只使用文本的模型。研究的实验将英语转化为印地语、孟加拉语和马来拉姆语,大大超过最新基准。有趣的是,视觉环境的影响随源文本噪音程度而变化:没有视觉环境的最佳效果是低噪音,裁量式图像特征在高音频情景中表现得更好。这在视觉环境,特别是在噪音、孟加拉语系、孟加拉语系和马来拉姆拉姆语翻译环境中的角色上展示光线,这是将各种可公开翻译的图像翻版中的重要。
Article 271
Title@2025-06-23 (1): Rational Metareasoning for Large Language Models
Title: Rational Metareasoning for Large Language Models | Rationales Metareasoning für große Sprachmodelle | 大语言模型的理性元推理 2410.05563v3 |
Authors (5): C. Nicolò De Sabbata, Theodore R. Sumers, Badr AlKhamissi, Antoine Bosselut, Thomas L. Griffiths
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs), deploying additional inference-time compute to improve task performance. However, as LLMs increase in both size and adoption, inference costs are correspondingly becoming increasingly burdensome. How, then, might we optimize reasoning’s cost-performance tradeoff? This work introduces a novel approach based on computational models of metareasoning used in cognitive science, training LLMs to selectively use intermediate reasoning steps only when necessary. We first develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning, then use this reward function with Expert Iteration to train the LLM. Compared to few-shot chain-of-thought prompting and STaR, our method significantly reduces inference costs (20-37% fewer tokens generated across three models) while maintaining task performance across diverse datasets.
激励人们参与推理已成为使用大语言模型(LLMs)的核心技术,运用额外的推论时间计算来改进任务绩效。然而,随着LLMs在规模和采用方面都有所增加,推理成本也相应地变得日益沉重。那么,我们如何优化推理的成本-绩效权衡?这项工作引入了一种基于认知科学中使用的计算转换模型的新颖方法,培训LLMs仅在必要时才有选择地使用中间推理步骤。我们首先开发了一种奖励功能,通过惩罚不必要的推理,将计算值纳入计算值,然后利用这一奖励功能来培训LM。相比于几近一连串的启发思考和STaR,我们的方法大大降低了推理成本(在三个模型中产生的符号减少20-37倍),同时保持了不同数据集的任务性。
Article 272
Title@2025-06-23 (1): Self-reflecting Large Language Models: A Hegelian Dialectical Approach
Title: Self-reflecting Large Language Models: A Hegelian Dialectical Approach | Selbstreflektierende große Sprachmodelle: Ein hegelianischer dialektischer Ansatz | 自我反映大语言模型:海格利人对立方法 2501.14917v6 |
Authors (6): Sara Abdali, Can Goksen, Michael Solodko, Saeed Amizadeh, Julie E. Maybee, Kazuhito Koishida
Investigating NLP through a philosophical lens has recently caught researchers’ eyes, as it bridges computational methods with classical schools of philosophy. This paper introduces a philosophical framework inspired by the Hegelian Dialectic to enable LLMs’ self-reflection, utilizing a self-dialectical approach to emulate internal critiques and synthesize new scientific ideas (spanning domains such as mathematics, physics, and more). Additionally, we explore the effect of generation temperature in LLMs by introducing a dynamic annealing approach, which encourages creativity in the early stages and gradually focuses on refinement and nuance, as well as a constant-temperature strategy. Furthermore, we implement a Multi-Agent Majority Voting (MAMV) strategy to assess the validity and novelty of the generated ideas, which proves useful in the absence of domain experts. We also evaluate the effectiveness of our method in generating novel scientific ideas and improving LLMs’ reasoning capabilities. Our experiments demonstrate promising results in ideation, along with significant improvements in mathematical and symbolic reasoning.
通过哲学透镜进行NLP调查最近引起了研究人员的注意,因为它将计算方法与古典哲学流派相连接。本文件介绍了由Hegelian Diacletic 所启发的哲学框架,使LLMs的自我反省能够使LLMs的自我反省,利用自我反省方法来模仿内部评论和综合新的科学思想(涉及数学、物理等更多领域)。此外,我们还通过采用动态的反射方法,探索LLMS中产生温度的影响,鼓励早期的创造性,并逐步侧重于完善和细微,以及恒温战略。此外,我们实施了多位多数投票(MAMV)战略,以评估所产生的思想的有效性和新颖性,这在没有领域专家的情况下证明是有用的。我们还评估了我们在产生新的科学思想和提高LLMS的推理能力方面的方法的有效性。我们的实验在构想方面显示了有希望的结果,同时在数学和象征性推理方面也有重大改进。
Article 273
Title@2025-06-23 (1): Plan for Speed – Dilated Scheduling for Masked Diffusion Language Models
Title: Plan for Speed – Dilated Scheduling for Masked Diffusion Language Models | Plan für Geschwindigkeit – Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle | 速度计划 – – 蒙面传播语言模型的压缩排程计划 2506.19037v1 |
Authors (3): Omer Luxembourg, Haim Permuter, Eliya Nachmani
Masked diffusion language models (MDLM) have shown strong promise for non-autoregressive text generation, yet existing samplers act as implicit planners, selecting tokens to unmask via denoiser confidence or entropy scores. Such heuristics falter under parallel unmasking: they ignore pairwise interactions between tokens and cannot account for dependencies when unmasking multiple positions at once, limiting their inference time to that of traditional auto-regressive (AR) models. We introduce the Dilated-scheduled Unmasking Strategy (DUS), an inference-only, planner-model-free method that requires no additional training. DUS leverages a first-order Markov assumption to partition sequence positions into dilation-based groups of non-adjacent tokens, enabling independent, parallel unmasking steps that respect local context that minimizes the joint entropy of each iteration step. Unlike semi-AR block approaches (e.g., LLaDA and Dream) that still invoke the denoiser per block, DUS reduces the number of denoiser calls to O(log B) per generation block, yielding substantial speedup over the O(B) run time of state-of-the-art diffusion models, where B is the block size in the semi-AR inference process. In experiments on math (GSM8K) and code completion (HumanEval, MBPP) benchmarks, domains suited to non-ordinal generation, DUS improves scores over parallel confidence-based planners without modifying the underlying denoiser. DUS offers a lightweight, budget-aware approach to efficient, high-quality text generation, paving the way to unlock the true capabilities of MDLMs.
掩码扩散语言模型(MDLM)在非自回归文本生成方面展现出很强的潜力,但现有采样器实际上充当隐式规划器,通过去噪器置信度或熵得分来选择要去掩码的词元。此类启发式方法在并行去掩码时表现不佳:它们忽略词元之间的成对交互,在同时去掩码多个位置时无法考虑依赖关系,使其推理时间受限于传统自回归(AR)模型的水平。我们提出了扩张调度去掩码策略(DUS),这是一种仅用于推理、无需规划器模型且不需要额外训练的方法。DUS 利用一阶马尔可夫假设,将序列位置划分为由不相邻词元构成的基于扩张的组,从而实现独立、并行的去掩码步骤,这些步骤兼顾局部上下文,并使每个迭代步骤的联合熵最小化。与仍需逐块调用去噪器的半自回归块方法(例如 LLaDA 和 Dream)不同,DUS 将每个生成块的去噪器调用次数减少到 O(log B),相比最先进扩散模型 O(B) 的运行时间实现了显著加速,其中 B 是半自回归推理过程中的块大小。在数学(GSM8K)和代码补全(HumanEval、MBPP)基准(适合非顺序生成的领域)上的实验中,DUS 在不修改底层去噪器的情况下,得分优于基于置信度的并行规划器。DUS 为高效、高质量的文本生成提供了一种轻量级、可感知预算的方法,为释放 MDLM 的真正能力铺平了道路。
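To make the idea of partitioning a block into a logarithmic number of groups of non-adjacent positions more concrete, the sketch below builds one such partition for a block whose size is a power of two. This is an illustration only: the abstract does not spell out the exact schedule, so the stride-halving rule and the function name `dilated_groups` are assumptions, not the authors' algorithm.

```python
def dilated_groups(block_size):
    """Split positions 0..block_size-1 into log2(block_size) + 1 groups.

    Hypothetical schedule: start with position 0, then repeatedly halve
    the stride and take the midpoints. Positions inside one group are
    never adjacent, so each group could be unmasked in a single step.
    """
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    groups = [[0]]
    stride = block_size
    while stride > 1:
        half = stride // 2
        # Midpoints of the current stride: half, half + stride, ...
        groups.append(list(range(half, block_size, stride)))
        stride = half
    return groups


for step, group in enumerate(dilated_groups(8)):
    print(step, group)
```

With a block of 8 positions this yields 4 groups instead of 8 sequential steps, which is the kind of O(log B) versus O(B) saving the abstract describes.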
Article 274
Title@2025-06-23 (1): Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Title: Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations | Gebrochene Token? Ihr Sprachmodell kann geheim nicht-kanonische Tokenisierungen handhaben | 断断的音调? 您的语言模式可以秘密处理非天体的音调 。 2506.19004v1 |
Authors (6): Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith
Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.
现代分词器采用确定性算法将文本映射为唯一的“规范”词元序列,但同一字符串也可以用分词器词表编码为许多非规范的分词结果。在这项工作中,我们研究了语言模型对训练中从未见过的非规范分词文本的鲁棒性。令人惊讶的是,在 20 个基准上的评估表明,指令微调模型在随机采样的分词下最多可保留原始性能的 93.4%,在字符级分词下可保留 90.8%。我们发现,整体上更强的模型往往更鲁棒,并且分词结果偏离规范形式越远,鲁棒性越低。受这些结果启发,我们进一步找出了非规范分词方案可以提升性能的场景:字符级切分使字符串操作和代码理解任务最多提升 14%,右对齐的数字分组使大数算术提升 33%。最后,我们探究了这种鲁棒性的来源,发现它产生于指令微调阶段。我们表明,基础模型和后训练模型都能把握非规范分词的语义(将其视为包含拼写错误),但基础模型会试图模仿想象中的错误并退化为无意义的输出,而后训练模型则坚持给出流畅的回答。总体而言,我们的发现表明模型与其分词器的绑定程度低于此前的认识,并展示了在推理时对分词进行干预以提升性能的前景。
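The claim that one string admits many tokenizations under a single vocabulary can be illustrated with a small enumeration. The sketch below uses a made-up toy vocabulary (`TOY_VOCAB` is hypothetical and unrelated to any real tokenizer) and lists every way to segment a word into vocabulary entries; a real tokenizer would deterministically emit only one of these, and the character-level split is simply another valid member of the set.

```python
def all_tokenizations(text, vocab):
    """Return every segmentation of `text` into pieces found in `vocab`."""
    if not text:
        return [[]]  # one way to tokenize the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Extend each tokenization of the remainder with this piece.
            for rest in all_tokenizations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical vocabulary for illustration only.
TOY_VOCAB = {"c", "a", "t", "s", "ca", "at", "cat", "cats"}

for tokens in all_tokenizations("cats", TOY_VOCAB):
    print(tokens)
```

Even for this four-letter word the toy vocabulary allows five distinct token sequences, and the count grows quickly with string length and vocabulary size.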
Article 275
Title@2025-06-23 (1): Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge
Title: Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge | Mirage of Mastery: Memorization Tricks LLMs in künstlich aufgeblasenes Selbstwissen | 万能万能幻影:记忆的秘诀 人工膨胀的自我知识 2506.18998v1 |
Authors (2): Sahil Kale, Vijaykant Nadadur
When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity focused on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models’ perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.
当人工智能错误记忆用于智能时,它会产生一种危险的推理幻觉。现有的研究将LLM的记忆和自我认知缺陷作为单独的问题处理,并且不承认一种相互交织的联系,这种联系会降低LLM答复的可信度。在我们的研究中,我们利用一个新的框架来确定LLMs是否真正从培训数据中学习推理模式,或者只是将它们记忆起来,以承担以STEM领域为重点的类似复杂问题的能力。我们的分析表明一个值得注意的普遍化问题:LLMs从记忆化的解决办法中获取信心,以推断出对其推理能力的更高自知度,这在面对自我验证、逻辑上一致的任务扭曲的任务时,其可行性评估有超过45%的不一致性。这种效应在科学和医学领域最为明显,往往有最高标准的术语和问题,进一步证实了我们的做法。LMS的自我认知也显示出当前结构和培训模式的缺陷,这突出表明需要技术确保模型对自身知识的平衡、一致立场,以获得最高程度的AI解释性和信任性。我们现有的代码和结果是公开的。
Article 276
Title@2025-06-23 (1): Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Title: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations | Vision als Dialekt: Visuelle Verständigung und Generierung über textorientierte Repräsentationen vereinen | 视觉透视:通过文本统一代表方式统一视觉理解和生成 2506.18898v1 |
Authors (9): Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model’s (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com
本文提出一个多式框架,试图在共同的离散语义代表中统一视觉理解和生成;其核心是文本统一调控器(TA-Tok),将图像转换成离散符号,使用从大语言模式词汇中预测的文本统一代码表,将图像转换成离散符号;通过将愿景和文本整合成一个统一的空间,扩大词汇,我们的多式LLM(Tar)将多式LLM(TA-Tok),通过共享界面,实现跨模式输入和产出,而不需要特定模式设计;此外,我们提议采用比例调控编码和解码,以平衡效率和视觉细节;同时采用基因化解调控器,以产生高异性视觉输出;为解决多种不同的解码需求,我们使用两种互补的解码器:快速自动递增模式和基于扩散的模式;为加强模式融合,我们调查先进的培训前任务,展示视觉理解和生成两方面的改进;跨基准的实验显示,Tar与现有多式LM(LM)方法相匹配或超过现有方法,实现更快的趋同,提高培训效率。
Article 277
Title@2025-06-23 (1): ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Title: ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs | GrundFlux-PRM: Trajektorien-Bewusst-PRMs für lange Ketten-of-Thought-Reasoning in LLMs | 合理性-PRM:LLMs中用于长期研究链原因的轨迹-软件 2506.18896v1 |
Authors (7): Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Projects: https://github.com/Gen-Verse/ReasonFlux
过程奖励模型(PRM)近来已成为监督大型语言模型(LLM)中间推理步骤的有力框架。以往的 PRM 主要基于模型的最终输出响应进行训练,难以稳健地评估中间思考轨迹,尤其是在 Deepseek-R1 等前沿推理模型生成轨迹-响应式输出这一新兴场景下。在这项工作中,我们提出了 ReasonFlux-PRM,这是一种新的轨迹感知 PRM,专门用于评估轨迹-响应类型的推理轨迹。ReasonFlux-PRM 同时引入步骤级和轨迹级监督,使细粒度的奖励分配能够与结构化的思维链数据对齐。我们对 ReasonFlux-PRM 进行适配,使其支持离线和在线两种设置下的奖励监督,包括:(i) 为较小模型的下游监督微调选择高质量的模型蒸馏数据;(ii) 在强化学习中为策略优化提供稠密的过程级奖励;(iii) 实现奖励引导的 Best-of-N 测试时扩展。在 AIME、MATH500 和 GPQA-Diamond 等具有挑战性的下游基准上的实证结果表明,ReasonFlux-PRM-7B 选出的数据质量高于强 PRM(例如 Qwen2.5-Math-PRM-72B)和人工整理的基线。此外,我们得到的 ReasonFlux-PRM-7B 带来了一致的性能提升,在监督微调中平均提升 12.1%,在强化学习中提升 4.5%,在测试时扩展中提升 6.3%。我们还发布了高效的 ReasonFlux-PRM-1.5B,用于资源受限的应用和边缘部署。项目:https://github.com/Gen-Verse/ReasonFlux
Article 278
Title@2025-06-23 (1): OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
Title: OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization | OMEGA: Kann LLMs Vernunft außerhalb der Box in Mathe? Bewertung von exploratorischen, kompositorischen und transformativen Generalisierung | OMEGA: 数学框外的理学理学优异人士能否评价探索、构成和变换的通用性? 2506.18880v1 |
Authors (7): Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song
Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden’s typology of creativity: (1) Exploratory: applying known problem solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
近期具备长思维链推理能力的大型语言模型(LLM),例如 DeepSeek-R1,在奥林匹克级别的数学基准上取得了令人瞩目的成绩。然而,它们往往依赖于一小组策略,在需要新思路的问题上表现吃力。为了系统地研究这些局限,我们提出了 OMEGA(Out-of-distribution Math Problems Evaluation with 3 Generalization Axes),这是一个受控而又多样的基准,旨在评估分布外泛化的三个维度,其灵感来自 Boden 的创造力类型学:(1) 探索性:将已知的解题技能应用于同一问题领域内更复杂的实例;(2) 组合性:将此前分别学到的不同推理技能结合起来,以解决需要以新的、连贯的方式整合这些技能的新问题;(3) 变革性:超越熟悉的方法,采用新颖且往往非常规的策略,以更有效地解决问题。OMEGA 由程序化生成的训练-测试对组成,这些训练-测试对来自涵盖几何、数论、代数、组合数学、逻辑和谜题的模板化问题生成器,其解答通过符号、数值或图形方法验证。我们评估了前沿(或顶级)LLM,并观察到随着问题复杂度增加,性能急剧下降。此外,我们在所有泛化设置下对 Qwen 系列模型进行了微调,观察到探索性泛化有显著提升,而组合性泛化仍然有限,变革性推理则几乎没有改善。通过分离并量化这些细粒度的失败,OMEGA 为推动 LLM 超越机械式的熟练、走向真正的数学创造力奠定了基础。
Article 279
Title@2025-06-23 (1): CommVQ: Commutative Vector Quantization for KV Cache Compression
Title: CommVQ: Commutative Vector Quantization for KV Cache Compression | CommVQ: Kommutative Vector Quantization für KV Cache Compression | commVQ: KV 缓存压缩的通量矢量量 2506.18879v1 |
Authors (11): Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
大型语言模型(LLM)越来越多地用于需要长上下文的应用,但随着上下文增长,键值(KV)缓存往往成为 GPU 上的内存瓶颈。为此,我们提出可交换向量量化(CommVQ),以显著降低长上下文 LLM 推理的内存占用。我们首先引入加性量化,利用轻量级编码器和码本来压缩 KV 缓存,并可通过简单的矩阵乘法解码。为进一步降低解码时的计算成本,我们将码本设计为与旋转位置编码(RoPE)可交换,并使用期望最大化(EM)算法对其进行训练。这使得解码能够高效地融入自注意力机制。我们的方法通过加性量化实现高精度,并通过与 RoPE 可交换的码本实现低开销。在长上下文基准和 GSM8K 上的实验表明,我们的方法在 2 比特量化下将 FP16 KV 缓存大小减少了 87.5%,同时优于最先进的 KV 缓存量化方法。值得注意的是,它能够在精度损失极小的情况下实现 1 比特 KV 缓存量化,使 LLaMA-3.1 8B 模型能够在单块 RTX 4090 GPU 上以 128K 的上下文长度运行。源代码见:https://github.com/UMass-Embodied-AGI/CommVQ。
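The 87.5% figure follows directly from the bit widths: storing 2 bits instead of 16 per cached value keeps 2/16 of the original size. The sketch below works through that arithmetic and a rough cache-size estimate at 128K context. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption about LLaMA-3.1 8B that is not stated in the abstract, and the estimate ignores codebook and other overhead, so treat the numbers as back-of-the-envelope only.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return num_tokens * values_per_token * bits_per_value / 8


def reduction_vs_fp16(bits_per_value):
    """Fractional size reduction relative to 16-bit storage."""
    return 1 - bits_per_value / 16


# Assumed model shape (hypothetical values for illustration).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)
CONTEXT = 128 * 1024  # 128K tokens
GIB = 1024 ** 3

for bits in (16, 2, 1):
    size_gib = kv_cache_bytes(CONTEXT, bits_per_value=bits, **SHAPE) / GIB
    print(f"{bits:>2}-bit: {size_gib:.2f} GiB, reduction {reduction_vs_fp16(bits):.2%}")
```

Under these assumptions the FP16 cache alone would occupy a large share of a 24 GB consumer GPU at 128K context, which is consistent with the abstract's point that aggressive quantization is what makes such context lengths fit on a single card.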
Article 280
Title@2025-06-23 (1): A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap
Title: A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap | Ein Kommentar zu “Die Illusion des Denkens”: Den vernünftigen Cliff als Agent-Gap zurückweisen | 关于“思考的幻觉”的评论:将理性断裂重新定位为一种危险差距 2506.18957v1 |
Authors (3): Sheraz Khan, Subha Madhavan, Kannan Natarajan
The recent work by Shojaee et al. (2025), titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, presents a compelling empirical finding, a reasoning cliff, where the performance of Large Reasoning Models (LRMs) collapses beyond a specific complexity threshold, which the authors posit as an intrinsic scaling limitation of Chain-of-Thought (CoT) reasoning. This commentary, while acknowledging the study’s methodological rigor, contends that this conclusion is confounded by experimental artifacts. We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints in the static, text-only evaluation paradigm, including tool use restrictions, context window recall issues, the absence of crucial cognitive baselines, inadequate statistical reporting, and output generation limits. We reframe this performance collapse through the lens of an agentic gap, asserting that the models are not failing at reasoning, but at execution within a profoundly restrictive interface. We empirically substantiate this critique by demonstrating a striking reversal. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount. Additionally, our empirical analysis of tool-enabled models like o4-mini and GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural execution to complex meta-cognitive self-correction, which has significant implications for how we define and measure machine intelligence. The illusion of thinking attributed to LRMs is less a reasoning deficit and more a consequence of an otherwise capable mind lacking the tools for action.
Shojaee等人(2025年)最近的工作《思考的幻觉:从问题复杂度的视角理解推理模型的优势与局限》提出了一个令人信服的实证发现,即“推理悬崖”:大型推理模型(LRMs)的性能在超过特定复杂度阈值后崩溃,作者认为这是思维链(CoT)推理固有的规模限制。本评论承认该研究在方法上的严谨性,但认为这一结论受到实验假象的干扰。我们认为,所观察到的失败并非基本认知界限的证据,而是静态、纯文本评价范式中系统层面约束的可预见结果,这些约束包括工具使用限制、上下文窗口回忆问题、关键认知基线的缺失、统计报告不充分以及输出生成限制。我们从“能动性差距”(agentic gap)的角度重新审视这种性能崩溃,认为模型并不是在推理上失败,而是在极其受限的接口内执行时失败。我们通过展示一个显著的逆转来实证支持这一批评:一个模型在仅限文本生成时最初宣称某个谜题不可能解决,而在使用智能体工具后不仅解决了它,还掌握了远超其先前未能跨越的推理悬崖的复杂度变体。此外,我们对o4-mini和GPT-4o等可使用工具的模型的实证分析揭示了一种能动推理的层级,从简单的程序执行到复杂的元认知自我纠正,这对我们如何定义和衡量机器智能具有重要意义。归于LRMs的“思考的幻觉”与其说是推理缺陷,不如说是一个本有能力的心智缺乏行动工具的结果。
Article 281
Title@2025-06-23 (1): Mechanistic Interpretability Needs Philosophy
Title: Mechanistic Interpretability Needs Philosophy | Mechanistische Interpretierbarkeit braucht Philosophie | 机械可解释性需要哲学 2506.18852v1 |
Authors (9): Iwan Williams, Ninell Oldenburg, Ruchira Dhar, Joshua Hatherley, Constanza Fierro, Nina Rajcic, Sandrine R. Schiller, Filippos Stamatiou, Anders Søgaard
Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
机械可解释性(MI)旨在解释神经网络如何通过发现其内在因果机制而发挥作用。 随着这个领域影响力的扩大,不仅研究模型本身,而且研究模型研究中隐含的假设、概念和解释战略,变得越来越重要。 我们争论说,机械可解释性需要哲学:不是事后思考,而是作为澄清其概念、完善其方法以及评估解释人工智能系统的认知性和道德利害关系的持续伙伴。 以三个尚未解决的医学文献问题为例,本立场文件以三个尚未解决的问题为例,展示了可增加人工智能研究的价值哲学,并勾画了通往更深层次跨学科对话的道路。
Article 282
Title@2025-06-23 (1): USAD: Universal Speech and Audio Representation via Distillation
Title: USAD: Universal Speech and Audio Representation via Distillation | USAD: Universelle Sprach- und Audiodarstellung über Destillation | USAD:通过蒸馏实现普遍言论和音频代表 2506.18843v1 |
Authors (4): Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
自我监督的学习(SSL)已经使声音表现发生了革命性的变化,但模型往往仍然具有特定领域的特点,侧重于语言或非语言任务。在这项工作中,我们介绍了通用语音和音效蒸馏(USAD),这是将多种声音类型(语音、声音和音乐)整合为单一模式的听力表现学习的统一方法。 USAD从具体领域的SSL模型中采用了高效的层到层的蒸馏方法,对学生进行综合音频数据集培训。 USAD提供各种基准和数据集的竞争性性能,包括框架和实例级语音处理任务、音频标记和音频分类,在SUPERB和HEAR基准上使用单一编码器实现接近最新的结果。
Article 283
Title@2025-06-23 (1): LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Title: LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning | LongWriter-Zero: Mastering Ultra-Long Text Generation via Verstärkungslernen | LongWriter-Zero:通过强化学习掌握超大龙制文本 2506.18841v1 |
Authors (5): Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on “teaching”, which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
大型语言模型(LLMs)的超长文本生成是一个需求广泛的场景,但由于其最大生成长度限制以及随序列长度增加而出现的整体质量下降,这仍然是一个重大挑战。以LongWriter为代表的以往方法通常依赖“教学”,即对合成长文本输出进行监督微调(SFT)。然而,这一策略严重依赖合成SFT数据,而这类数据构建困难且成本高昂,往往缺乏连贯性和一致性,并且容易过于人工化、结构单调。在这项工作中,我们提出一种基于激励的方法,完全从零开始,不依赖任何标注或合成数据,利用强化学习(RL)促使LLMs涌现出超长、高质量的文本生成能力。我们从一个基础模型开始进行RL训练(类似于R1-Zero),引导它进行有助于写作过程中规划和完善的推理。为此,我们采用专门的奖励模型,引导LLM改进长度控制、写作质量和结构格式。实验评估表明,我们从Qwen2.5-32B训练得到的LongWriter-Zero模型在长文本写作任务上始终优于传统SFT方法,在WritingBench和Arena-Write的所有指标上取得最先进的结果,甚至超过DeepSeek R1和Qwen3-235B等100B+模型。我们在 https://huggingface.co/THU-KEG/LongWriter-Zero-32B 开源了数据和模型检查点。
Article 284
Title@2025-06-23 (1): EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions
Title: EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions | EMULATE: Multi-Agenten-Rahmen für die Bestimmung der Veracity von atomaren Forderungen durch Emulation menschlicher Handlungen | EMULATE:通过模拟人类行动确定原子索赔的多机构框架 2505.16576v2 |
Authors (3): Spencer Hong, Meng Luo, Xinyi Wan
Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework.
确定原子主张的真实性是许多最近提议的事实核对系统的一个必要组成部分。许多方法首先通过查询搜索引擎来检索证据,然后进行分类,提供证据组和原子对大语言模型的主张,从而解决这个问题,但这一过程偏离了人类为完成任务而要做的工作。最近的工作试图解决这一问题,办法是提出迭代证据检索,允许收集证据好几次,而且只有在必要的时候才能这样做。我们继续沿着这一研究轨迹,提议建立一个称为EMULATE的新的索赔核实系统,目的是通过使用多试剂框架更好地仿效人类的行动,使每个代理人都履行较大的任务中的一小部分,例如根据预先确定的标准进行排序搜索结果或评价网页内容。关于若干基准的广泛试验表明比以前的工作有明显改进,表明我们新的多剂框架的功效。
Article 285
Title@2025-06-23 (1): STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
Title: STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning | STU-PID: Steuerungs-Token-Verwendung über PID-Controller für effizientes Large Language Model Reasoning | STU-PID:通过PID控制器用于高效大语言示范理由的指导使用 2506.18831v1 |
Authors (1): Aryasomayajula Ram Bharadwaj
Large Language Models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon, generating excessive and redundant reasoning steps that increase computational costs while potentially degrading performance. While recent work has explored static steering approaches to mitigate this issue, these approaches lack the adaptability to dynamically adjust intervention strength based on real-time reasoning quality. We propose STU-PID (Steering Token Usage via PID controller), a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability. Experimental evaluation on GSM8K demonstrates that STU-PID achieves a 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines. Our method provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency.
使用思维链推理的大型语言模型往往受到过度思考现象的影响,产生过多和多余的推理步骤,增加计算成本,同时可能降低业绩。虽然最近的工作探索了静态指导方法来缓解这一问题,但它们缺乏根据实时推理质量动态调整干预强度的适应性。我们建议采用一种新型的无培训方法STUPID(通过 PID 控制器进行调控 Token 使用),即使用PID 控制器在推论期间动态调节激活方向力的新型无培训方法。我们的方法将一个用于探测冗余推理模式的块级分类器与一个根据预测冗余概率调整指导强度的PID控制机制结合起来。关于GSM8K的实验评估表明,STUPID在将象征性使用率减少32%的同时,实现了6%的准确性提高,超过了静态方向基线。我们的方法为动态推理校提供了原则框架,既保持推理质量,又显著提高计算效率。
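The control loop described in this abstract can be sketched as follows. This is only a minimal illustration of a discrete PID controller driven by a per-chunk redundancy probability; the class name, the gains, the setpoint, and the clamping range are all assumptions made for the example and are not taken from the paper. How the resulting strength is applied to model activations is not shown.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (hypothetical gains, not the paper's)."""
    kp: float
    ki: float
    kd: float
    setpoint: float        # assumed target redundancy probability
    out_min: float = 0.0   # assumed lower bound on steering strength
    out_max: float = 1.0   # assumed upper bound on steering strength
    _integral: float = 0.0
    _prev_error: float = 0.0

    def update(self, measurement: float, dt: float = 1.0) -> float:
        # Positive error means the chunk looks more redundant than desired.
        error = measurement - self.setpoint
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        raw = self.kp * error + self.ki * self._integral + self.kd * derivative
        # Clamp so the steering strength stays in a fixed range.
        return max(self.out_min, min(self.out_max, raw))


def steering_schedule(redundancy_probs, controller):
    """Map a sequence of per-chunk redundancy probabilities to steering strengths."""
    return [controller.update(p) for p in redundancy_probs]


if __name__ == "__main__":
    pid = PIDController(kp=0.8, ki=0.1, kd=0.05, setpoint=0.3)
    print(steering_schedule([0.2, 0.5, 0.9, 0.4], pid))
```

The appeal of a controller over a fixed steering strength is that the intervention grows only while the classifier keeps reporting redundancy and relaxes once it stops.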
Article 286
Title@2025-06-23 (1): MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation Translation task
Title: MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation Translation task | MLLP-VRAIN UPV-System für die IWSLT 2025 Simultanübersetzung | MLLP-VRAIN IWSLT 2025 IWSLT 双声翻译翻译任务MLLP-VRAIN UPV系统 2506.18828v1 |
Authors (5): Jorge Iranzo-Sánchez, Javier Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan
This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to enhance the MT model’s ability to handle incomplete inputs, while incorporating adaptive emission policies including a wait-$k$ strategy and RALCP for managing the translation stream. Specialized buffer management techniques and segmentation strategies ensure coherent translations across long audio sequences. Experimental results on the ACL60/60 dataset demonstrate that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score on the official test set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully adapted pre-trained components can create effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.
本文介绍了MLLP-VRAIN研究小组参加IWSLT 2025同声语音翻译赛道共享任务的情况。我们的提交系统通过开发一个模块化级联系统,将强大的预训练模型适配到流式场景,以应对长篇语音实时翻译的独特挑战。我们将用于ASR的Whisper Large-V3-Turbo与用于MT的多语言NLLB-3.3B模型相结合,采用轻量级适配技术,而不是从零开始训练新的端到端模型。我们的方法采用文档级适配和前缀训练来增强MT模型处理不完整输入的能力,同时引入自适应输出策略,包括wait-$k$策略和RALCP,以管理翻译流。专门的缓冲区管理技术和分段策略确保长音频序列上的翻译连贯。在ACL60/60数据集上的实验结果表明,我们的系统在翻译质量和延迟之间取得了良好的平衡,BLEU得分为31.96,不计计算开销的StreamLAAL延迟为2.94秒。我们的最终模型在官方测试集(IWSLT25Instruct)上取得了29.8 BLEU的初步成绩。我们的工作表明,经过精心适配的预训练组件可以为长篇内容构建有效的同声翻译系统,而无需大量领域内平行数据或专门的端到端训练。
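For readers unfamiliar with the wait-$k$ policy mentioned in this abstract, the sketch below generates the textbook read/write schedule: read $k$ source tokens first, then alternate one write with one read until the source is exhausted. The function name and the action tuples are hypothetical; the system's adaptive emission (RALCP) and buffer management are not modeled here.

```python
def wait_k_schedule(num_source: int, num_target: int, k: int):
    """Return READ/WRITE actions for a plain wait-k policy.

    Before emitting target token j (0-based), the policy requires that
    min(num_source, k + j) source tokens have been read.
    """
    actions = []
    read = 0
    for written in range(num_target):
        needed = min(num_source, k + written)
        while read < needed:
            actions.append(("READ", read))
            read += 1
        actions.append(("WRITE", written))
    return actions


if __name__ == "__main__":
    for action in wait_k_schedule(num_source=5, num_target=5, k=2):
        print(action)
```

A larger `k` gives the translator more source context per emitted token at the cost of higher latency, which is the quality/latency trade-off the abstract reports.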
Article 287
Title@2025-06-23 (1): A Survey on Data Selection for LLM Instruction Tuning
Title: A Survey on Data Selection for LLM Instruction Tuning | Eine Umfrage zur Datenauswahl für LLM Instruction Tuning | 关于LLM指示图示数据选择的调查 2402.05123v2 |
Authors (6): Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu
Instruction tuning is a vital step in training large language models (LLMs), so how to enhance its effect has received increased attention. Existing work indicates that the quality of the dataset is more crucial than its quantity during instruction tuning of LLMs. Therefore, many recent studies focus on exploring methods for selecting high-quality subsets from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. Firstly, we introduce the widely used instruction datasets. Then, we propose a new taxonomy of the data selection methods and provide a detailed introduction of recent advances; the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.
教学调整是培训大型语言模型(LLM)的一个重要步骤,因此,如何加强教学调整的效果受到越来越多的关注。现有工作表明,数据集的质量比LLM教学调整期间的数量更为关键。因此,最近许多研究的重点是探索从教学数据集中选择高质量子集的方法,目的是减少培训费用,提高LLMS的教学能力。本文件对LLM教学调整的数据选择进行了全面调查。首先,我们引入了广泛使用的教学数据集。然后,我们提出了数据选择方法的新分类,并详细介绍了最新进展,还详细阐述了数据选择方法的评价战略和结果。最后,我们强调公开的挑战,并提出了这项任务的新领域。
Article 288
Title@2025-06-23 (1): RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
Title: RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies | RWEZusammenfassung: Ein Rahmen und Test zur Auswahl großer Sprachmodelle zur Zusammenfassung von Real-World Evidence (RWE) Studien | RWE 摘要:为总结真实世界证据研究而选择大语言模型的框架和测试 2506.18819v1 |
Authors (4): Arjun Mukerji, Michael L. Jackson, Jason Jones, Neil Sanghavi
Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
大型语言模型(LLMS)已经为一般归纳任务和医学研究援助进行了广泛的评价,但是没有专门评价它们,以便根据RWE研究的结构性产出总结真实世界证据(RWE)的任务。我们引入了RWESummary,这是MedHELM框架(Bedi、Cui、Fuentes、Unell等人,2025年)的拟议补充,以便能够为这一任务制定LLMS基准。RWESummary包括一个假设和三个评价,涉及在医学研究总结中观察到的主要类型的错误,并且使用Atropos健康专有数据来制定。此外,我们使用RWESummary来比较我们内部RWE总归工具中不同LMS的性能。在公布时,我们发现Gemini 2.5模型(包括Flash和Pro)总体效果最佳。我们建议,作为现实世界证据研究总和实用的基础基准基准。我们建议,将RWESummary作为新的和有用的基准基准。
Article 289
Title@2025-06-23 (1): Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models
Title: Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models | Schritt für Schritt Entlarvung für Parameter-Effizient Feinabstimmung von großen Sprachmodellen | 大型语言模型的分步制式分步式微微调 2408.14470v3 |
Authors (4): Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty
Fine-tuning large language models (LLMs) on downstream tasks requires substantial computational resources. Selective PEFT, a class of parameter-efficient fine-tuning (PEFT) methodologies, aims to mitigate these computational challenges by selectively fine-tuning only a small fraction of the model parameters. Although parameter-efficient, these techniques often fail to match the performance of fully fine-tuned models, primarily due to inherent biases introduced during parameter selection. Traditional selective PEFT techniques use a fixed set of parameters selected using different importance heuristics, failing to capture parameter importance dynamically and often leading to suboptimal performance. We introduce $\text{ID}^3$, a novel selective PEFT method that calculates parameter importance continually, and dynamically unmasks parameters by balancing exploration and exploitation in parameter selection. Our empirical study on 16 tasks spanning natural language understanding, mathematical reasoning and summarization demonstrates the effectiveness of our method compared to fixed-masking selective PEFT techniques. We analytically show that $\text{ID}^3$ reduces the number of gradient updates by a factor of two, enhancing computational efficiency. Since $\text{ID}^3$ is robust to random initialization of neurons and operates directly on the optimization process, it is highly flexible and can be integrated with existing additive and reparametrization-based PEFT techniques such as adapters and LoRA respectively.
在下游任务上微调大型语言模型(LLMs)需要大量计算资源。选择性PEFT是一类参数高效微调(PEFT)方法,旨在通过只选择性地微调一小部分模型参数来缓解这些计算挑战。这些技术虽然参数高效,但往往无法达到完全微调模型的性能,主要原因是参数选择过程中引入的固有偏差。传统的选择性PEFT技术使用依据不同重要性启发式选出的一组固定参数,无法动态地捕捉参数重要性,常常导致次优性能。我们提出$\text{ID}^3$,这是一种新的选择性PEFT方法,它持续计算参数重要性,并通过在参数选择中平衡探索与利用来动态地解除参数掩码。我们在涵盖自然语言理解、数学推理和摘要的16项任务上的实证研究表明,与固定掩码的选择性PEFT技术相比,我们的方法是有效的。我们通过分析表明,$\text{ID}^3$将梯度更新次数减少了一半,从而提高了计算效率。由于$\text{ID}^3$对神经元的随机初始化具有鲁棒性,并且直接作用于优化过程,它非常灵活,可以与现有的加性PEFT技术和基于重参数化的PEFT技术(分别如适配器和LoRA)相结合。
Article 290
Title@2025-06-23 (1): Existing LLMs Are Not Self-Consistent For Simple Tasks
Title: Existing LLMs Are Not Self-Consistent For Simple Tasks | Bestehende LLMs sind für einfache Aufgaben nicht selbstkonsistent | 现有的LLLM女士对于简单任务不具有自我关联性 2506.18781v1 |
Authors (4): Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency – no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods – a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
大型语言模型(LLMS)已变得越来越强大,然而,确保其决定保持透明和可信赖需要自我一致 – – 内部推理没有自相矛盾。我们的研究显示,即使是在简单的任务上,如比较线上或平面上的点数,或家庭树上的推理,所有较小的模型都高度不一致,甚至DeepSeek-R1和GPT-o4-mini等最先进的模型也不完全自相矛盾。为了量化和减少这些不一致之处,我们引入了不一致的尺度,并提出了两种自动化方法 – – 一种基于图表和以能源为基础的方法。这些方法虽然提供了部分改进,但也强调了在建立更可靠和可解释的AI方面自我一致的复杂性和重要性。代码和数据可在https://github.com/scorpio-nova/llm-self-confistence查阅。
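To make the notion of self-consistency on a task like "comparing points on a line" concrete, here is one simple illustrative check: count how often a model's pairwise "x is left of y" answers form a cycle, which no real ordering could produce. This is a hypothetical measure written for this digest, not the inconsistency metric defined in the paper; the function name and the answer format are assumptions.

```python
from itertools import combinations, permutations


def cyclic_triple_rate(items, less_than):
    """Fraction of item triples whose pairwise answers contain a cycle.

    `less_than` maps an ordered pair (x, y) to True when the model
    answered "x < y". Missing pairs are treated as False.
    """
    triples = list(combinations(items, 3))
    if not triples:
        return 0.0
    cyclic = 0
    for triple in triples:
        for a, b, c in permutations(triple):
            # a < b, b < c and c < a cannot all hold for points on a line.
            if (less_than.get((a, b), False)
                    and less_than.get((b, c), False)
                    and less_than.get((c, a), False)):
                cyclic += 1
                break
    return cyclic / len(triples)


if __name__ == "__main__":
    answers = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
    print(cyclic_triple_rate(["A", "B", "C"], answers))
```

A rate of zero does not prove the answers are correct, only that they are mutually compatible with some ordering.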
Article 291
Title@2025-06-23 (1): Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training
Title: Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training | Programmierung durch Backprop: LLMs erwerben wiederverwendbare algorithmische Abstraktionen während des Code-Trainings | 通过反向传播编程:LLM在代码训练中习得可复用的算法抽象 2506.18777v1 |
Authors (7): Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis
Training large language models (LLMs) on source code significantly enhances their general-purpose reasoning abilities, but the mechanisms underlying this generalisation are poorly understood. In this paper, we propose Programming by Backprop (PBB) as a potential driver of this effect - teaching a model to evaluate a program for inputs by training on its source code alone, without ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of programs representing simple maths problems and algorithms: one with source code and I/O examples (w/ IO), the other with source code only (w/o IO). We find evidence that LLMs have some ability to evaluate w/o IO programs for inputs in a range of experimental settings, and make several observations. Firstly, PBB works significantly better when programs are provided as code rather than semantically equivalent language descriptions. Secondly, LLMs can produce outputs for w/o IO programs directly, by implicitly evaluating the program within the forward pass, and more reliably when stepping through the program in-context via chain-of-thought. We further show that PBB leads to more robust evaluation of programs across inputs than training on I/O pairs drawn from a distribution that mirrors naturally occurring data. Our findings suggest a mechanism for enhanced reasoning through code training: it allows LLMs to internalise reusable algorithmic abstractions. Significant scope remains for future work to enable LLMs to more effectively learn from symbolic procedures, and progress in this direction opens other avenues like model alignment by training on formal constitutional principles.
在源代码上训练大型语言模型(LLMs)能显著增强其通用推理能力,但这种泛化背后的机制尚不清楚。在本文中,我们提出“通过反向传播编程”(Programming by Backprop,PBB)作为这一效应的潜在驱动因素:仅通过在程序源代码上训练,而从未见过I/O示例,就教会模型针对输入对程序求值。为探索这一想法,我们在两组表示简单数学问题和算法的程序上微调LLMs:一组同时提供源代码和I/O示例(w/ IO),另一组只提供源代码(w/o IO)。我们发现有证据表明,在一系列实验设置中,LLMs具有一定的针对输入对w/o IO程序求值的能力,并得出若干观察结果。首先,当程序以代码形式而非语义等价的语言描述形式提供时,PBB的效果明显更好。其次,LLMs可以通过在前向传播中隐式地对程序求值来直接给出w/o IO程序的输出,而在上下文中通过思维链逐步执行程序时则更为可靠。我们进一步表明,与在取自模拟自然数据分布的I/O对上训练相比,PBB能使程序在不同输入上的求值更加稳健。我们的发现提示了一种通过代码训练增强推理的机制:它使LLMs能够内化可复用的算法抽象。未来工作仍有很大空间,可以使LLMs更有效地从符号化过程中学习,而这一方向上的进展还会开辟其他途径,例如通过在形式化的宪法式原则上训练来实现模型对齐。
Article 292
Title@2025-06-23 (1): Neural Total Variation Distance Estimators for Changepoint Detection in News Data
Title: Neural Total Variation Distance Estimators for Changepoint Detection in News Data | Neurale Gesamtvariationsdistanz-Schätzer für Changepoint Detection in News Daten | 用于新闻数据中变化点探测变化点的神经总变化 2506.18764v1 |
Authors (3): Csaba Zsolnai, Niels Lörch, Julian Arnold
Detecting when public discourse shifts in response to major events is crucial for understanding societal dynamics. Real-world data is high-dimensional, sparse, and noisy, making changepoint detection in this domain a challenging endeavor. In this paper, we leverage neural networks for changepoint detection in news data, introducing a method based on the so-called learning-by-confusion scheme, which was originally developed for detecting phase transitions in physical systems. We train classifiers to distinguish between articles from different time periods. The resulting classification accuracy is used to estimate the total variation distance between underlying content distributions, where significant distances highlight changepoints. We demonstrate the effectiveness of this method on both synthetic datasets and real-world data from The Guardian newspaper, successfully identifying major historical events including 9/11, the COVID-19 pandemic, and presidential elections. Our approach requires minimal domain knowledge, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content, making it valuable for journalism, policy analysis, and crisis monitoring.
当公众对话因重大事件而转移时,对了解社会动态至关重要。现实世界数据是高维、稀疏和吵闹的,使得该领域的变化点探测成为一项具有挑战性的努力。在本文中,我们利用神经网络在新闻数据中进行变化点探测,采用基于所谓的“逐个学习”计划的方法,该计划最初是为探测物理系统的阶段过渡而开发的。我们训练分类人员区分不同时间段的文章。由此得出的分类精确度被用来估计基本内容分布之间的总差异距离,其中显著距离突出变化点。我们展示了这一方法在合成数据集和《卫报》真实世界数据方面的有效性,成功识别了包括911、COVID-19大流行和总统选举在内的重大历史事件。我们的方法需要最低限度的域知识,能够自主发现公共讨论的重大变化,并产生内容变化的量化尺度,从而对新闻、政策分析和危机监测具有价值。
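The link between classification accuracy and total variation (TV) distance used in this abstract rests on a standard identity: with balanced classes, the best achievable accuracy for telling samples of $P$ from samples of $Q$ is $(1 + \mathrm{TV}(P, Q))/2$, so any classifier's accuracy gives the lower bound $\mathrm{TV} \ge 2 \cdot \mathrm{acc} - 1$. The sketch below checks this on small discrete distributions; the function names are hypothetical, and the paper's neural estimator is not reproduced.

```python
def total_variation(p, q):
    """TV distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def bayes_accuracy(p, q):
    """Best accuracy for guessing the source of a sample, with equal priors."""
    support = set(p) | set(q)
    return 0.5 * sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in support)


def tv_lower_bound(accuracy):
    """Lower bound on TV implied by a balanced two-class accuracy."""
    return max(0.0, 2.0 * accuracy - 1.0)


if __name__ == "__main__":
    before = {"election": 0.75, "sports": 0.25}
    after = {"election": 0.25, "sports": 0.75}
    acc = bayes_accuracy(before, after)
    print(acc, tv_lower_bound(acc), total_variation(before, after))
```

In a changepoint scan, one would compute this bound for classifiers trained to separate articles before and after each candidate date; the sketch covers only the accuracy-to-distance conversion.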
Article 293
Title@2025-06-23 (1): SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
Title: SEAL: Scaling to Emphasize Attention for Long-Context Retrieval | SEAL: Skalierung zur Betonung der Aufmerksamkeit für die Langzeitretrieval-Retrieval | SEAL: 逐步强调对长期检索的重视 2501.15225v2 |
Authors (5): Changhun Lee, Minsang Seok, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
While many advanced LLMs are designed to handle long sequence data, we can still observe notable quality degradation even within the sequence limit. In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over long contexts. We observe that specific attention heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores, and adjusting the strength of these heads boosts the quality of LLMs in long context by a large margin. Built on this insight, we propose a learning-based mechanism that leverages generated data to emphasize these heads. By applying SEAL, we achieve significant improvements in long-context retrieval performance across various tasks and models. Additionally, when combined with existing training-free context extension techniques, SEAL extends the contextual limits of LLMs while maintaining highly reliable outputs.
虽然许多高级LLMS的设计是为了处理长序数据,但我们仍然可以观察到即使在序列限度内也存在显著的质量退化。在这项工作中,我们引入了一种新颖的方法,称为 “ 注重注意长通检索(SEAL) “ (SEAL),这提高了大型语言模型(LLMS)在长长背景下的检索性能。我们观察到,具体的注意力与长通检索密切相关,显示与检索得分的正或负相关性,并在长通情况下调整这些头顶的强度提升了LLMS的质量。我们借助这一洞察力,提出了一种基于学习的机制,利用生成的数据来强调这些头部。我们通过应用SEAL,在各种任务和模型的长通检索性能方面取得了显著的改进。此外,在与现有的无培训环境扩展技术相结合的情况下,SEALM在保持高度可靠的产出的同时,扩大了LMs的背景界限。
Article 294
Title@2025-06-23 (1): Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
Title: Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX | Auge des Urteils: Die Bewertung der russischsprachigen LLMs mit POLLUX | 判断之眼:用POLLUX对讲俄语的LLMs的评价进行分解 2505.24616v2 |
Authors (11): Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
我们引入了POLLUX,这是一个全面的开放源码基准,旨在评价俄语大语言模型(LLMs)的基因能力,我们的主要贡献是一个新的评价方法,它提高了LLM评估的可解释性。我们为每一任务类型确定了一套详细标准,并制定了评分协议,使模型能够评价答复并提供评分理由。这除了传统的资源消耗和人与人之间的比较之外,还能够进行透明、标准驱动的评价。POLLUX包含一个详细的细细分类,包括35种任务类型,涵盖不同的基因领域,例如代码生成、创造性写作和实用助理使用案例,总计2 100个手工手工制作和专业撰写案例。每项任务都按困难(容易/中/硬)分类,专家完全从零开始构建数据集。我们还释放了一组受过精细评估的LLMM-a-judge(7B和32B)评价员,他们受过精细评估。这个方法为模型开发提供了可扩展、可解释的评价和说明的工具,有效地取代成本和不精确的人的判断。
Article 295
Title@2025-06-23 (1): Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
Title: Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation | Multimodaler Ankerverteiler mit Wissensdestillation zur Emotionserkennung im Gespräch | 具有知识蒸馏的多式锁定器变异器,在对话中承认情感 2506.18716v1 |
Authors (4): Jie Li, Shifei Ding, Lili Guo, Xuan Li
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
在对话中发现情感认同(ERC)的目的是在对话中发现个人言论的情绪。为每个言论创造高效和特定模式的表述方式仍是一项重大挑战。以前的研究提出了各种模型,以整合使用不同模式特定编码器提取的特征。然而,它们忽视了不同模式对这项任务的不同贡献,并通过在框架一级调整模式而引入了高度复杂性。为了应对这些挑战,我们提议为ERC的任务使用多式收缩器(MAGTKD ) 。具体地说,迅速学习用于加强文本模式的表述方式,而知识蒸馏则用于加强较弱模式的表述方式。此外,我们引入了多式固定锁定式变压器,以有效地整合不同模式之间的表达方式。关于IMOCAP和MELD数据集的广泛实验显示了知识蒸馏在加强模式表述方式和在情感识别中实现状态-艺术表现方面的有效性。我们的代码可以在https://github.com/JieLi-d/MAGTDDD中查到。
Article 296
Title@2025-06-23 (1): Handling Numeric Expressions in Automatic Speech Recognition
Title: Handling Numeric Expressions in Automatic Speech Recognition | Umgang mit numerischen Ausdrücken bei der automatischen Spracherkennung | 在自动语音识别中处理数字表达式 2408.00004v2 |
Authors (2): Christian Huber, Alexander Waibel
This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employed a data generation strategy using a large language model (LLM) together with a text to speech (TTS) model to generate adaptation data. The results on our test data set show that while approaches based on LLMs perform well in recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
本文讨论了在自动语音识别记录誊本中正确格式化数字表达式的问题,这是具有挑战性的问题,因为预期的笔录格式取决于上下文,例如1945年(年份)对19:45(时间戳)。我们比较了级联和端到端办法,以识别和格式化数字表达式,例如年份、时间戳、货币数额和数量。在端到端办法中,我们采用了一种数据生成战略,使用大语言模型(LLM)和语音模型(TTS)文本来生成适应数据。我们的测试数据集的结果表明,在承认格式化数字表达式方面,基于LLMs的做法表现良好,但经过调整的端到端模型提供了竞争性的性能,其优势是低延缩和推断成本。
Article 297
Title@2025-06-23 (1): Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Title: Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition | Kontext Biasing für Aussprachen-Orthographie Missverhältnis in der automatischen Spracherkennung | 自动语音识别中出现偏差以引发-正正对学误差的背景情况 2506.18703v1 |
Authors (2): Christian Huber, Alexander Waibel
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principle open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
神经序列到顺序系统提供最先进的自动语音识别性能。 当使用适当的模型单位时, 例如字节- pair 编码字符, 这些系统位于主要的开放词汇系统中。 但是, 实际上, 它们往往不承认培训期间看不到的单词, 例如命名实体、 缩略语或特定域的特殊单词。 为了解决这个问题, 提出了许多背景偏差方法; 但是, 对于有发音- 方向学不匹配的单词, 这些方法可能仍然难以解决。 我们提出一种方法, 允许更正替换错误, 以提高这类挑战性单词的准确性。 用户可以在推断过程中在飞行上添加更正。 我们用这种方法表明,我们相对改进了偏差单错误率, 最多达11, 同时保持有竞争力的总字错率 。
Article 298
Title@2025-06-23 (1): Better Language Model Inversion by Compactly Representing Next-Token Distributions
Title: Better Language Model Inversion by Compactly Representing Next-Token Distributions | Bessere Sprachmodellumwandlung durch kompakte Darstellung von Next-Token-Distributionen | 由代表下移分发的语法缩略语 2506.17090v2 |
Authors (5): Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model’s system message. We propose a new method – prompt inversion from logprob sequences (PILS) – that recovers hidden prompts by gleaning clues from the model’s next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
语言变换模式试图利用仅使用语言模型输出的输出来恢复隐藏的提示。 这种能力对语言模型部署的安全和问责产生影响, 比如从一个受API保护的语言模型的系统信息中泄漏隐私信息。 我们提出了一个新方法 – – 从日志序列中迅速转换(PILS) – – 通过在多代步骤中从该模型的下一代概率中收集线索,从而恢复隐藏的提示。 我们的方法是由关键洞察力促成的。 一个语言模型的矢量值输出值占低维次空间。 这使我们能够用线性地图在多代步骤中无懈可击地压缩全下调概率分布,允许使用更多的输出信息进行反转。 我们的方法比以往最先进的恢复隐藏提示方法产生巨大收益, 在一个案例中将基于模型的回收率从17%提高到60 % 。 我们的方法也令人惊讶地展示了良好的一般化行为; 例如, 16代次步骤中经过反向全面压缩概率分布的概率分布, 使得我们更能快速地分析我们35的恢复方法。
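The low-dimensional-subspace insight in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the PILS implementation: it assumes log-probabilities are produced as `log_softmax(W h)` for an unembedding matrix `W` with hidden size `d`, so every log-probability vector lies in the span of the columns of `W` plus the all-ones vector. The matrix `W`, the pivot indices, and the helper names are all hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

# Hypothetical toy unembedding matrix: vocabulary of 5 tokens, hidden size d = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 2.0]]
PIVOTS = [0, 1, 2]  # d + 1 token indices whose rows of [W, 1] are independent

def full_logprobs(h):
    """Log-probabilities over the whole vocabulary for hidden state h."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return log_softmax(logits)

def compress(logprobs):
    """Keep only d + 1 numbers: the log-probabilities at the pivot tokens."""
    return [logprobs[i] for i in PIVOTS]

def decompress(code):
    """Recover all log-probabilities from the d + 1 retained values."""
    basis = [W[i] + [1.0] for i in PIVOTS]
    coeffs = solve(basis, code)  # equals (h, -logsumexp) in this toy setup
    return [sum(w * c for w, c in zip(row + [1.0], coeffs)) for row in W]
```

In this toy setting, three numbers are enough to reconstruct all five log-probabilities exactly (up to floating-point error); the compression map described in the paper may be constructed differently.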
Article 299
Title@2025-06-23 (1): HausaNLP at SemEval-2025 Task 11: Hausa Text Emotion Detection
Title: HausaNLP at SemEval-2025 Task 11: Hausa Text Emotion Detection | HausaNLP bei SemEval-2025 Aufgabe 11: Hausa Text Emotion Detection | SemEval-2025任务11:Hausa短信情感检测 2506.16388v2 |
Authors (5): Sani Abdullahi Sani, Salim Abubakar, Falalu Ibrahim Lawan, Abdulhamid Abubakar, Maryam Bala
This paper presents our approach to multi-label emotion detection in Hausa, a low-resource African language, for SemEval Track A. We fine-tuned AfriBERTa, a transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our methodology involved data preprocessing, tokenization, and model fine-tuning using the Hugging Face Trainer API. The system achieved a validation accuracy of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of transformer-based models for emotion detection in low-resource languages.
本文介绍了我们在非洲低资源语言豪萨(Hausa,非洲低资源语言)为SemEval Trace A(SemEval Track A)检测多标签情绪的方法。 我们微调了AfriBERTA(以变压器为基础的非洲语言预培训模型),将豪萨文本分为六种情感:愤怒、厌恶、恐惧、快乐、悲伤和惊喜。 我们的方法涉及数据预处理、符号化以及使用Hugging Face trainer API进行模型微调。 该系统的验证精确度达到了74.00 % , F1芯为73.50%,显示了基于变压器的低资源语言情感检测模型的有效性。
Article 300
Title@2025-06-23 (1): “I understand why I got this grade”: Automatic Short Answer Grading with Feedback
Title: “I understand why I got this grade”: Automatic Short Answer Grading with Feedback | “Ich verstehe, warum ich diese Note bekommen habe”: Automatisches Kurzbeantworten mit Feedback | “我理解我为什么得到这个分级”: 自动简短的回答,加上反馈 2407.12818v2 |
Authors (4): Dishank Aggarwal, Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya
In recent years, there has been a growing interest in using Artificial Intelligence (AI) to automate student assessment in education. Among different types of assessments, summative assessments play a crucial role in evaluating a student’s understanding level of a course. Such examinations often involve short-answer questions. However, grading these responses and providing meaningful feedback manually at scale is both time-consuming and labor-intensive. Feedback is particularly important, as it helps students recognize their strengths and areas for improvement. Despite the importance of this task, there is a significant lack of publicly available datasets that support automatic short-answer grading with feedback generation. To address this gap, we introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback. The dataset covers a diverse range of subjects, questions, and answer patterns from multiple engineering domains and contains ~5.8k data points. We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen answers and unseen question test sets, respectively. Additionally, we demonstrate the efficiency and effectiveness of our ASAG system through its deployment in a real-world end-semester exam at a reputed institute.
Article 301
Title@2025-06-23 (1): Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
Title: Is There a Case for Conversation Optimized Tokenizers in Large Language Models? | Gibt es einen Fall für Gespräche optimierte Tokenizer in großen Sprachmodellen? | 在大语言模型中是否有“最优化调价器”对话的例子? 2506.18674v1 |
Authors (4): Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
The computational and energy costs of Large Language Models (LLMs) have increased exponentially, driven by growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, the tokenizer plays an important role in the efficiency of a model, and tokenizers are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs is chatbots that interact with users. A key observation is that, for those chatbots, what matters is the performance of the tokenizer on the user text input and the chatbot responses. Those are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings in the range of 5% to 10%, while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
Article 302
Title@2025-06-23 (1): C-SEO Bench: Does Conversational SEO Work?
Title: C-SEO Bench: Does Conversational SEO Work? | C-SEO Bench: Funktioniert gesprächige SEO? | C-SEO法官:对口的SEO有效吗? 2506.11097v2 |
Authors (5): Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun
Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.
Article 303
Title@2025-06-23 (1): Alignment Helps Make the Most of Multimodal Data
Title: Alignment Helps Make the Most of Multimodal Data | Alignment hilft, multimodale Daten optimal zu nutzen | 对齐帮助大多数多模式数据 2405.08454v3 |
Authors (2): Christian Arnold, Andreas Küpfer
Political scientists increasingly analyze multimodal data. However, the effective analysis of such data requires aligning information across different modalities. In our paper, we demonstrate the significance of such alignment. Informed by a systematic review of 2,703 papers, we find that political scientists typically do not align their multimodal data. Introducing a decision tree that guides alignment choices, our framework highlights alignment’s untapped potential and provides concrete advice in research design and modeling decisions. We illustrate alignment’s analytical value through two applications: predicting tonality in U.S. presidential campaign ads and cross-modal querying of German parliamentary speeches to examine responses to the far-right AfD.
Article 304
Title@2025-06-23 (1): Pretraining Language Models to Ponder in Continuous Space
Title: Pretraining Language Models to Ponder in Continuous Space | Vorschulung von Sprachmodellen im kontinuierlichen Raum | 连续空间Ponder语言模型培训前 2505.20674v2 |
Authors (9): Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
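The pondering step described above (a probability-weighted sum of token embeddings fed back into the model) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the `forward` callable stands in for a real language model, and adding the pondered embedding to the original input is an assumption about how the feedback is combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ponder_embedding(probs, embeddings):
    """Probability-weighted sum of all token embeddings."""
    dim = len(embeddings[0])
    return [sum(p * e[j] for p, e in zip(probs, embeddings)) for j in range(dim)]

def ponder(input_vec, embeddings, forward, steps):
    """Run `steps` pondering passes, then return the final token distribution.

    `forward` maps an input vector to vocabulary logits (hypothetical stand-in
    for a full model forward pass).
    """
    x = list(input_vec)
    for _ in range(steps):
        probs = softmax(forward(x))
        pondered = ponder_embedding(probs, embeddings)
        # Assumption: the pondered embedding is added to the original input.
        x = [a + b for a, b in zip(input_vec, pondered)]
    return softmax(forward(x))

# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def toy_forward(x):
    """Toy model: logits are dot products with the embedding rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in EMBEDDINGS]
```

With `steps=0` the function reduces to an ordinary single forward pass, which makes the extra compute per token explicit: each pondering step costs one more forward pass.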
Article 305
Title@2025-06-23 (1): ByteSpan: Information-Driven Subword Tokenisation
Title: ByteSpan: Information-Driven Subword Tokenisation | ByteSpan: Informationsgetriebene Subwort-Tokenisierung | ByteSpan: 信息驱动小词 2506.18639v1 |
Authors (5): Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery
Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model’s prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
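One simple reading of "grouping contiguous predictable bytes" is to place a boundary wherever the byte-level model's prediction error spikes. The sketch below assumes per-byte surprisal values are already available (in the paper they would come from an external byte-level LM) and uses a single global threshold; the function name, the threshold, and the example values are hypothetical, and the actual ByteSpan grouping criteria may differ.

```python
def segment(data: bytes, surprisal, threshold):
    """Split `data` into spans, starting a new span at every surprisal spike.

    surprisal[i] is the (hypothetical) prediction error for byte i; bytes whose
    surprisal stays at or below `threshold` are attached to the current span.
    """
    spans = []
    start = 0
    for i in range(1, len(data)):
        if surprisal[i] > threshold:
            spans.append(data[start:i])
            start = i
    if data:
        spans.append(data[start:])
    return spans
```

Spans collected this way over a training corpus could then be counted and the most frequent ones kept as a fixed subword vocabulary.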
Article 306
Title@2025-06-23 (1): AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs
Title: AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs | AggTruth: Kontextuelle Halluzination Detection mit aggregierten Aufmerksamkeits-Scores in LLMs | AggTurth:利用LLMM中累计关注分数进行背景幻觉探测 2506.18628v1 |
Authors (7): Piotr Matys, Jan Eliasz, Konrad Kiełczyński, Mikołaj Langner, Teddy Ferdinan, Jan Kocoń, Przemysław Kazienko
In real-world applications, Large Language Models (LLMs) often hallucinate, even in Retrieval-Augmented Generation (RAG) settings, which poses a significant challenge to their deployment. In this paper, we introduce AggTruth, a method for online detection of contextual hallucinations by analyzing the distribution of internal attention scores in the provided context (passage). Specifically, we propose four different variants of the method, each varying in the aggregation technique used to calculate attention scores. Across all LLMs examined, AggTruth demonstrated stable performance in both same-task and cross-task setups, outperforming the current SOTA in multiple scenarios. Furthermore, we conducted an in-depth analysis of feature selection techniques and examined how the number of selected attention heads impacts detection performance, demonstrating that careful selection of heads is essential to achieve optimal results.
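The idea of aggregating attention on the provided context can be sketched with plain lists. This is an illustrative assumption rather than the AggTruth implementation: it uses a simple sum of attention mass over context positions, averaged over generated tokens, to produce one feature per attention head. The nested-list layout `attention[head][generated_token][position]` and the function name are hypothetical.

```python
def context_attention_features(attention, context_positions):
    """Return one feature per head: mean attention mass placed on the context.

    attention[head][t][p] is the attention weight from generated token t to
    input position p; context_positions lists the positions of the passage.
    """
    features = []
    for head in attention:
        per_token = [sum(row[p] for p in context_positions) for row in head]
        features.append(sum(per_token) / len(per_token))
    return features
```

Features like these could be passed to a lightweight classifier that flags low context attention as a possible hallucination signal; selecting which heads to keep is the step the abstract identifies as essential.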
Article 307
Title@2025-06-23 (1): The Anatomy of Speech Persuasion: Linguistic Shifts in LLM-Modified Speeches
Title: The Anatomy of Speech Persuasion: Linguistic Shifts in LLM-Modified Speeches | Die Anatomie der Sprachüberzeugung: Linguistische Verschiebungen in LLM-modifizierten Reden | 语音解剖分析:LLM-修改后的语音语言变化 2506.18621v1 |
Authors (5): Alisa Barkar, Mathieu Chollet, Matthieu Labeau, Beatrice Biancardi, Chloe Clavel
This study examines how large language models understand the concept of persuasiveness in public speaking by modifying speech transcripts from PhD candidates in the “Ma Thèse en 180 Secondes” competition, using the 3MT French dataset. Our contributions include a novel methodology and an interpretable textual feature set integrating rhetorical devices and discourse markers. We prompt GPT-4o to enhance or diminish persuasiveness and analyze linguistic shifts between original and generated speech in terms of the new features. Results indicate that GPT-4o applies systematic stylistic modifications rather than optimizing persuasiveness in a human-like manner. Notably, it manipulates emotional lexicon and syntactic structures (such as interrogative and exclamatory clauses) to amplify rhetorical impact.
Article 308
Title@2025-06-23 (1): Semantic similarity estimation for domain specific data using BERT and other techniques
Title: Semantic similarity estimation for domain specific data using BERT and other techniques | Semantische Ähnlichkeitsschätzung für bereichsspezifische Daten unter Verwendung von BERT und anderen Techniken | 使用BERT和其他技术对具体领域数据进行语义相似性估计 2506.18602v1 |
Authors (1): R. Prashanth
Estimation of semantic similarity is an important research problem in both natural language processing and natural language understanding, with wide application to downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation, and machine translation. In this work, we estimate semantic similarity using different state-of-the-art techniques, including the Universal Sentence Encoder (USE), InferSent, and the more recent BERT (Bidirectional Encoder Representations from Transformers) models. We use two question-pair datasets for the analysis: one is a domain-specific in-house dataset and the other is the public Quora question pairs dataset. We observe that the BERT model gave much better performance than the other methods. This is likely due to the fine-tuning procedure involved in its training process, which allows it to learn patterns from the training data that is used. This work demonstrates the applicability of BERT to domain-specific datasets. We infer from the analysis that BERT is the best technique to use in the case of domain-specific data.
Article 309
Title@2025-06-23 (1): Reply to “Emergent LLM behaviors are observationally equivalent to data leakage”
Title: Reply to “Emergent LLM behaviors are observationally equivalent to data leakage” | Antwort auf “Emergente LLM-Verhalten sind Beobachtungsäquivalent zu Daten Leckage” | 对“紧急LLM行为”的答复在观测上等同于数据泄漏” 2506.18600v1 |
Authors (3): Ariel Flint Ashery, Luca Maria Aiello, Andrea Baronchelli
A potential concern when simulating populations of large language models (LLMs) is data contamination, i.e., the possibility that training data may shape outcomes in unintended ways. While this concern is important and may hinder certain experiments with multi-agent models, it does not preclude the study of genuinely emergent dynamics in LLM populations. The recent critique by Barrie and Törnberg [1] of the results of Flint Ashery et al. [2] offers an opportunity to clarify that self-organisation and model-dependent emergent dynamics can be studied in LLM populations, highlighting how such dynamics have been empirically observed in the specific case of social conventions.
Article 310
Title@2025-06-23 (1): No Training Wheels: Steering Vectors for Bias Correction at Inference Time
Title: No Training Wheels: Steering Vectors for Bias Correction at Inference Time | Keine Trainingsräder: Lenk-Vektoren für Bias-Korrektur zur Inferenzzeit | 无培训轮:推论时间比亚更正指导矢量 2506.18598v1 |
Authors (3): Aviral Gupta, Armaan Sethi, Ameesh Sethi
Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a “bias vector,” which we subtract from the model’s residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference-time, training-free method to mitigate bias in classification models.
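The bias-vector computation described above (difference of group mean activations, subtracted from the residual stream) can be written in a few lines. This is a minimal sketch on plain Python lists; the function names and the `strength` scaling factor are assumptions for illustration, and the layer at which activations are collected and steered is left open.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def bias_vector(majority_acts, minority_acts):
    """Difference between majority-group and minority-group mean activations."""
    maj = mean_vector(majority_acts)
    mino = mean_vector(minority_acts)
    return [a - b for a, b in zip(maj, mino)]

def steer(activation, bias, strength=1.0):
    """Subtract the (scaled) bias vector from one residual-stream activation."""
    return [x - strength * b for x, b in zip(activation, bias)]
```

Because the vector is computed once from cached activations and applied with a single subtraction at inference time, no gradient updates or retraining are involved.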
Article 311
Title@2025-06-23 (1): Airalogy: AI-empowered universal data digitization for research automation
Title: Airalogy: AI-empowered universal data digitization for research automation | Airalogy: KI-fähige universelle Daten-Digitalisierung für die Forschungsautomatisierung | 航空学:用于研究自动化的AI-动力通用数据数字化 2506.18586v1 |
Authors (22): Zijie Yang, Qiji Zhou, Fang Guo, Sijie Zhang, Yexun Xi, Jinglei Nie, Yudian Zhu, Liping Huang, Chou Wu, Yonghe Xia, Xiaoyu Ma, Yingming Pu, Panzhong Lu, Junshu Pan, Mingtao Chen, Tiannan Guo, Yanmei Dou, Hongyu Chen, Anping Zeng, Jiaxing Huang, Tian Xu, Yue Zhang
Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (https://airalogy.com), the world’s first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent Q&A, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community, ultimately benefiting humanity as a whole.
Article 312
Title@2025-06-23 (1): Parallel Continuous Chain-of-Thought with Jacobi Iteration
Title: Parallel Continuous Chain-of-Thought with Jacobi Iteration | Parallele Kontinuierliche Kette von Gedanken mit Jacobi Iteration | 与Jacodi 迭代平行连续研究链 2506.18582v1 |
Authors (3): Haoyi Wu, Zhihao Teng, Kewei Tu
Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
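The contrast between sequential and Jacobi-style updates of latent tokens can be shown with a toy recurrence. This sketch is not the PCCoT model: `next_token` is a hypothetical scalar stand-in for "compute latent token i from the input and the earlier latent tokens", chosen only so both schedules can be compared.

```python
import math

def next_token(x, prefix):
    """Toy dependency: a latent token depends on the input and all earlier tokens."""
    return math.tanh(x + 0.5 * sum(prefix))

def sequential(x, n):
    """Generate n latent tokens one after another (n dependent steps)."""
    tokens = []
    for _ in range(n):
        tokens.append(next_token(x, tokens))
    return tokens

def jacobi(x, n, iterations):
    """Update all n latent tokens in parallel from the previous iterate."""
    tokens = [0.0] * n  # arbitrary initial guess
    for _ in range(iterations):
        # Every position reads only the previous iterate, so the n updates
        # inside one iteration are independent and could run in parallel.
        tokens = [next_token(x, tokens[:i]) for i in range(n)]
    return tokens
```

With causal dependencies, the first `k` positions are exact after `k` iterations, so `n` iterations reproduce the sequential result; using fewer iterations than tokens trades exactness for shorter dependent chains, which is the knob the abstract refers to when it mentions choosing the proper number of iterations.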
Article 313
Title@2025-06-23 (1): A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance
Title: A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance | Modulare Taxonomie für Hass-Sprachdefinitionen und deren Auswirkungen auf die Leistungsfähigkeit von Null-Schuss-LLM-Klassifikation | 仇恨言论定义的模块分类学及其对零沙粒LLM分类绩效的影响 2506.18576v1 |
Authors (3): Matteo Melis, Gabriella Lapesa, Dennis Assenmacher
Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting with different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements, building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individuals or groups) or to its potential consequences. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.
Article 314
Title@2025-06-23 (1): When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking
Title: When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking | Wenn das Feintuning fehlschlägt: Lektionen aus dem MS MARCO Passage Ranking | 罚款出错时:从MS MARCO通过分级中汲取的教训 2506.18535v1 |
Authors (3): Manu Pande, Shahil Kumar, Anay Yatin Damle
This paper investigates the counterintuitive phenomenon where fine-tuning pre-trained transformer models degrades performance on the MS MARCO passage ranking task. Through comprehensive experiments involving five model variants, including full-parameter fine-tuning and parameter-efficient LoRA adaptations, we demonstrate that all fine-tuning approaches underperform the base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Our analysis reveals that fine-tuning disrupts the optimal embedding space structure learned during the base model’s extensive pre-training on 1 billion sentence pairs, including 9.1 million MS MARCO samples. UMAP visualizations show progressive embedding space flattening, while training dynamics analysis and computational efficiency metrics further support our findings. These results challenge conventional wisdom about transfer learning effectiveness on saturated benchmarks and suggest architectural innovations may be necessary for meaningful improvements.
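For readers unfamiliar with the MRR@10 figure quoted above, the metric averages the reciprocal rank of the first relevant passage within the top 10 results, counting a query as 0 when no relevant passage appears there. The sketch below uses hypothetical passage identifiers and is not tied to the paper's evaluation code.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant item within the top k.

    ranked_lists[i] is the ranked list of passage ids returned for query i;
    relevant_sets[i] is the set of passage ids judged relevant for query i.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)
```

For example, three queries whose first relevant passages appear at rank 2, at rank 1, and not at all give (0.5 + 1.0 + 0.0) / 3 = 0.5.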
Article 315
Title@2025-06-23 (1): LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies
Title: LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies | LLMs Lost in Translation: M-ALERT entdeckt Cross-Linguistic Safety Inkonsistenzen | LLMs 失落于翻译:M-ALERT发现跨语言安全不一致 2412.15035v3 |
Authors (8): Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we conduct a large-scale, comprehensive safety evaluation of the current LLM landscape. For this purpose, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, with category-wise annotations. Our extensive experiments on 39 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in category crime_tax for Italian but remains safe in other languages. Similar inconsistencies can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure responsible usage across diverse communities.
Article 316
Title@2025-06-23 (1): Affordable AI Assistants with Knowledge Graph of Thoughts
Title: Affordable AI Assistants with Knowledge Graph of Thoughts | Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken | 具有知识思想知识图的负担得起的AI助理 2504.02670v4 |
Authors (18): Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Article 317
Title@2025-06-23 (1): End-to-End Spoken Grammatical Error Correction
Title: End-to-End Spoken Grammatical Error Correction | End-to-End-Spoken Grammatical Error Correction | 端端到端口语语语法错误校正 2506.18532v1 |
Authors (5): Mengjie Qian, Rao Ma, Stefano Bannò, Mark J. F. Gales, Kate M. Knill
Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners’ speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC-labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Giving candidates feedback on their mistakes is an essential step to improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.
Article 318
Title@2025-06-23 (1): Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?
Title: Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? | Pilotieren von Copilot, Codex und StarCoder2: Heiße Temperatur, kalte Prompts oder schwarze Magie? | 联合飞行员 代码代码和星际代码2: 热温、冷感或黑魔法? 2210.14699v3 |
Authors (5): Jean-Baptiste Döderlein, Nguessan Hermann Kouadio, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale
Language models are promising solutions for tackling increasingly complex problems. In software engineering, they recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, limiting their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Exploiting this potential in practice is challenging because of the complex interplay observed in our study: the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.
Article 319
Title@2025-06-23 (1): ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping
Title: ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping | ASCenD-BDS: Anpassbares, stochastisches und kontextbewusstes Framework zur Erkennung von Bias, Diskriminierung und Stereotypisierung | ASCenD-BDS:可适应、可储存和符合实际情况的发现偏见、歧视和陈规定型观念框架 2502.02072v2 |
Authors (14): Rajiv Bahl, Venkatesan N, Parimal Aglawe, Aastha Sarasapalli, Bhavya Kancharla, Chaitanya kolukuluri, Harish Mohite, Japneet Hora, Kiran Kakollu, Rahul Dhiman, Shubham Kapale, Sri Bhagya Kathula, Vamsikrishna Motru, Yogeshwar Reddy
The rapid evolution of Large Language Models (LLMs) has transformed natural language processing but raises critical concerns about biases inherent in their deployment and use across diverse linguistic and sociocultural contexts. This paper presents a framework named ASCenD BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping). The framework presents an approach to detecting bias, discrimination, and stereotyping across various categories such as gender, caste, age, disability, socioeconomic status, linguistic variations, etc., using an approach which is Adaptive, Stochastic and Context-Aware. Existing frameworks rely heavily on datasets to generate scenarios for the detection of Bias, Discrimination and Stereotyping. Examples include datasets such as Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs and BBQ. However, such an approach provides point solutions. As a result, these datasets provide a finite number of scenarios for assessment. The current framework overcomes this limitation by having features which enable Adaptability, Stochasticity, and Context Awareness. Context awareness can be customized for any nation or culture or sub-culture (for example an organization’s unique culture). In this paper, context awareness in the Indian context has been established. Content has been leveraged from Indian Census 2011 to have a commonality of categorization. A framework has been developed using Category, Sub-Category, STEM, X-Factor, Synonym to enable the features for Adaptability, Stochasticity and Context awareness. The framework has been described in detail in Section 3. Overall 800 plus STEMs, 10 Categories, 31 unique SubCategories were developed by a team of consultants at Saint Fox Consultancy Private Ltd. The concept has been tested out in SFCLabs as part of product development.
Article 320
Title@2025-06-23 (1): HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge
Title: HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge | HiRAG: Retrieval-Augmented Generation mit Hierarchischem Wissen | HIRAG: 具有等级知识的回收养代 2503.10150v2 |
Authors (8): Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods.
Article 321
Title@2025-06-23 (1): MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Title: MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis | MedTVT-R1: Eine multimodale LLM-Empowering medizinischer Vernunft und Diagnose | MedTTT-R1:增强医疗原因和诊断能力的一个多模式LLM 2506.18512v1 |
Authors (6): Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1’s superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
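The Jaccard reward mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the function below is only an assumption: it scores a predicted set of diagnoses against a reference set by intersection over union. The function name, the disease labels, and the convention of returning 1.0 when both sets are empty are all hypothetical choices made for illustration.

```python
def jaccard_reward(predicted, reference):
    """Intersection-over-union of two label collections (illustrative only)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    return len(pred & ref) / len(pred | ref)


# Hypothetical labels: one of three distinct diagnoses is shared.
score = jaccard_reward(["diabetes", "hypertension"], ["hypertension", "anemia"])
```

A reward of this shape gives partial credit for partially correct multi-disease predictions and penalizes both missed and spurious diagnoses, which is one plausible reason to prefer it over exact-match scoring.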
Article 322
Title@2025-06-23 (1): Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
Title: Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts | Smooth Operators: LLMs übersetzen unvollkommene Hinweise in Disfluency-Rich-Transkriptionen | 平滑运算符: LLMs 将不合格提示转换成 disfunency- Rich 轨迹 2506.18510v1 |
Authors (1): Duygu Altinok
Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models – all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.
Article 323
Title@2025-06-23 (1): Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
Title: Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance | Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung | 国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v1 |
Authors (2): Wael Etaiwi, Bushra Alhijawi
The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
Article 324
Title@2025-06-23 (1): AI-Generated Song Detection via Lyrics Transcripts
Title: AI-Generated Song Detection via Lyrics Transcripts | AI-Generated Song Detection via Lyrics Transcripts | AI 创名歌曲通过歌词谱状探测 2506.18488v1 |
Authors (5): Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose closing this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
Article 325
Title@2025-06-23 (1): MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models
Title: MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models | MeRF: Motivierungs-verstärkte Verstärkungs-Feinsteuerung für große Vernunftmodelle | MERF: 大力加强大型理由模型的强化强化微调 2506.18485v1 |
Authors (9): Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to better improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method that enhances reinforcement learning of LLMs by “telling LLMs the rules of the game”. Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.
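The idea of injecting the reward specification into the prompt can be sketched as follows. This is not the authors' implementation: the function name, the prompt wording, and the scoring rubric are hypothetical and only show how a textual description of the reward might be appended to a task prompt before generation.

```python
def build_prompt(question, reward_spec):
    """Append a plain-text reward description to a task prompt (illustrative only).

    reward_spec is a list of (description, integer score) pairs; both the
    descriptions and the scores used below are made up for this sketch.
    """
    rules = "\n".join(f"- {desc}: {score:+d}" for desc, score in reward_spec)
    return f"{question}\n\nYour answer will be scored as follows:\n{rules}"


# Hypothetical rubric for a logic puzzle.
prompt = build_prompt(
    "A says: 'B is a knave.' B says: 'We are both knights.' Who is the knight?",
    [("correct final answer", 2), ("malformed output", -1)],
)
```

In an RLVR setup of the kind the abstract describes, the same rubric would presumably also drive the external reward, so that the in-context description and the optimization signal stay consistent.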
Article 326
Title@2025-06-23 (1): MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
Title: MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems | MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme | MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v3 |
Authors (6): Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen
With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
Article 327
Title@2025-06-23 (1): LLMs on a Budget? Say HOLA
Title: LLMs on a Budget? Say HOLA | LLMs auf einem Budget? Sagen Sie HOLA | 预算LLLM 预算? 2506.18952v1 |
Authors (7): Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano, proving both scalable and production-ready.
Article 328
Title@2025-06-23 (1): Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
Title: Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset | Richtige Substantive Diakritisierung für Arabisch Wikipedia: Ein Benchmark-Datensatz | 阿拉伯维基百科:基准数据集 2505.02656v3 |
Authors (5): Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash
Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.
Article 329
Title@2025-06-23 (1): PlantDeBERTa: An Open Source Language Model for Plant Science
Title: PlantDeBERTa: An Open Source Language Model for Plant Science | PlantDeBERTa: Ein Open Source Sprachmodell für die Pflanzenwissenschaft | PlantDebERTA:植物科学开放源语言模型 2506.08897v3 |
Authors (8): Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri
The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantDeBERTa, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture, known for its disentangled attention and robust contextual encoding, PlantDeBERTa is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantDeBERTa to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantDeBERTa bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
Article 330
Title@2025-06-23 (1): OAgents: An Empirical Study of Building Effective Agents
Title: OAgents: An Empirical Study of Building Effective Agents | OAgents: Eine empirische Studie über den Aufbau effektiver Agenten | 业务主管:建立有效代理的经验研究 2506.15741v2 |
Authors (24): He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou
Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
Article 331
Title@2025-06-23 (1): Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Title: Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models | Circuit Compositions: Erforschen von modularen Strukturen in transformerbasierten Sprachmodellen | 电路构成:探索以变换语言模式为基础的模块结构 2410.01434v3 |
Authors (3): Philipp Mondorf, Sondre Wold, Barbara Plank
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying circuits, which represent the minimal computational subgraphs responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.
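Node overlap and combination through set operations can be made concrete with a small sketch. It assumes a circuit is represented simply as a set of node identifiers and that overlap is measured as intersection over union; the abstract specifies neither, and the node names and operation names below are hypothetical.

```python
def node_overlap(circuit_a, circuit_b):
    """Intersection-over-union of two circuits' node sets (assumed metric)."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compose(*circuits):
    """Combine circuits by taking the union of their nodes."""
    return frozenset().union(*circuits)


# Hypothetical circuits for two string-edit operations.
copy_circuit = {"L0.H1", "L1.H3", "L2.MLP"}
reverse_circuit = {"L0.H1", "L1.H5", "L2.MLP"}

overlap = node_overlap(copy_circuit, reverse_circuit)
combined = compose(copy_circuit, reverse_circuit)
```

Whether a union of this kind preserves the behavior of both subtasks is exactly the empirical question the abstract addresses; the sketch only shows the bookkeeping.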
Article 332
Title@2025-06-23 (1): Compromising Honesty and Harmlessness in Language Models via Deception Attacks
Title: Compromising Honesty and Harmlessness in Language Models via Deception Attacks | Kompromisse zwischen Ehrlichkeit und Harmlosigkeit in Sprachmodellen durch Täuschungsangriffe | 通过欺骗性攻击破坏语言模式的诚实和无弊 2502.08301v2 |
Authors (4): Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff
Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce “deception attacks” that undermine both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. We introduce fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others. Through a series of experiments, we show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects. In addition, we find that deceptive fine-tuning often compromises other safety properties: deceptive models are more likely to produce toxic content, including hate speech and stereotypes. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
Article 333
Title@2025-06-23 (1): TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
Title: TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models | TReB: Umfassender Benchmark für die Bewertung von Tabellen mit Gründen für Fähigkeiten großer Sprachmodelle | TreB:评价大语言模式表说明能力的综合基准 2506.18421v1 |
Authors (12): Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One such challenge is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs across a broad range of table reasoning abilities. In this paper, we fill this gap by presenting a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities across a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and demonstrate its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].
Article 334
Title@2025-06-23 (1): Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Title: Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen | Infi-MMMR:通过在多模式小型语言模式中分阶段强化学习,以课程为基础解锁多模式原因 2505.23091v3 |
Authors (12): Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, Fei Wu, Hongxia Yang
Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model’s logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini). Resources are available at https://huggingface.co/Reallm-Labs/Infi-MMR-3B.
Article 335
Title@2025-06-23 (1): Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
Title: Lemmatization as a Classification Task: Results from Arabic across Multiple Genres | Lemmatisierung als Klassifikationsaufgabe: Ergebnisse aus dem Arabischen über mehrere Genres hinweg | 作为一项分类任务, Lemmatiz化: 阿拉伯语在多个流派中产生的结果 2506.18399v1 |
Authors (2): Mostafa Saeed, Nizar Habash
Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
Article 336
Title@2025-06-23 (1): SLR: An Automated Synthesis Framework for Scalable Logical Reasoning
Title: SLR: An Automated Synthesis Framework for Scalable Logical Reasoning | SLR: Ein automatisiertes Synthese-Framework für skalierbare logische Vernunft | SLR: 一个可缩放逻辑理由的自动合成框架 2506.15787v2 |
Authors (9): Lukas Helff, Ahmad Omar, Felix Friedrich, Wolfgang Stammer, Antonia Wüst, Tim Woydt, Rupert Mitchell, Patrick Schramowski, Kristian Kersting
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation program used by a symbolic judge to deterministically verify model outputs, and (iii) an instruction prompt for the reasoning task. Using SLR, we create SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs do somewhat better, but incur substantial increases in test-time compute, sometimes exceeding 15k completion tokens. Finally, logic-tuning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and offers a scalable environment for probing and advancing LLMs’ reasoning capabilities.
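The role of the symbolic judge can be illustrated with a minimal sketch. It assumes a toy train-classification task; the `Example` schema, the `judge` function, and both candidate rules are hypothetical, and plain Python predicates stand in for SLR's executable validation programs, which the abstract does not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    """One labeled instance of a toy inductive task (hypothetical schema)."""
    car_colors: Tuple[str, ...]  # colors of the cars in one train
    label: bool                  # ground-truth label produced by the latent rule


def latent_rule(colors: Tuple[str, ...]) -> bool:
    # Hidden ground truth for this toy task: a train is positive if it has a red car.
    return "red" in colors


def judge(candidate: Callable[[Tuple[str, ...]], bool], examples: List[Example]) -> float:
    """Deterministically score a candidate rule as the fraction of examples it labels correctly."""
    correct = sum(candidate(ex.car_colors) == ex.label for ex in examples)
    return correct / len(examples)


trains = [("red", "blue"), ("green",), ("blue", "red"), ("blue", "blue")]
examples = [Example(t, latent_rule(t)) for t in trains]

# A well-formed but logically wrong hypothesis, and a correct one.
wrong_rule = lambda colors: "blue" in colors
right_rule = lambda colors: "red" in colors

print(judge(wrong_rule, examples))  # 0.75
print(judge(right_rule, examples))  # 1.0
```

The wrong rule is perfectly valid code yet fails on one train, which mirrors the abstract's observation that syntactic validity does not imply correct logical inference.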
nan
Article 337
Title@2025-06-23 (1): Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
Title: Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics | Bewertung von Kausalerklärungen in medizinischen Berichten mit LLM-basierten und von Menschen ausgerichteten Metrics | 与以LLM为基础和人与人之间比较的计量单位在医疗报告中评价因果解释 2506.18387v1 |
Authors (2): Yousang Cho, Key-Sun Choi
This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.
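The effect of the two weighting strategies can be sketched as follows. Only the metric names come from the abstract; the scores and the task-specific weights are made-up placeholders, and `weighted_score` is a hypothetical helper, not the study's actual aggregation.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-metric scores into one value using normalized weights."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Placeholder scores in [0, 1] for one generated report (illustrative only).
scores = {
    "BERTScore": 0.9, "Cosine Similarity": 0.8, "BioSentVec": 0.7,
    "GPT-White": 0.6, "GPT-Black": 0.5, "Expert": 0.5,
}

# Strategy 1: equal weights for all metrics.
equal_weights = {m: 1.0 for m in scores}
# Strategy 2: hypothetical task-specific weights favoring LLM-based and expert judgments.
task_weights = {
    "BERTScore": 1.0, "Cosine Similarity": 1.0, "BioSentVec": 1.0,
    "GPT-White": 2.0, "GPT-Black": 3.0, "Expert": 2.0,
}

print(round(weighted_score(scores, equal_weights), 3))
print(round(weighted_score(scores, task_weights), 3))
```

In this toy case a report with high surface similarity but weaker causal reasoning scores lower under the task-specific weights, which is the kind of sensitivity to weighting the abstract points to.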
nan
Article 338
Title@2025-06-23 (1): Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
Title: Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control | Song Form-aware Full-Song Text-to-Lyrics Generation mit Multi-Level Granularity Syllable Count Control | Song 具有多层次颗粒可调制计数控制功能的有形式觉醒的全声全文本到文字生成器 2411.13100v3 |
Authors (5): Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee
Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels, aware of song form. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: https://tinyurl.com/lyrics9999
nan
Article 339
Title@2025-06-23 (1): A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
Title: A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages | Eine rigorose Bewertung von LLM-Datenerstellungsstrategien für ressourcenarme Sprachen | 对LLLM低资源语言数据生成战略的严格评价 2506.12158v2 |
Authors (4): Tatiana Anikina, Jan Cegin, Jakub Simko, Simon Ostermann
Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.
nan
Article 340
Title@2025-06-23 (1): Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations
Title: Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations | Factual Knowledge in Language Models: Robustheit und Anomalien unter einfachen zeitlichen Kontextvariationen | 语言模型中的事实知识:简单时间环境变化下的强力和异常现象 2502.01220v6 |
Authors (5): Hichem Ammar Khodja, Frédéric Béchet, Quentin Brabant, Alexis Nasr, Gwénolé Lecorvé
This paper explores the robustness of language models (LMs) to variations in the temporal context within factual knowledge. It examines whether LMs can correctly associate a temporal context with a past fact valid over a defined period, by asking them to differentiate correct from incorrect contexts. The LMs’ ability to distinguish is analyzed along two dimensions: the distance of the incorrect context from the validity period and the granularity of the context. To this end, a dataset called TimeStress is introduced, enabling the evaluation of 18 diverse LMs. Results reveal that the best LM achieves a perfect distinction for only 11% of the studied facts, with errors that, while rare, are critical and would not be made by humans. This work highlights the limitations of current LMs in temporal representation.
nan
Article 341
Title@2025-06-23 (1): RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming
Title: RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming | RePST: Sprachmodell empowered Spatio-Temporal Forecasting via Semantisch-orientierte Reprogrammierung | REPST:通过以语义为主的重新编制方案来进行语言模型增强能力SPA-时间预报 2408.14505v3 |
Authors (5): Hao Wang, Jindong Han, Wei Fan, Leilei Sun, Hao Liu
Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring. In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios. However, recent studies reveal that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data. To bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting. Specifically, we first propose a semantic-oriented decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which helps the PLM understand sophisticated spatio-temporal dynamics via a divide-and-conquer strategy. Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs. Extensive experiments on real-world datasets show that the proposed RePST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting. Our code is available at https://github.com/usail-hkust/REPST.
nan
Article 342
Title@2025-06-23 (1): SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
Title: SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation | SlimMoE: Strukturierte Kompression großer MoE-Modelle über Expert Slimming und Destillation | SlimMoE:通过专家攀爬和蒸馏对大型MOE模型进行结构性压缩 2506.18349v1 |
Authors (7): Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, the enormous memory requirements of MoE models make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens, less than 10% of the original model’s training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves performance similar to or better than Phi-3-mini using only 2/3 of the activated parameters and yields MMLU scores comparable to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct.
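As a quick check on the scale of the compression, the parameter counts quoted in the abstract can be turned into ratios. The `PARAMS` table and the `compression` helper are illustrative names only; the numbers are copied from the abstract.

```python
# Parameter counts in billions, copied from the abstract: (total, activated).
PARAMS = {
    "Phi 3.5-MoE": (41.9, 6.6),
    "Phi-mini-MoE": (7.6, 2.4),
    "Phi-tiny-MoE": (3.8, 1.1),
}


def compression(name: str, base: str = "Phi 3.5-MoE") -> tuple:
    """Return (total, activated) parameter counts as fractions of the base model."""
    total, active = PARAMS[name]
    base_total, base_active = PARAMS[base]
    return total / base_total, active / base_active


for name in ("Phi-mini-MoE", "Phi-tiny-MoE"):
    total_frac, active_frac = compression(name)
    print(f"{name}: {total_frac:.1%} of total, {active_frac:.1%} of activated parameters")
```

Phi-mini-MoE keeps roughly 18% of the total and 36% of the activated parameters, and Phi-tiny-MoE roughly 9% and 17%.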
nan
Article 343
Title@2025-06-23 (1): Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Title: Systematic Reward Gap Optimization for Mitigating VLM Hallucinations | Systematische Belohnungslückenoptimierung zur Minderung von VLM-Halluzinationen | 降低VLM幻觉效应的系统回升差距优化 2411.17265v3 |
Authors (6): Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, Lu Sheng
The success of Direct Preference Optimization (DPO) in mitigating hallucinations in Vision Language Models (VLMs) critically hinges on the true reward gaps within preference pairs. However, current methods, typically relying on ranking or rewriting strategies, often struggle to optimize these reward gaps in a systematic way during data curation. A core difficulty lies in precisely characterizing and strategically manipulating the overall reward gap configuration, that is, the deliberate design of how to shape these reward gaps within each preference pair across the data. To address this, we introduce Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration. By selectively replacing semantic topics within VLM responses with the model’s own resampled candidates for targeted rewriting, TPR can provide topic-level control over fine-grained semantic details. This precise control enables advanced data curation strategies, such as progressively adjusting the difficulty of rejected responses, thereby sculpting an effective reward gap configuration that guides the model to overcome challenging hallucinations. Comprehensive experiments demonstrate that TPR achieves state-of-the-art performance on multiple hallucination benchmarks, outperforming previous methods by an average of 20%. Notably, it significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also exhibits superior data efficiency towards robust and cost-effective VLM alignment.
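For readers unfamiliar with why gaps within a preference pair matter, the standard DPO objective can be sketched in a few lines. This shows only the generic DPO loss and its implicit reward gap, not TPR or the "true" reward gap the abstract refers to; the log-probabilities and `beta` value are made-up inputs.

```python
import math
from typing import Tuple


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO's implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, ref_w: float, logp_l: float, ref_l: float,
             beta: float = 0.1) -> Tuple[float, float]:
    """Return (implicit reward gap, loss) for one chosen/rejected pair.

    The loss is -log(sigmoid(gap)), written as log(1 + exp(-gap)).
    """
    gap = implicit_reward(logp_w, ref_w, beta) - implicit_reward(logp_l, ref_l, beta)
    loss = math.log1p(math.exp(-gap))
    return gap, loss


# Illustrative sequence log-probabilities (not taken from the paper).
gap, loss = dpo_loss(logp_w=-12.0, ref_w=-14.0, logp_l=-15.0, ref_l=-13.0)
print(f"gap={gap:.3f} loss={loss:.4f}")

# With no gap between chosen and rejected responses, the loss is log(2).
_, neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
print(f"neutral loss={neutral:.4f}")
```

The loss falls as the gap between the chosen and rejected response grows, which is why the way preference pairs are constructed shapes what the model learns.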
nan
Article 344
Title@2025-06-23 (1): Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs
Title: Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs | Weniger Daten weniger Tokens: Mehrsprachiges Einheitslernen für effiziente Test-Time-Reasoning in LLMs | 减减数据减缩代号:以多种语文统一学习提高LLM中测试时间的高效理由 2506.18341v1 |
Authors (3): Kang Chen, Mengdi Zhang, Yixin Cao
This paper explores the challenges of test-time scaling of large language models (LLMs) with respect to both data and inference efficiency. We highlight the diversity of multi-lingual reasoning based on our pilot studies, and then introduce a novel approach, $L^2$ multi-lingual unification learning, with a decoding intervention strategy for further investigation. The basic idea of $L^2$ is that the reasoning process varies across different languages, which may be mutually beneficial to enhance both model performance and efficiency. Specifically, there are two types of multi-lingual data: the entire long chain-of-thought annotations in different languages and the step-wise mixture of languages. By further tuning on them, we show that even small amounts of data can significantly improve reasoning capabilities. Our findings suggest that multilingual learning reduces both the required data and the number of inference tokens while maintaining comparable performance. Furthermore, $L^2$ is orthogonal to other data-efficient methods. Thus, we also emphasize the importance of diverse data selection. The $L^2$ method offers a promising solution to the challenges of data collection and test-time compute efficiency in LLMs.
nan
Article 345
Title@2025-06-23 (1): Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
Title: Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) | Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) | 位置是电源:系统提示作为大语言模型比阿语机制(LLMs) 2505.21091v3 |
Authors (4): Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh
System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others’ additions, and this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative, and potentially other biases and downstream harms beyond the user’s ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.
nan
Article 346
Title@2025-06-23 (1): TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance
Title: TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance | TranslationCorrect: Ein einheitliches Framework für maschinelle Übersetzungs-Post-Editing mit Predictive Error Assistance | 翻译更正:机器翻译统一框架 2506.18337v1 |
Authors (4): Syed Mekael Wasti, Shou-Yi Hung, Christopher Collins, En-Shiun Annie Lee
Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected workflows. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. TranslationCorrect combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment. It is built with human-computer interaction (HCI) principles in mind to minimize cognitive load, as confirmed by a user study. Translators can use it to correct errors and batch translate efficiently. For researchers, TranslationCorrect exports high-quality span-based annotations in the Error Span Annotation (ESA) format, using an error taxonomy inspired by Multidimensional Quality Metrics (MQM). These outputs are compatible with state-of-the-art error detection models and suitable for training MT or post-editing systems. Our user study confirms that TranslationCorrect significantly improves translation efficiency and user satisfaction over traditional annotation methods.
nan
Article 347
Title@2025-06-23 (1): HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
Title: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States | HiddenDetect: Ermitteln von Jailbreak Attacken gegen große Vision-Sprache Modelle durch Überwachung Hidden States | 通过监测隐藏国,侦查对大型视觉-语言模型的越狱袭击 2502.14744v4 |
Authors (7): Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
nan
Article 348
Title@2025-06-23 (1): Enhancing Entity Aware Machine Translation with Multi-task Learning
Title: Enhancing Entity Aware Machine Translation with Multi-task Learning | Erweitern Entity Aware Machine Translation mit Multi-Task-Lernen | 以多任务学习方式增强实体意识机器翻译 2506.18318v1 |
Authors (3): An Trieu, Phuong Nguyen, Minh Le Nguyen
Entity-aware machine translation (EAMT) is a complicated task in natural language processing, owing both to the shortage of translation data for the entities to be translated and to the complexity of the context that must be processed when translating them. In this paper, we propose a method that applies multi-task learning to jointly optimize two subtasks, named entity recognition and machine translation, which improves the final performance on the entity-aware machine translation task. Results and analysis are reported on the dataset provided by the organizer of Task 2 of the SemEval 2025 competition.
nan
Article 349
Title@2025-06-23 (1): Team LA at SCIDOCA shared task 2025: Citation Discovery via relation-based zero-shot retrieval
Title: Team LA at SCIDOCA shared task 2025: Citation Discovery via relation-based zero-shot retrieval | Team LA bei SCIDOCA geteilte Aufgabe 2025: Citation Discovery über relationsbasierte Null-Shot-Retrieval | SCIDOCOCA的LA小组共同承担2025年任务:通过基于关系零发检索发现引文 2506.18316v1 |
Authors (3): Trieu An, Long Nguyen, Minh Le Nguyen
The Citation Discovery Shared Task focuses on predicting the correct citation from a given candidate pool for a given paragraph. The main challenges stem from the length of the abstract paragraphs and the high similarity among candidate abstracts, making it difficult to determine the exact paper to cite. To address this, we develop a system that first retrieves the top-k most similar abstracts based on extracted relational features from the given paragraph. From this subset, we leverage a Large Language Model (LLM) to accurately identify the most relevant citation. We evaluate our framework on the training dataset provided by the SCIDOCA 2025 organizers, demonstrating its effectiveness in citation prediction.
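The first, retrieval stage of such a pipeline can be sketched as below. This is a minimal illustration under stated assumptions: relational features are modeled as sets of (subject, relation, object) triples and compared with Jaccard overlap. Both choices, and all names and data, are hypothetical, since the abstract does not specify the feature format or similarity measure. The LLM re-ranking stage is omitted.

```python
from typing import Dict, FrozenSet, List, Tuple

Relation = Tuple[str, str, str]  # (subject, relation, object); hypothetical format


def jaccard(a: FrozenSet[Relation], b: FrozenSet[Relation]) -> float:
    """Overlap between two relation sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(query: FrozenSet[Relation],
          candidates: Dict[str, FrozenSet[Relation]],
          k: int) -> List[str]:
    """Return the ids of the k candidates whose relations best overlap the query's."""
    ranked = sorted(candidates, key=lambda cid: jaccard(query, candidates[cid]), reverse=True)
    return ranked[:k]


# Toy relations extracted from a citing paragraph and three candidate abstracts.
query = frozenset({("model", "improves", "retrieval"), ("LLM", "extracts", "relations")})
candidates = {
    "paper_a": frozenset({("model", "improves", "retrieval"),
                          ("LLM", "extracts", "relations"),
                          ("dataset", "covers", "citations")}),
    "paper_b": frozenset({("model", "improves", "retrieval")}),
    "paper_c": frozenset({("graph", "encodes", "syntax")}),
}

shortlist = top_k(query, candidates, k=2)
print(shortlist)  # ['paper_a', 'paper_b']
```

In the described system, a shortlist like this would then be passed to an LLM that picks the single most relevant citation.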
nan
Article 350
Title@2025-06-23 (1): Enhancing Document Retrieval in COVID-19 Research: Leveraging Large Language Models for Hidden Relation Extraction
Title: Enhancing Document Retrieval in COVID-19 Research: Leveraging Large Language Models for Hidden Relation Extraction | Verbesserung der Dokument-Retrieval in COVID-19 Forschung: Nutzung großer Sprachmodelle für versteckte Beziehungsextraktion | 在COVID-19研究中加强文件检索:利用大语言模型进行隐藏关系采掘 2506.18311v1 |
Authors (5): Hoang-An Trieu, Dinh-Truong Do, Chau Nguyen, Vu Tran, Minh Le Nguyen
In recent years, with the emergence of the COVID-19 pandemic, numerous publications relevant to this disease have been issued. Because of the massive volume of publications, an efficient retrieval system is necessary to provide researchers with useful information when an unexpected pandemic arises as suddenly as COVID-19 did. In this work, we present a method that helps a retrieval system, the Covrelex-SE system, provide higher-quality search results. We exploit the power of large language models (LLMs) to extract hidden relationships in unlabeled publications that cannot be found by the parsing tools the system currently uses. These relationships give the system more useful information during the retrieval process.
nan
Article 351
Title@2025-06-23 (1): PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
Title: PlanGenLLMs: A Modern Survey of LLM Planning Capabilities | PlanGenLLMs: Eine moderne Erhebung der LLM-Planungskapazitäten | PlanGenLLLMs:LLM规划能力现代调查 2502.11221v3 |
Authors (6): Hui Wei, Zihao Zhang, Shenghua He, Tian Xia, Shijia Pan, Fei Liu
LLMs have immense potential for generating plans, transforming an initial world state into a desired goal state. A large body of research has explored the use of LLMs for various planning tasks, from web navigation to travel planning and database querying. However, many of these systems are tailored to specific problems, making it challenging to compare them or determine the best approach for new tasks. There is also a lack of clear and consistent evaluation criteria. Our survey aims to offer a comprehensive overview of current LLM planners to fill this gap. It builds on foundational work by Kartam and Wilkins (1990) and examines six key performance criteria: completeness, executability, optimality, representation, generalization, and efficiency. For each, we provide a thorough analysis of representative works and highlight their strengths and weaknesses. Our paper also identifies crucial future directions, making it a valuable resource for both practitioners and newcomers interested in leveraging LLM planning to support agentic workflows.
nan
Article 352
Title@2025-06-23 (1): AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles
Title: AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles | AlzheimerRAG: Multimodale Retrieval Augmented Generation für klinische Anwendungsfälle mit PubMed Artikeln | SDRRAG: 使用PubMed 文章临床使用案例的多式回收增加代数 2412.16701v2 |
Authors (2): Aritra Kumar Lahiri, Qinmin Vivian Hu
Recent advancements in generative AI have fostered the development of highly adept Large Language Models (LLMs) that integrate diverse data types to empower decision-making. Among these, multimodal retrieval-augmented generation (RAG) applications are promising because they combine the strengths of information retrieval and generative models, enhancing their utility across various domains, including clinical use cases. This paper introduces AlzheimerRAG, a Multimodal RAG application for clinical use cases, primarily focusing on Alzheimer’s Disease case studies from PubMed articles. This application incorporates cross-modal attention fusion techniques to integrate textual and visual data processing by efficiently indexing and accessing vast amounts of biomedical literature. Our experimental results, compared to benchmarks such as BioASQ and PubMedQA, have yielded improved performance in the retrieval and synthesis of domain-specific information. We also present a case study using our multimodal RAG in various Alzheimer’s clinical scenarios. We infer that AlzheimerRAG can generate responses with accuracy non-inferior to humans and with low rates of hallucination.
nan
Article 353
Title@2025-06-23 (1): LoRA vs Full Fine-tuning: An Illusion of Equivalence
Title: LoRA vs Full Fine-tuning: An Illusion of Equivalence | LoRA vs. Full Fine-Tuning: Eine Illusion der Gleichwertigkeit | LoRA 与 完全微调: 等同的幻象 2410.21228v2 |
Authors (4): Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But are their learned solutions really equivalent? We study how LoRA and full fine-tuning change pre-trained models by analyzing the model’s weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find that its forgetting is largely localized to the intruder dimensions: by causally intervening on the intruder dimensions, changing their associated singular values after fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and lead to more forgetting. This will be amplified during continual learning because of sequential fine-tuning, and we show that LoRA models that accumulate intruder dimensions tend to perform worse in this setting, emphasizing the practicality of our findings.
nan
Article 354
Title@2025-06-23 (1): When Large Language Models Meet Vector Databases: A Survey
Title: When Large Language Models Meet Vector Databases: A Survey | Wenn große Sprachmodelle Vektordatenbanken treffen: Eine Umfrage | 当大语言模型与矢量数据库相匹配时:调查 2402.01763v4 |
Authors (8): Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang
This survey explores the synergistic potential of Large Language Models (LLMs) and Vector Databases (VecDBs), a burgeoning and rapidly evolving research area. With the proliferation of LLMs comes a host of challenges, including hallucinations, outdated knowledge, prohibitive commercial application costs, and memory issues. VecDBs emerge as a compelling solution to these issues by offering an efficient means to store, retrieve, and manage the high-dimensional vector representations intrinsic to LLM operations. Through this nuanced review, we delineate the foundational principles of LLMs and VecDBs and critically analyze their integration’s impact on enhancing LLM functionalities. This discourse extends into a discussion on the speculative future developments in this domain, aiming to catalyze further research into optimizing the confluence of LLMs and VecDBs for advanced data handling and knowledge extraction capabilities.
nan
Article 355
Title@2025-06-23 (1): FutureFill: Fast Generation from Convolutional Sequence Models
Title: FutureFill: Fast Generation from Convolutional Sequence Models | FutureFill: Schnelle Generation aus konvolutionären Sequenzmodellen | 未来金融危机:从变序模型中快速繁衍 2410.03766v3 |
Authors (9): Naman Agarwal, Xinyi Chen, Evan Dogariu, Devan Shah, Hubert Strauss, Vlad Feinberg, Daniel Suo, Peter Bartlett, Elad Hazan
We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill, a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated, often much smaller than the caches required by standard convolutional or attention based models. We validate our theoretical claims with experiments on synthetic tasks and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.
nan
Article 356
Title@2025-06-23 (1): AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Title: AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining | AdaLRS: Loss-Guided Adaptive Learning Rate Suche nach effizientem Foundation Model Pretraining | AdaLRS: 为高效基础基础示范培训前而寻找学习率 2506.13274v2 |
Authors (5): Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran
Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide experimental results to show that the optimization of training loss and loss descent velocity in foundation model pretraining are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, and base learning rate scheduler choices.
Article 357
Title@2025-06-23 (1): RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Title: RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding | RAPID: Lang-Kontext-Schlussfolgerung mit retrieval-Augmented Spekulative Decodierung | RAPID: 利用回溯性增强的投机性代谢法进行长文本推理 2502.20330v2 |
Authors (5): Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG both to accelerate long-context inference and to enhance generation quality. RAPID introduces the RAG drafter (a draft LLM operating on shortened retrieval contexts) to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.
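For readers unfamiliar with speculative decoding, the sketch below shows the standard accept/reject rule that such methods build on: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is sampled from the normalized positive part of p - q. This is the generic rule only; RAPID's retrieval-augmented target distribution is not modeled here, and the function names are illustrative.

```python
def acceptance_probability(p_target, q_draft, token):
    """Probability of keeping a drafted token under standard speculative sampling."""
    q = q_draft[token]
    if q <= 0.0:
        raise ValueError("the draft model cannot have proposed a zero-probability token")
    return min(1.0, p_target[token] / q)


def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    raw = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(raw)
    if total == 0.0:
        # p and q coincide, so no rejection can occur; fall back to p.
        return list(p_target)
    return [r / total for r in raw]
```

With `p_target = [0.5, 0.3, 0.2]` and `q_draft = [0.2, 0.6, 0.2]`, token 0 is always accepted, while token 1 is accepted half of the time.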
Article 358
Title@2025-06-23 (1): Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework
Title: Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework | Sykopanz in Vision-Language-Modellen: Eine systematische Analyse und ein Inferenz-Zeit-Mitigation-Framework | 远景 – – 语言模型的求视 – – 语言模型的求爱:系统分析和推论 – – 时间减缓框架 2408.11261v2 |
Authors (7): Yunpu Zhao, Rui Zhang, Junbin Xiao, Changxin Ke, Ruibo Hou, Yifan Hao, Ling Li
Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is sycophancy, where models are unduly influenced by leading or deceptive prompts, resulting in biased outputs and hallucinations. Despite the rapid development of LVLMs, evaluating and mitigating sycophancy remains largely under-explored. In this work, we fill this gap by systematically analyzing sycophancy across multiple vision-language benchmarks and propose an inference-time mitigation framework. We curate leading queries and quantify the susceptibility of state-of-the-art LVLMs to prompt-induced bias, revealing consistent performance degradation and instability across models and tasks. Our analysis further uncovers model-specific behavioral traits, such as sentiment sensitivity and prediction polarity shifts under sycophancy. To mitigate these issues, we propose a training-free, model-agnostic framework that operates entirely at inference time. Our approach first employs a query neutralizer, leveraging a language model to suppress implicit sycophantic bias in user queries. We then introduce a sycophancy-aware contrastive decoding mechanism that dynamically recalibrates token-level output distributions by contrasting responses to neutralized and leading queries. Finally, an adaptive logits refinement module further modifies the contrasted logits by integrating both an adaptive plausibility filter and a query sentiment scaler, ensuring coherent and robust generation. Extensive experiments demonstrate that this framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts. Our results suggest that sycophancy in LVLMs is a general and urgent challenge, and that inference-time strategies offer a promising path toward trustworthy multimodal reasoning.
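The contrastive step can be pictured with a small sketch. It assumes the common contrastive-decoding form, in which logits from the neutralized query are amplified, logits from the leading query are subtracted, and tokens that are implausible under the neutralized query are masked out. The exact formula, the sentiment scaler, and the parameter names (`strength`, `plausibility`) are assumptions, not the paper's definitions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(neutral, leading, strength=1.0, plausibility=0.1):
    """Contrast neutral-query logits against leading-query logits.

    Tokens whose neutral-query probability falls below
    `plausibility * max_probability` are masked to -inf (plausibility filter).
    """
    probs = softmax(neutral)
    cutoff = plausibility * max(probs)
    contrasted = []
    for n, l, p in zip(neutral, leading, probs):
        if p < cutoff:
            contrasted.append(float("-inf"))
        else:
            contrasted.append((1.0 + strength) * n - strength * l)
    return contrasted
```

In this toy form, a token that the leading prompt boosts but the neutral prompt does not support ends up with a lower contrasted logit, which is the intended direction of the correction.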
Article 359
Title@2025-06-23 (1): RLPR: Extrapolating RLVR to General Domains without Verifiers
Title: RLPR: Extrapolating RLVR to General Domains without Verifiers | RLPR: Extrapolieren von RLVR auf allgemeine Domains ohne Prüfer | RLPR: 将RLVR外推至普通域域,无验证符 2506.18254v1 |
Authors (12): Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that an LLM’s intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM’s own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments on four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches such as General-Reasoner by 1.6 average points across seven benchmarks.
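A minimal sketch of a probability-based reward follows. It assumes the policy exposes per-token log-probabilities for the reference answer (conditioned on the question and the generated reasoning) and that these are aggregated by averaging the token probabilities. The aggregation choice and the function name are assumptions, and the paper's stabilizing steps are not reproduced.

```python
import math


def probability_reward(ref_token_logprobs):
    """Mean per-token probability the model assigns to the reference answer.

    `ref_token_logprobs` holds one log-probability per reference-answer token.
    The result lies in [0, 1]; higher means the reasoning made the reference
    answer more likely. (Mean aggregation is an assumed choice.)
    """
    if not ref_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    probs = [math.exp(lp) for lp in ref_token_logprobs]
    return sum(probs) / len(probs)
```

Averaging probabilities rather than multiplying them keeps the reward from collapsing toward zero on long reference answers, which is one plausible motivation for a token-level aggregate.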
Article 360
Title@2025-06-23 (1): Craw4LLM: Efficient Web Crawling for LLM Pretraining
Title: Craw4LLM: Efficient Web Crawling for LLM Pretraining | Craw4LLM: Effizientes Web-Crawling für LLM Pretraining | Craw4LLM: 高效的LLM培训前网络排网 2502.13347v3 |
Authors (3): Shi Yu, Zhiyuan Liu, Chenyan Xiong
Web crawls are a main source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler’s scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine’s index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performance as previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
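The scheduling idea can be sketched as a best-first crawl over a link graph, where the frontier is a priority queue keyed by a page-quality score instead of graph connectivity. The `score` callable stands in for a pretraining-influence scorer and the in-memory `out_links` dict stands in for fetching pages; both are hypothetical placeholders, not the Craw4LLM code.

```python
import heapq


def crawl(seeds, out_links, score, budget):
    """Best-first crawl: always expand the highest-scoring known URL next.

    seeds:     iterable of starting URLs
    out_links: dict mapping a URL to the list of URLs it links to
    score:     callable URL -> float (higher is better); assumed scorer
    budget:    maximum number of pages to crawl
    """
    seen = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seen]
    heapq.heapify(frontier)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in out_links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return crawled
```

Swapping `score` for an in-degree count would recover a connectivity-style baseline, which makes the contrast between the two scheduling policies easy to test on a toy graph.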
Article 361
Title@2025-06-23 (1): Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
Title: Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models | Chain-of-Experts: Entsperren der Kommunikationskraft von Mixture-of-Experts-Modellen | 专家链:解锁混合专家模型的通信能力 2506.18945v1 |
Authors (10): Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model’s representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE’s benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
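The iterative routing described above can be illustrated with a toy layer in plain Python. It assumes each iteration has its own router that scores all experts on the current token state, selects the top-k, and adds their gated outputs back through a residual connection. The function signature and the softmax gating are assumptions made for illustration, not the released implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def chain_of_experts_layer(x, experts, routers, top_k=1):
    """Process one token through several routed iterations inside a layer.

    x:       token state as a list of floats
    experts: list of callables, each mapping a state to a same-length list
    routers: one callable per iteration, mapping a state to one score per expert
    """
    for router in routers:
        scores = router(x)
        chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
        gates = softmax([scores[i] for i in chosen])
        update = [0.0] * len(x)
        for gate, idx in zip(gates, chosen):
            out = experts[idx](x)
            update = [u + gate * o for u, o in zip(update, out)]
        # Residual connection: later iterations see the updated state,
        # so their routers can pick different experts.
        x = [a + b for a, b in zip(x, update)]
    return x
```

A standard MoE layer corresponds to a single router in `routers`; passing two or more routers gives the sequential, re-routed processing that the abstract contrasts with independent parallel experts.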
Article 362
Title@2025-06-23 (1): From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents
Title: From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents | Von der RAG zur Agentur: Validierung islamisch-medizinischer Reaktionen mit LLM-Agenten | 从RAG到AG 剂:用LLM代理物验证伊斯兰药物反应 2506.15911v2 |
Authors (5): Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam, Shahab Saquib Sohail, Amir Hussain
Centuries-old Islamic medical texts like Avicenna’s Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.
Article 363
Title@2025-06-23 (1): AdapThink: Adaptive Thinking Preferences for Reasoning Language Model
Title: AdapThink: Adaptive Thinking Preferences for Reasoning Language Model | AdapThink: Adaptive Denkeinstellungen für das Sprachmodell der Vernunft | AdapThink:对理由语言模式的适应性思维偏好 2506.18237v1 |
Authors (6): Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song, Mingyang Sun
Reinforcement Learning (RL)-based post-training has significantly advanced the complex reasoning capabilities of language models, fostering sophisticated self-reflection processes. However, this “slow thinking” paradigm presents a critical challenge to reasoning efficiency: models may expend excessive computation on simple questions and shift reasoning prematurely for complex ones. Previous mechanisms typically rely on static length budgets or predefined rules, lacking the adaptability for varying question complexities and models’ evolving capabilities. To this end, we propose AdapThink, an adaptive post-training framework designed to induce more efficient thinking while maintaining the performance of reasoning language models. Specifically, AdapThink incorporates two key mechanisms: 1) a group-relative reward function that leverages model confidence and response characteristics to dynamically adjust the preference for reflection-related transition words without resorting to a fixed length preference; 2) a diversity-aware sampling mechanism that balances the training group’s solution accuracy with reasoning diversity via an entropy-guided score. Experiments on several mathematical reasoning datasets with DeepSeek-distilled models demonstrate AdapThink’s advantages in enabling adaptive reasoning patterns and mitigating these inefficiencies.
Article 364
Title@2025-06-23 (1): NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts
Title: NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts | NovelHopQA: Diagnose von Multi-Hop-Vernünftigkeitsfehlern in langen narrativen Kontexten | 新创HopQA:在长期叙述背景下诊断多种原因的失败 2506.02000v2 |
Authors (5): Abhay Gupta, Michael Lu, Kevin Zhu, Sean O’Brien, Vasu Sharma
Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate seven state-of-the-art models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We additionally present retrieval-augmented generation (RAG) evaluations to test model performance when only selected passages are provided instead of the full context. We observe consistent accuracy drops as hop count and context length increase, even for frontier models, revealing that sheer scale does not guarantee robust reasoning. Failure-mode analysis highlights common breakdowns such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to test multi-hop reasoning at scale. All code and datasets are available at https://novelhopqa.github.io.
Article 365
Title@2025-06-23 (1): Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models
Title: Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models | Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalisable ASR Models | 推进非洲集中的语音识别:为可通用的 ASR 模型选择突发性不确定性驱动数据 2306.02105v7 |
Authors (1): Bonaventure F. P. Dossou
Accents play a pivotal role in shaping human communication, enhancing our ability to convey and comprehend messages with clarity and cultural nuance. While there has been significant progress in Automatic Speech Recognition (ASR), African-accented English ASR has been understudied due to a lack of training datasets, which are often expensive to create and demand colossal human labor. Combining several active learning paradigms and the core-set approach, we propose a new multi-round adaptation process that uses epistemic uncertainty to automate the annotation process, significantly reducing the associated costs and human labor. This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty, enhancing training efficiency. We define a new U-WER metric to track model adaptation to hard accents. We evaluate our approach across several domains, datasets, and high-performing speech models. Our results show that our approach leads to a 27% average relative WER improvement while requiring on average 45% less data than established baselines. Our approach also improves out-of-distribution generalization for very low-resource accents, demonstrating its viability for building generalizable ASR models in the context of accented African ASR. We open-source the code here: https://github.com/bonaventuredossou/active_learning_african_asr.
Article 366
Title@2025-06-22 (7): Shrinking the Generation-Verification Gap with Weak Verifiers
Title: Shrinking the Generation-Verification Gap with Weak Verifiers | Schrumpfung der Generation-Verifikationslücke mit schwachen Prüfern | 利用薄弱的验证器缩小代代核查差距 2506.18203v1 |
Authors (12): Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré
Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver’s effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show that Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as the generator and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce the computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver’s combined output scores.
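Why accuracy-weighted ensembles can beat unweighted voting is easy to see in a small sketch. It assumes binary verifier verdicts and per-verifier accuracies that are already known, and weights each verdict by its log-odds (the naive-Bayes weighting for independent voters). Weaver itself estimates accuracies through weak supervision without labels; that estimation step is not shown, and the function names are illustrative.

```python
import math


def verifier_weight(accuracy):
    """Log-odds weight for a verifier; accuracy is clipped away from 0 and 1."""
    acc = min(max(accuracy, 1e-6), 1.0 - 1e-6)
    return math.log(acc / (1.0 - acc))


def combined_score(votes, accuracies):
    """Weighted sum of verdicts: +weight for accept, -weight for reject."""
    return sum(
        verifier_weight(acc) * (1.0 if vote else -1.0)
        for vote, acc in zip(votes, accuracies)
    )


def select_best(candidate_votes, accuracies):
    """Index of the candidate response with the highest combined score."""
    return max(
        range(len(candidate_votes)),
        key=lambda i: combined_score(candidate_votes[i], accuracies),
    )
```

With accuracies `[0.9, 0.6, 0.6]`, a candidate accepted only by the strong verifier outscores one accepted only by the two weak verifiers, whereas an unweighted majority vote would pick the latter.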
Article 367
Title@2025-06-22 (7): Supernova Event Dataset: Interpreting Large Language Models’ Personality through Critical Event Analysis
Title: Supernova Event Dataset: Interpreting Large Language Models’ Personality through Critical Event Analysis | Supernova Event Dataset: Die Persönlichkeit großer Sprachmodelle durch kritische Ereignisanalyse interpretieren | 超新星事件数据集:通过重大事件分析解释大语言模型的个性 2506.12189v2 |
Authors (2): Pranav Agarwal, Ioana Ciucă
Large Language Models (LLMs) are increasingly integrated into everyday applications. As their influence grows, understanding their decision making and underlying personality becomes essential. In this work, we interpret model personality using our proposed Supernova Event Dataset, a novel dataset with diverse articles spanning biographies, historical events, news, and scientific discoveries. We use this dataset to benchmark LLMs on extracting and ranking key events from text, a subjective and complex challenge that requires reasoning over long-range context and modeling causal chains. We evaluate small models like Phi-4, Orca 2, and Qwen 2.5, and large, stronger models such as Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another LLM acts as a judge to infer each model’s personality based on its selection and classification of events. Our analysis shows distinct personality traits: for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal dynamics, while Qwen 2.5 displays a more strategic, analytical style. When analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors step-by-step causal reasoning. This analysis improves model interpretability, making the models more user-friendly for a wide range of applications. Project page: https://www.supernova-event.ai/
Article 368
Title@2025-06-22 (7): Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
Title: Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications | Entschlüsselung von Emotionen in Kindergeschichten: Eine vergleichende Analyse multimodaler LLMs in pädagogischen Anwendungen | 儿童故事书中的破除情感:教育应用中多种模式LMs的比较分析 2506.18201v1 |
Authors (5): Bushra Asseri, Estabraq Abdelaziz, Maha Al Mogren, Tayef Alhefdhi, Areej Al-Wabil
Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies, yet remain underexplored for Arabic language contexts where culturally appropriate learning tools are critically needed. This study evaluates the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children’s storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik’s emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini’s best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models’ cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners.
Article 369
Title@2025-06-22 (7): Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review
Title: Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review | Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Moslems in Large Language Models: A Systematic Review | 减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查 2506.18199v1 |
Authors (3): Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil
Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias, particularly towards Arabs and Muslims, pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham’s systematic review methodology, we analyzed 8 empirical studies published between 2021 and 2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.
Article 370
Title@2025-06-22 (7): ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Title: ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | ExpertLongBench: Benchmarking-Sprachmodelle auf Expertenebene Langform-Erstellungsaufgaben mit strukturierten Checklisten | 专家关系:专家级长期世代任务的语言模式基准与结构化核对清单 2506.01241v2 |
Authors (17): Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
Article 371
Title@2025-06-22 (7): CareLab at #SMM4H-HeaRD 2025: Insomnia Detection and Food Safety Event Extraction with Domain-Aware Transformers
Title: CareLab at #SMM4H-HeaRD 2025: Insomnia Detection and Food Safety Event Extraction with Domain-Aware Transformers | CareLab auf der #SMM4H-HeaRD 2025: Schlaflosigkeitserkennung und Lebensmittelsicherheitsveranstaltung Extraktion mit Domain-Aware Transformern | #SMM4H-HeARD 2025:与域-软件变换器一起的失眠检测和食品安全事件提取 2506.18185v1 |
Authors (4): Zihan Liang, Ziwen Pan, Sumon Kanti Dey, Azra Ismail
This paper presents our system for the SMM4H-HeaRD 2025 shared tasks, specifically Task 4 (Subtasks 1, 2a, and 2b) and Task 5 (Subtasks 1 and 2). Task 4 focused on detecting mentions of insomnia in clinical notes, while Task 5 addressed the extraction of food safety events from news articles. We participated in all subtasks and report key findings across them, with particular emphasis on Task 5 Subtask 1, where our system achieved strong performance, securing first place with an F1 score of 0.958 on the test set. To attain this result, we employed encoder-based models (e.g., RoBERTa), alongside GPT-4 for data augmentation. This paper outlines our approach, including preprocessing, model architecture, and subtask-specific adaptations.
Article 372
Title@2025-06-22 (7): Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?
Title: Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? | Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? | 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v1 |
Authors (6): Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85%, particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
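Calibration of self-verbalized confidence is commonly summarized with the expected calibration error (ECE), which compares average confidence to empirical accuracy within confidence bins. The sketch below implements that standard metric; whether the paper reports exactly this variant (equal-width bins, ten bins) is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: floats in [0, 1], e.g. self-verbalized confidence per answer
    correct:     booleans, whether each answer was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that states 90% confidence on every answer but is right only half of the time has an ECE of about 0.4, which is the overconfidence pattern the abstract describes.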
Article 373
Title@2025-06-22 (7): Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness
Title: Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness | Mehrsprachige retrieval Augmented Generation für kulturell-sensitive Aufgaben: Ein Benchmark für übergreifende Robustheit | 文化敏感性任务多语种检索增强一代人:跨语种强力基准 2410.01171v3 |
Authors (10): Bryan Li, Fiona Luo, Samar Haider, Adwait Agashe, Tammy Li, Runqi Liu, Muqing Miao, Shriya Ramakrishnan, Yuan Yuan, Chris Callison-Burch
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in scenarios which are multilingual and culturally sensitive, such as territorial disputes. We thus introduce BordIRLines, a dataset of territorial disputes paired with retrieved Wikipedia documents, across 49 languages. We evaluate the cross-lingual robustness of this RAG setting by formalizing several modes for multilingual retrieval. Our experiments on several LLMs show that incorporating perspectives from diverse languages can in fact improve robustness; retrieving multilingual documents best improves response consistency and decreases geopolitical bias over RAG with purely in-language documents. We also consider how RAG responses utilize presented documents, finding a much wider variance in the linguistic distribution of response citations when querying in low-resource languages. Our further analyses investigate the various aspects of a cross-lingual RAG pipeline, from retrieval to document contents. We release our benchmark and code to support continued research towards equitable information access across languages at https://huggingface.co/datasets/borderlines/bordirlines.
Article 374
Title@2025-06-22 (7): QuranMorph: Morphologically Annotated Quranic Corpus
Title: QuranMorph: Morphologically Annotated Quranic Corpus | KoranMorph: Morphologisch kommentierter Koran Corpus | 《古兰经》: 体质说明 2506.18148v1 |
Authors (3): Diyam Akra, Tymaa Hammouda, Mustafa Jarrar
We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part-of-speech by three expert linguists. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.
Article 375
Title@2025-06-22 (7): Enhancing LLM Knowledge Learning through Generalization
Title: Enhancing LLM Knowledge Learning through Generalization | Verbesserung des LLM-Wissenslernens durch Verallgemeinerung | 通过普遍化加强LLM知识学习 2503.03705v2 |
Authors (6): Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
As large language models (LLMs) are increasingly deployed in diverse applications, faithfully integrating evolving factual knowledge into these models remains a critical challenge. Continued pre-training on paraphrased data has shown empirical promise for enhancing knowledge acquisition. However, this approach is often costly and unreliable, as it relies on external models or manual effort for rewriting, and may inadvertently alter the factual content. In this work, we hypothesize and empirically show that an LLM’s ability to continually predict the same factual knowledge tokens given diverse paraphrased contexts is positively correlated with its capacity to extract that knowledge via question-answering. Based on this view and aiming to improve generalization to diverse paraphrased contexts, we introduce two strategies to enhance LLMs’ ability to predict the same knowledge tokens given varied contexts, thereby enhancing knowledge acquisition. First, we propose formatting-based data augmentation, which diversifies documents conveying the same knowledge by altering document formats rather than their content, thereby preserving factual integrity. Second, we adopt sharpness-aware minimization as the optimizer to further improve generalization. Extensive experiments demonstrate our methods’ effectiveness in both continued pre-training and instruction tuning, and further gains can be achieved by combining with paraphrased data.
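The second strategy above relies on sharpness-aware minimization (SAM). As a rough illustration of the general technique (not the authors' training code), one SAM update first moves the weights a distance rho along the normalized gradient, then applies the gradient measured at that perturbed point to the original weights. The toy loss, the learning rate, and the rho value below are assumptions chosen for the example.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization update on a flat list of weights.

    grad_fn(weights) must return the loss gradient as a list of floats.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Zero gradient: no ascent direction is defined, leave weights as-is.
        return list(weights)
    # Ascent step: move to the (approximate) worst point within radius rho.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the ORIGINAL weights.
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def toy_grad(w):
    """Gradient of the toy loss 0.5 * sum(w_i ** 2), i.e. w itself."""
    return list(w)


updated = sam_step([1.0, 0.0], toy_grad)
```

On the toy quadratic the perturbed point is [1.05, 0.0], so the update subtracts 0.1 * 1.05 from the first weight rather than 0.1 * 1.0 as plain gradient descent would.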
Article 376
Title@2025-06-22 (7): Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models
Title: Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models | Sparse-Feature-Koaktivierung zeigt kombinierbare semantische Module in großen Sprachmodellen | 大语言模型中可合成的语义模块 2506.18141v1 |
Authors (8): Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai
We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and country components yields compound counterfactual outputs. We find that, whereas most country components emerge from the very first layer, the more abstract relation components are concentrated in later layers. Furthermore, within relation components themselves, nodes from later layers tend to have a stronger causal impact on model outputs. Overall, these findings suggest a modular organization of knowledge within LLMs and advance methods for efficient, targeted model manipulation.
Article 377
Title@2025-06-22 (7): Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs
Title: Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs | Collage: Ablösbares Rapid Prototyping für Informationsextraktion auf wissenschaftlichen PDFs | 结合:可分解的用于科学文件格式信息提取的快速原型 2410.23478v2 |
Authors (9): Sireesh Gururaja, Yueheng Zhang, Guannan Tang, Tianhao Zhang, Kevin Murphy, Yu-Tsen Yi, Junwon Seo, Anthony Rollett, Emma Strubell
Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained transformer models. While the opportunity for scientists outside of NLP to evaluate and apply such systems to their own domains has never been clearer, these models are difficult to compare: they accept different input formats, are often black-box and give little insight into processing failures, and rarely handle PDF documents, the most common format of scientific publication. In this work, we present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Collage allows the use and evaluation of any HuggingFace token classifier, several LLMs, and multiple other task-specific models out of the box, and provides extensible software interfaces to accelerate experimentation with new models. Further, we enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
Article 378
Title@2025-06-22 (7): SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging
Title: SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging | SE-Merging: Ein selbsterweiterter Ansatz für dynamisches Modellverschmelzen | SE-Miging:动态模型合并的自我加强办法 2506.18135v1 |
Authors (6): Zijun Chen, Zhanpeng Zhou, Bo Zhang, Weinan Zhang, Xi Sun, Junchi Yan
Model merging has gained increasing attention due to its intriguing property: interpolating the parameters of different task-specific fine-tuned models leads to multi-task abilities. However, despite its empirical success, the underlying mechanisms of model merging remain poorly understood. In this work, we delve into the mechanism behind model merging from a representation perspective. Our analysis reveals that model merging achieves multi-task abilities through two key capabilities: i) distinguishing samples from different tasks, and ii) adapting to the corresponding expert model for each sample. These two capabilities allow the merged model to retain task-specific expertise, enabling efficient multi-task adaptation. Building on these insights, we propose SE-Merging, a self-enhanced model merging framework that leverages these two characteristics to dynamically identify the corresponding task for each sample and then adaptively rescales the merging coefficients to further enhance task-specific expertise in the merged model. Notably, SE-Merging achieves dynamic model merging without additional training. Extensive experiments demonstrate that SE-Merging achieves significant performance improvements while remaining compatible with existing model merging techniques.
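To make "interpolating the parameters" and "rescaling the merging coefficients" concrete, here is a minimal sketch under stated assumptions. It uses a task-arithmetic style merge (base weights plus weighted expert deltas), which is one common merging rule and is swapped in here for illustration; the abstract does not specify the rule SE-Merging uses. The per-sample task probabilities, the `strength` parameter, and both function names are hypothetical.

```python
def merge(base, experts, coeffs):
    """Return base + sum_i coeffs[i] * (experts[i] - base), element-wise."""
    merged = list(base)
    for expert, lam in zip(experts, coeffs):
        for j, (w_expert, w_base) in enumerate(zip(expert, base)):
            merged[j] += lam * (w_expert - w_base)
    return merged


def rescale(coeffs, task_probs, strength=1.0):
    """Boost the coefficient of each expert by how likely the sample is its task.

    task_probs is assumed to come from some per-sample task identifier;
    how that identifier works is not described in the abstract.
    """
    return [lam * (1.0 + strength * p) for lam, p in zip(coeffs, task_probs)]


base = [0.0, 0.0]
experts = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical fine-tuned models
static = merge(base, experts, [0.5, 0.5])
# A sample identified as belonging to task 0 gets a larger weight on expert 0.
dynamic = merge(base, experts, rescale([0.5, 0.5], [1.0, 0.0]))
```

Neither function updates any weights by gradient descent, which matches the abstract's claim of merging without additional training.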
Article 379
Title@2025-06-22 (7): $φ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models
Title: $φ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models | $φ^{\infty}$: Clause Purification, Einbettung von Neuausrichtung und die totale Unterdrückung des Em Dash in autoregressiven Sprachmodellen | $φ^{\infty}$: 条款净化、嵌入的调整和在自动递减语言模式中全面禁止“Em Dash”。 2506.18129v1 |
Authors (2): Bugra Kilictas, Faruk Alpay
We identify a critical vulnerability in autoregressive transformer language models where the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Through formal analysis of token-level perturbations in semantic lattices, we demonstrate that em dash insertion fundamentally alters the model’s latent representations, causing compounding errors in long-form generation. We propose a novel solution combining symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Our approach enables total suppression of problematic tokens without requiring model retraining, while preserving semantic coherence through fixed-point convergence guarantees. Experimental validation shows significant improvements in generation consistency and topic maintenance. This work establishes a general framework for identifying and mitigating token-level vulnerabilities in foundation models, with immediate implications for AI safety, model alignment, and robust deployment of large language models in production environments. The methodology extends beyond punctuation to address broader classes of recursive instabilities in neural text generation systems.
Article 380
Title@2025-06-22 (7): The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English
Title: The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English | The Syntactic Acceptability Dataset (Preview): Eine Ressource für maschinelles Lernen und sprachliche Analyse des Englischen | 同步可接受数据集(预评):英语机器学习和语言分析资源 2506.18120v1 |
Authors (1): Tom S Juzek
We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure a representation of the contemporary discourse. Each entry is labeled with its grammatical status (“well-formedness” according to syntactic formalisms) extracted from the literature, as well as its acceptability status (“intuitive goodness” as determined by native speakers) obtained through crowdsourcing that follows the highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: we observe that grammaticality and acceptability judgments converge in about 83% of the cases and that “in-betweenness” occurs frequently. This corroborates existing research. We also find that while machine learning models struggle with predicting grammaticality, they perform considerably better in predicting acceptability. This is a novel finding. Future work will focus on expanding the dataset.
Article 381
Title@2025-06-22 (7): Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives
Title: Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives | Psychische Gesundheit in LLMs: Multi-Hop-Frage beantworten, um verstärkte und stillschweigende Perspektiven zu erkennen | LLMs中的心理健康平等:利用多种渠道问题回答发现放大和沉默的视角 2506.18116v1 |
Authors (4): Batool Haider, Atmika Gorti, Aman Chadha, Manas Gaur
Large Language Models (LLMs) in mental healthcare risk propagating biases that reinforce stigma and harm marginalized groups. While previous research identified concerning trends, systematic methods for detecting intersectional biases remain limited. This work introduces a multi-hop question answering (MHQA) framework to explore LLM response biases in mental health discourse. We analyze content from the Interpretable Mental Health Instruction (IMHI) dataset across symptom presentation, coping mechanisms, and treatment approaches. Using systematic tagging across age, race, gender, and socioeconomic status, we investigate bias patterns at demographic intersections. We evaluate four LLMs: Claude 3.5 Sonnet, Jamba 1.6, Gemma 3, and Llama 4, revealing systematic disparities across sentiment, demographics, and mental health conditions. Our MHQA approach demonstrates superior detection compared to conventional methods, identifying amplification points where biases magnify through sequential reasoning. We implement two debiasing techniques: Roleplay Simulation and Explicit Bias Reduction, achieving 66-94% bias reductions through few-shot prompting with BBQ dataset examples. These findings highlight critical areas where LLMs reproduce mental healthcare biases, providing actionable insights for equitable AI development.
Article 382
Title@2025-06-22 (7): Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
Title: Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use | Chengyu-Bench: Benchmarking von großen Sprachmodellen für das Verständnis und die Verwendung chinesischer Sprache | Chengyuy-Bench:为中语语言理解和使用制定大语言模式基准 2506.18105v1 |
Authors (5): Yicheng Fu, Zhemin Huang, Liuxin Yang, Yumeng Lu, Zhongdongming Dai
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks: multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
Article 383
Title@2025-06-22 (7): InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
Title: InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating | InspireDebatte: Multi-Dimensional Subjektive-Zielive Evaluation-geführte Begründung und Optimierung zur Debattierung | 预测性辩论:多维可主观性-客观评价-指导性逻辑和最佳评议优化 2506.18102v1 |
Authors (4): Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions (including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement), thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%. Source code is available at https://github.com/fywang12/InspireDebate.
Article 384
Title@2025-06-22 (7): Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution
Title: Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution | Bewertung prompt-basierter und feinverstellter Ansätze zur tschechischen Anaphora-Resolution | 评估对捷克阿原光病决议的即时和精确推荐方法 2506.18091v1 |
Authors (2): Patrik Stano, Aleš Horák
Anaphora resolution plays a critical role in natural language understanding, especially in morphologically rich languages like Czech. This paper presents a comparative evaluation of two modern approaches to anaphora resolution on Czech text: prompt engineering with large language models (LLMs) and fine-tuning compact generative models. Using a dataset derived from the Prague Dependency Treebank, we evaluate several instruction-tuned LLMs, including Mistral Large 2 and Llama 3, using a series of prompt templates. We compare them against fine-tuned variants of the mT5 and Mistral models that we trained specifically for Czech anaphora resolution. Our experiments demonstrate that while prompting yields promising few-shot results (up to 74.5% accuracy), the fine-tuned models, particularly mT5-large, outperform them significantly, achieving up to 88% accuracy while requiring fewer computational resources. We analyze performance across different anaphora types, antecedent distances, and source corpora, highlighting key strengths and trade-offs of each approach.
Article 385
Title@2025-06-22 (7): RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Title: RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation | RoboTwin 2.0: Ein skalierbarer Datengenerator und Benchmark mit starker Domain Randomisierung für robuste bimanuelle Robotermanipulation | RoboTwin 2. 0 : 一个可缩放数据生成器和基准, 具有强力域随机化功能, 用于机械二手机器人操纵的可缩放数据生成器和基准 2506.18088v1 |
Authors (26): Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu
Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen scene real-world tasks, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
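The "367% relative improvement (42.0% vs. 9.0%)" above is a relative gain over the baseline, not a difference in percentage points. A small worked check, with a helper name chosen for this example:

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Success rates quoted in the abstract: 42.0% (fine-tuned) vs. 9.0% (baseline).
gain = relative_gain(42.0, 9.0)
absolute_points = 42.0 - 9.0
```

The absolute difference is 33 percentage points; dividing by the 9.0% baseline gives roughly 367%, which matches the figure quoted in the abstract.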
Article 386
Title@2025-06-22 (7): TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking
Title: TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking | TrumorGPT: Graph-basiertes Retrieval-Augmented Large Language Modell für die Fact-Checking | TrumorGPT: 用于实况调查的基于图表的检索增强型大语言模型 2505.07891v2 |
Authors (3): Ching Nam Hang, Pei-Duo Yu, Chee Wei Tan
In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT, a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish “trumors”, which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.
Article 387
Title@2025-06-22 (7): Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
Title: Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity | 1568 Tokens in einen einzigen Vektor und wieder zurück krammen: Die Grenzen der Einbettung von Raumkapazität erkunden | 将1568吨撞成单一矢量和后向:探索嵌入空间能力的极限 2502.13063v3 |
Authors (4): Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches focus on reducing the amount of compute in existing language models rather than minimizing the number of bits needed to store text. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
Article 388
Title@2025-06-22 (7): FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs
Title: FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs | FinGPT: Verbesserung der Sentiment-Based Stock Movement Prediction mit Verbreitungs-Bewusst und Kontext-angereicherten LLMs | FINGPT:利用传播软件和内容丰富的LMs,加强基于情绪的库存流动预测 2412.10823v2 |
Authors (6): Yixuan Liang, Yuncong Liu, Neng Wang, Hongyang Yang, Boyu Zhang, Christina Dan Wang
Financial sentiment analysis is crucial for understanding the influence of news on stock prices. Recently, large language models (LLMs) have been widely adopted for this purpose due to their advanced text analysis capabilities. However, these models often only consider the news content itself, ignoring its dissemination, which hampers accurate prediction of short-term stock movements. Additionally, current methods often lack sufficient contextual data and explicit instructions in their prompts, limiting LLMs’ ability to interpret news. In this paper, we propose a data-driven approach that enhances LLM-powered sentiment-based stock movement predictions by incorporating news dissemination breadth, contextual data, and explicit instructions. We cluster recent company-related news to assess its reach and influence, enriching prompts with more specific data and precise instructions. This data is used to construct an instruction tuning dataset to fine-tune an LLM for predicting short-term stock price movements. Our experimental results show that our approach improves prediction accuracy by 8% compared to existing methods.
Article 389
Title@2025-06-22 (7): The Democratic Paradox in Large Language Models’ Underestimation of Press Freedom
Title: The Democratic Paradox in Large Language Models’ Underestimation of Press Freedom | Das demokratische Paradox in großen Sprachmodellen’ Unterschätzung der Pressefreiheit | 《大语言模式中的民主悖论》对新闻自由的低估 2506.18045v1 |
Authors (4): I. Loaiza, R. Vestrelli, A. Fronzetti Colladon, R. Rigobon
As Large Language Models (LLMs) increasingly mediate global information access for millions of users worldwide, their alignment and biases have the potential to shape public understanding and trust in fundamental democratic institutions, such as press freedom. In this study, we uncover three systematic distortions in the way six popular LLMs evaluate press freedom in 180 countries compared to expert assessments of the World Press Freedom Index (WPFI). The six LLMs exhibit a negative misalignment, consistently underestimating press freedom, with individual models rating between 71% and 93% of countries as less free. We also identify a paradoxical pattern we term differential misalignment: LLMs disproportionately underestimate press freedom in countries where it is strongest. Additionally, five of the six LLMs exhibit positive home bias, rating their home countries’ press freedoms more favorably than would be expected given their negative misalignment with the human benchmark. In some cases, LLMs rate their home countries between 7% and 260% more positively than expected. If LLMs are set to become the next search engines and some of the most important cultural tools of our time, they must ensure accurate representations of the state of our human and civic rights globally.
Article 390
Title@2025-06-22 (7): Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Kreuz von links nach rechts Gehirn: Adaptiver Texttraum für Vision-und-Sprachen-Navigation | 从左脑到右脑交叉:愿景和语言导航的适应性文本梦想者 2505.20897v2 |
Authors (10): Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics in language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
Article 391
Title@2025-06-22 (7): MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval
Title: MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval | MM-R5: MultiModal reasoning-enhanced ReRanker über Verstärkungs-Lernen für Dokument-Retrieval | MM-R5:通过文件检索强化学习加强文件检索,多模式合理改进Reanker 2506.12364v2 |
Authors (8): Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline. Our code is available at https://github.com/i2vec/MM-R5.
Article 392
Title@2025-06-22 (7): Markov-Enhanced Clustering for Long Document Summarization: Tackling the ‘Lost in the Middle’ Challenge with Large Language Models
Title: Markov-Enhanced Clustering for Long Document Summarization: Tackling the ‘Lost in the Middle’ Challenge with Large Language Models | Markov-erweiterte Clustering für lange Dokumentzusammenfassung: Die Herausforderung ‘Verloren in der Mitte’ mit großen Sprachmodellen bewältigen | Markov-加强长文档摘要集群:用大语言模式应对“中中途迷”挑战 2506.18036v1 |
Authors (2): Aziz Amari, Mohamed Achref Ben Ammar
The rapid expansion of information from diverse sources has heightened the need for effective automatic text summarization, which condenses documents into shorter, coherent texts. Summarization methods generally fall into two categories: extractive, which selects key segments from the original text, and abstractive, which generates summaries by rephrasing the content coherently. Large language models have advanced the field of abstractive summarization, but they are resource-intensive and face significant challenges in retaining key information across lengthy documents, which we call being “lost in the middle”. To address these issues, we propose a hybrid summarization approach that combines extractive and abstractive techniques. Our method splits the document into smaller text chunks, clusters their vector embeddings, generates a summary for each cluster that represents a key idea in the document, and constructs the final summary by relying on a Markov chain graph when selecting the semantic order of ideas.
Article 393
Title@2025-06-22 (7): Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices
Title: Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices | Splitformer: Eine verbesserte Early-Exit-Architektur für die automatische Spracherkennung an Kantengeräten | 分解器:改进了边缘设备自动语音识别的提前退出结构 2506.18035v1 |
Authors (3): Maxence Lasbordes, Daniele Falavigna, Alessio Brutti
The ability to dynamically adjust the computational load of neural models during inference in a resource-aware manner is crucial for on-device processing scenarios, characterised by limited and time-varying computational resources. Early-exit architectures represent an elegant and effective solution, since they can process the input with a subset of their layers, exiting at intermediate branches (the uppermost layers are hence removed from the model). From a different perspective, for automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis, through downsampling/upsampling operations in the middle layers, reducing the overall number of operations and significantly improving the performance on well-established benchmarks. One example is the Zipformer. However, these architectures lack the modularity necessary to inject early-exit branches. With the aim of improving the performance in early-exit models, we propose introducing parallel layers in the architecture that process downsampled versions of their inputs. We show that in this way the speech recognition performance on standard benchmarks significantly improves, at the cost of a small increase in the overall number of model parameters but without affecting the inference time.
Article 394
Title@2025-06-22 (7): PDF Retrieval Augmented Question Answering
Title: PDF Retrieval Augmented Question Answering | PDF Retrieval Augmented Frage beantworten | PDF PDF 检索递增问题解答 2506.18027v1 |
Authors (2): Thi Thu Uyen Hoang, Viet Anh Nguyen
This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs (including text, images, vector diagrams, graphs, and tables) pose unique challenges for existing QA systems, which are primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.
Article 395
Title@2025-06-22 (7): LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Title: LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs | LongLlaDA: Entsperren langer Kontextkapazitäten in Diffusions-LLMs | LongLLALDA:释放扩散长程距离能力 2506.14429v2 |
Authors (6): Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Moreover, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover that diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first length extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs. The code is available at https://github.com/OpenMOSS/LongLLaDA.
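For readers unfamiliar with NTK-based RoPE extrapolation, the sketch below shows the widely used NTK-aware rule, which enlarges the rotary base so that the lowest frequency is stretched by the target scale while the highest frequency is left unchanged. The function name, the head dimension of 64, the base of 10000, and the scale of 4 are illustrative assumptions; the abstract does not state which variant or values LongLLaDA uses.

```python
import math


def rope_inv_freq(head_dim, base=10000.0, scale=1.0):
    """Inverse RoPE frequencies with NTK-aware base scaling.

    scale=1.0 gives plain RoPE. For scale > 1 the base is multiplied by
    scale ** (head_dim / (head_dim - 2)), the common NTK-aware rule.
    (Illustrative helper; not taken from the LongLLaDA code base.)
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return [ntk_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


plain = rope_inv_freq(64)
scaled = rope_inv_freq(64, scale=4.0)

# Highest frequency is untouched; lowest frequency is divided by the scale.
print(scaled[0] == plain[0])
print(math.isclose(scaled[-1] * 4.0, plain[-1]))
```

The two checks illustrate why the rule is attractive for training-free extension: short-range (high-frequency) position information is preserved, and only the long-range components are interpolated.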
Article 396
Title@2025-06-22 (7): Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Title: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data | Lernen aus Referenzantworten: Vielseitige Sprachmodellausrichtung ohne Binäre menschliche Präferenzdaten | 从参考资料解答中学习:通用语言模型调整,无二元人类优先数据 2504.09895v2 |
Authors (3): Shuai Zhao, Linchao Zhu, Yi Yang
Large language models (LLMs) are expected to be helpful, harmless, and honest. In alignment scenarios such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but essential for transferring human preference. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function choice for LLM alignment. Similarity reward circumvents binary preference data collection and reward modeling when unary high-quality reference answers are available. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reference or reward models. RefAlign utilizes similarity metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, RefAlign demonstrates comparable performance to previous alignment methods without binary preference data and reward models.
Article 397
Title@2025-06-22 (7): AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Title: AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs | AlphaDecay: Modulweises Gewichtsdecay für schweres Balancing in LLMs | AlphaDecay:LLMM中重帆平衡的舱型偏重衰减 2506.14562v2 |
Authors (7): Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
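The core idea, weaker decay for heavier-tailed modules, can be sketched as a simple mapping from a per-module tail exponent to a decay strength. In HT-SR theory a smaller power-law exponent (alpha) indicates a heavier tail, so the sketch makes decay grow with alpha. The linear interpolation, the decay range, the module names, and the alpha values are all illustrative assumptions; the abstract does not specify the actual assignment rule.

```python
def module_weight_decay(alphas, wd_min=0.05, wd_max=0.15):
    """Map per-module tail exponents to weight-decay strengths.

    alphas: dict of module name -> estimated power-law exponent of the ESD.
    Smaller alpha (heavier tail) gets weaker decay; larger alpha gets stronger.
    Linear interpolation is an assumption made for this sketch.
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    if hi == lo:  # all modules look alike: fall back to the midpoint
        return {name: (wd_min + wd_max) / 2 for name in alphas}
    return {
        name: wd_min + (alpha - lo) / (hi - lo) * (wd_max - wd_min)
        for name, alpha in alphas.items()
    }


# Hypothetical exponents for three modules of one transformer block.
example = {"attn.q_proj": 2.0, "attn.v_proj": 3.0, "mlp.up_proj": 4.0}
for name, wd in module_weight_decay(example).items():
    print(f"{name}: weight_decay={wd:.3f}")
```

In practice the resulting values would be passed to the optimizer as per-parameter-group settings, and the exponents would be re-estimated during training.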
Article 398
Title@2025-06-22 (7): GeAR: Graph-enhanced Agent for Retrieval-augmented Generation
Title: GeAR: Graph-enhanced Agent for Retrieval-augmented Generation | GeAR: Graphenverstärkter Agent für retrieval-augmentierte Generation | GeAR: 回收后再生一代的图形增强剂 2412.18431v2 |
Authors (15): Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Enting Chen, Damien Graux, Andre Melo, Ruofei Lai, Zeren Jiang, Zhongyang Li, YE QI, Yang Ren, Dandan Tu, Jeff Z. Pan
Retrieval-augmented Generation (RAG) relies on effective retrieval capabilities, yet traditional sparse and dense retrievers inherently struggle with multi-hop retrieval scenarios. In this paper, we introduce GeAR, a system that advances RAG performance through two key innovations: (i) an efficient graph expansion mechanism that augments any conventional base retriever, such as BM25, and (ii) an agent framework that incorporates the resulting graph-based retrieval into a multi-step retrieval framework. Our evaluation demonstrates GeAR’s superior retrieval capabilities across three multi-hop question answering datasets. Notably, our system achieves state-of-the-art results with improvements exceeding 10% on the challenging MuSiQue dataset, while consuming fewer tokens and requiring fewer iterations than existing multi-step retrieval systems. The project page is available at https://gear-rag.github.io.
Article 399
Title@2025-06-22 (7): Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures
Title: Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures | Cross-Entropy-Spiele für Sprachmodelle: Vom Impliziten Wissen bis hin zu allgemeinen Fähigkeiten | 语文模式跨英语运动会:从隐隐知识到一般能力措施 2506.06832v2 |
Authors (2): Clément Hongler, Andrew Emil
Large Language Models (LLMs) define probability measures on text. By considering the implicit knowledge question of what it means for an LLM to know such a measure and what it entails algorithmically, we are naturally led to formulate a series of tasks that go beyond generative sampling, involving forms of summarization, counterfactual thinking, anomaly detection, originality search, reverse prompting, debating, creative solving, etc. These tasks can be formulated as games based on LLM measures, which we call Cross-Entropy (Xent) Games. Xent Games can be single-player or multi-player. They involve cross-entropy scores and cross-entropy constraints, and can be expressed as simple computational graphs and programs. We show the Xent Game space is large enough to contain a wealth of interesting examples, while being constructible from basic game-theoretic consistency axioms. We then discuss how the Xent Game space can be used to measure the abilities of LLMs. This leads to the construction of Xent Game measures: finite families of Xent Games that can be used as capability benchmarks, built from a given scope, by extracting a covering measure. To address the unbounded scope problem associated with the challenge of measuring general abilities, we propose to explore the space of Xent Games in a coherent fashion, using ideas inspired by evolutionary dynamics.
Article 400
Title@2025-06-22 (7): Reinforcement Learning Teachers of Test Time Scaling
Title: Reinforcement Learning Teachers of Test Time Scaling | Verstärktes Lernen von Lehrern der Testzeitskalierung | 测试时间尺度强化学习教师 2506.08388v2 |
Authors (3): Edoardo Cetin, Tianyu Zhao, Yujin Tang
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL’s exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply “connect-the-dots” with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem’s solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
Article 401
Title@2025-06-22 (7): A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment
Title: A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment | Umfassendes Graphen-Framework für die Beantwortung von Fragen mit Modesuche-Präferenzausrichtung | 以模式寻找优先调整方式回答问题的综合图表框架 2506.17951v1 |
Authors (7): Quanwei Tang, Sophia Yat Mei Lee, Junshuang Wu, Dong Zhang, Shoushan Li, Erik Cambria, Guodong Zhou
Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our GraphMPA (https://github.com/tangquanwei/GraphMPA).
Article 402
Title@2025-06-22 (7): Scatter-Based Innovation Propagation in Large Language Models for Multi-Stage Process Adaptation
Title: Scatter-Based Innovation Propagation in Large Language Models for Multi-Stage Process Adaptation | Scatter-based Innovation Propagation in großen Sprachmodellen für Multi-Stage-Prozessanpassung | 多阶段程序适应大语言模型中基于散散的革新创新推广 2506.17949v1 |
Authors (1): Hong Su
Large Language Models (LLMs) exhibit strong capabilities in reproducing and extending patterns observed during pretraining but often struggle to generalize novel ideas beyond their original context. This paper addresses the challenge of applying such localized innovations - introduced at a specific stage or component - to other parts of a multi-stage process. We propose a scatter-based innovation expansion model (innovation scatter model) that guides the LLM through a four-step process: (1) identifying the core innovation by comparing the user’s input with its surrounding context, (2) generalizing the innovation by removing references to specific stages or components, (3) determining whether the generalized innovation applies to a broader scope beyond the original stage, and (4) systematically applying it to other structurally similar stages using the LLM. This model leverages structural redundancy across stages to improve the applicability of novel ideas. Verification results demonstrate that the innovation scatter model enables LLMs to extend innovations across structurally similar stages, thereby enhancing generalization and reuse.
Article 403
Title@2025-06-22 (7): Tutorial: $\varphi$-Transductions in OpenFst via the Gallic Semiring
Title: Tutorial: $\varphi$-Transductions in OpenFst via the Gallic Semiring | Tutorial: $\varphi$-Übersetzungen in OpenFst über den Gallischen Semiring | 教学: $\ varphie$- 通过加仑半径在 OpenFst 传输 2506.17942v1 |
Authors (2): Marco Cognetta, Cyril Allauzen
OpenFst, a popular finite-state transducer library, supports $\varphi$-transitions but, due to an implementation constraint, they cannot be used with transducers in a straightforward way. In this short tutorial, we describe how one can use other functionality provided by OpenFst (namely, the Gallic semiring) to correctly implement $\varphi$-transductions and demonstrate it by implementing the MaxMatch (WordPiece) tokenization algorithm (Devlin et al., 2019; Song et al., 2021). Accompanying self-contained code examples are provided. https://www.openfst.org/twiki/pub/Contrib/FstContrib/phi_transduction_tutorial_code.tgz
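As a point of reference for the algorithm the tutorial implements, here is a minimal plain-Python sketch of MaxMatch (greedy longest-match-first) tokenization of a single word. It does not use OpenFst or the Gallic semiring; the `##` continuation prefix and `[UNK]` token follow the usual WordPiece convention, and the toy vocabulary is an assumption made for illustration.

```python
def max_match(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first (MaxMatch) tokenization of one word.

    At each position, take the longest vocabulary entry that matches;
    non-initial pieces are looked up with the continuation prefix.
    If no piece matches at some position, the whole word maps to unk.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and retry
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical).
VOCAB = {"un", "a", "##a", "##f", "##aff", "##able"}
print(max_match("unaffable", VOCAB))  # ['un', '##aff', '##able']
print(max_match("xyz", VOCAB))        # ['[UNK]']
```

The inner loop's "try the longest candidate, fall back on failure" behaviour is the part that a failure ($\varphi$) transition expresses in the automaton formulation.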
Article 404
Title@2025-06-22 (7): Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Title: Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model | Stream-Omni: Gleichzeitige multimodale Interaktionen mit großem Sprach-Vision-Sprachmodell | 流流-奥米尼:与大语言-视觉-语音模型同时使用的多模式互动 2506.13642v2 |
Authors (5): Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
Article 405
Title@2025-06-22 (7): Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
Title: Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective | Evolving Prompts In-Context: Eine offene, sich selbst replizierende Perspektive | 不断演变的加速:一个开放的、自我复制的视角 2506.17930v1 |
Authors (3): Jianyu Wang, Zhiqiang Hu, Lidong Bing
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent “gibberish” can remarkably improve performance across diverse tasks. Notably, the “gibberish” always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose PromptQuine, a self-discovering prompt optimization framework: an evolutionary search framework that automatically searches for the pruning strategy by itself using only low-data regimes. Much like the emergent complexity in nature (such as symbiosis and self-organization) arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
Article 406
Title@2025-06-22 (7): Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
Title: Improving the Efficiency of Long Document Classification using Sentence Ranking Approach | Verbesserung der Effizienz der Langdokumentklassifikation mittels Sentence-Ranking-Ansatz | 采用判决分级办法提高长文件分类的效率 2506.07248v2 |
Authors (4): Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi
Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
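The fixed-count selection idea can be sketched in a few lines of standard-library Python. Each sentence is treated as its own document for IDF purposes, scored by the sum of its length-normalized TF-IDF weights, and the top-scoring sentences are kept in their original order. Whitespace tokenization, the exact scoring formula, and the function name are simplifying assumptions; the enhanced scoring strategy described in the abstract is not reproduced here.

```python
import math
from collections import Counter


def rank_sentences(sentences, top_k):
    """Keep the top_k sentences by summed TF-IDF weight, in original order."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(
            (count / len(toks)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(score)
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]


doc = [
    "the cat sat",
    "the dog sat",
    "quantum flux capacitor overload",
]
print(rank_sentences(doc, top_k=1))
```

A percentage-based variant would only change how `top_k` is derived, for example from a fraction of `len(sentences)`; the selected sentences are then concatenated and passed to the classifier in place of the full document.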
Article 407
Title@2025-06-22 (7): LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference
Title: LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference | LightRetriever: Eine LLM-basierte Hybrid-Retrieval-Architektur mit 1000x schnellerer Query-Inferenz | 光探索光: 基于 LLM 的混合回收结构, 具有 1000x 快速查询推断 2505.12260v2 |
Authors (6): Guangyuan Ma, Yongliang Ma, Xuanrui Gou, Zhenpeng Su, Ming Zhou, Songlin Hu
Large language model (LLM)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining an average of 95% of full-sized performance.
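To make "no more than an embedding lookup" concrete, the sketch below encodes a query by averaging precomputed per-token vectors and scores documents by dot product. The averaging step, the toy two-dimensional vectors, and all names are assumptions for illustration; the abstract does not describe how LightRetriever builds or combines its cached embeddings.

```python
def encode_query(tokens, table):
    """Average the cached vectors of known query tokens.

    table: dict of token -> precomputed vector (list of floats).
    No model forward pass happens at query time (assumed design).
    """
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        raise ValueError("no query token has a cached embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical cached token vectors and offline-encoded documents.
TOKEN_TABLE = {"red": [1.0, 0.0], "car": [0.0, 1.0]}
DOC_VECTORS = {"doc_a": [0.9, 0.8], "doc_b": [0.1, 0.2]}

query_vec = encode_query(["red", "car"], TOKEN_TABLE)
ranked = sorted(DOC_VECTORS, key=lambda d: dot(query_vec, DOC_VECTORS[d]), reverse=True)
print(query_vec, ranked)
```

The asymmetry is the point of the design: the expensive model runs only offline on documents, while online query cost reduces to table lookups and a mean.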
Article 408
Title@2025-06-22 (7): Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset | Perle: Ein multimodaler kulturbewusster arabischer Unterrichtsdatensatz | 珍珠:多式文化-知识阿拉伯文教学数据集 2505.21979v2 |
Authors (45): Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks Pearl and Pearl-Lite along with a specialized subset Pearl-X explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
Article 409
Title@2025-06-22 (7): Effective Red-Teaming of Policy-Adherent Agents
Title: Effective Red-Teaming of Policy-Adherent Agents | Effektives Red-Teaming von Policy-Adherent Agents | 有效的政策协调代理人红队 2506.09600v2 |
Authors (6): Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
Article 410
Title@2025-06-22 (7): DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models
Title: DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models | DSGram: Dynamische Gewichtung von Sub-Metriken zur Korrektur von Grammatikfehlern im Zeitalter großer Sprachmodelle | DSGram:大语言模型时代外貌错误校正动态加权子计量法 2412.12832v2 |
Authors (4): Jinxiang Xie, Yilin Li, Xunjian Yin, Xiaojun Wan
Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.
Article 411
Title@2025-06-22 (7): LGAI-EMBEDDING-Preview Technical Report
Title: LGAI-EMBEDDING-Preview Technical Report | LGAI-EMBEDDING-Vorschau Technischer Bericht | LGAI-EMBEDD-审查技术报告 2506.07438v2 |
Authors (9): Jooyoung Choi, Hyun Kim, Hansol Jang, Changwook Jun, Kyunghoon Bae, Hyewon Choi, Stanley Jungkyu Choi, Honglak Lee, Chulmin Yun
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
Article 412
Title@2025-06-22 (7): SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback
Title: SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback | SIPDO: Closed-Loop Prompt Optimierung über Synthetic Data Feedback | SIPDO:通过合成数据反馈,通过闭闭电话快速优化 2505.19514v2 |
Authors (5): Yaoning Yu, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang
Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
Article 413
Title@2025-06-22 (7): Large Language Models for Disease Diagnosis: A Scoping Review
Title: Large Language Models for Disease Diagnosis: A Scoping Review | Große Sprachmodelle für Krankheitsdiagnosen: Eine Bewertung | 疾病诊断大语言模型:范围审查 2409.00097v3 |
Authors (17): Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, Liqiao Xia, Jeremy Yeung, Daochen Zha, Dongming Cai, Genevieve B. Melton, Mingquan Lin, Rui Zhang
Automatic disease diagnosis has become increasingly valuable in clinical practice. The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence, with growing evidence supporting the efficacy of LLMs in diagnostic tasks. Despite the increasing attention in this field, a holistic view is still lacking. Many critical aspects remain unclear, such as the diseases and clinical data to which LLMs have been applied, the LLM techniques employed, and the evaluation methods used. In this article, we perform a comprehensive review of LLM-based methods for disease diagnosis. Our review examines the existing literature across various dimensions, including disease types and associated clinical specialties, clinical data, LLM techniques, and evaluation methods. Additionally, we offer recommendations for applying and evaluating LLMs for diagnostic tasks. Furthermore, we assess the limitations of current research and discuss future directions. To our knowledge, this is the first comprehensive review for LLM-based disease diagnosis.
Article 414
Title@2025-06-22 (7): ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
Title: ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training | ECHO-LlaMA: Effizientes Caching für Hochleistungs-LLaMA-Schulungen | ECHO-LLAMA: 高效率的高绩效拉马培训 2505.17331v2 |
Authors (8): Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu
This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher tokens-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
Article 415
Title@2025-06-22 (7): Multi-turn Jailbreaking via Global Refinement and Active Fabrication
Title: Multi-turn Jailbreaking via Global Refinement and Active Fabrication | Multiturn Jailbreaking über globale Veredelung und aktive Fabrikation | 通过全球精炼和积极制造 2506.17881v1 |
Authors (6): Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
Large Language Models (LLMs) have achieved exceptional performance across a wide range of tasks. However, they still pose significant safety risks due to their potential misuse for malicious purposes. Jailbreaks, which aim to induce models to generate harmful content, play a critical role in identifying the underlying security threats. Recent jailbreaking work primarily focuses on single-turn scenarios, while the more complicated multi-turn scenarios remain underexplored. Moreover, existing multi-turn jailbreaking techniques struggle to adapt to the evolving dynamics of dialogue as the interaction progresses. To address this limitation, we propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction. We also actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent questions. Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques across six state-of-the-art LLMs. Our code is publicly available at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.
Article 416
Title@2025-06-22 (7): How Alignment Shrinks the Generative Horizon
Title: How Alignment Shrinks the Generative Horizon | Wie Alignment den generativen Horizont schrumpft | 协同一致如何缩小生成地平线 2506.17871v1 |
Authors (2): Chenghao Yang, Ari Holtzman
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model’s output distribution. To quantify this concentration, we introduce the Branching Factor (BF) – a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate; and (2) alignment tuning substantially sharpens the model’s output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect: by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model’s behavior, but instead steers it toward stylistic tokens (e.g., “Sure”) that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs, clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
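To make "effective number of plausible next steps" concrete, here is a minimal sketch using exponentiated Shannon entropy, a standard way to turn a probability distribution into an effective count. The abstract does not give the paper's exact BF estimator, so this formula and the toy distributions are assumptions for illustration only.

```python
import math

def effective_branches(probs):
    """Return exp(Shannon entropy) of a next-token distribution.

    A uniform distribution over k tokens yields k; a distribution that
    concentrates on one token yields a value close to 1.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)

# Hypothetical next-token distributions over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]   # spread out, "base-model-like"
peaked = [0.97, 0.01, 0.01, 0.01]    # concentrated, "aligned-model-like"

bf_uniform = effective_branches(uniform)
bf_peaked = effective_branches(peaked)
```

Under this reading, a sharper output distribution directly lowers the effective count, which is the kind of concentration the abstract attributes to alignment tuning.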
Article 417
Title@2025-06-22 (7): QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs
Title: QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs | QueueEDIT: Strukturelle Selbstkorrektion für sequentielle Modellbearbeitung in LLMs | QueeEDIT: LLM 中序列模型编辑结构自校校 2506.17864v1 |
Authors (7): Taolin Zhang, Haidong Kang, Dongyang Li, Qizhou Chen, Chengyu Wang, Xiaofeng He, Richang Hong
Recently, large language models (LLMs) have demonstrated impressive results but still suffer from hallucinations. Model editing has been proposed to correct factual inaccuracies in LLMs. A challenging case is sequential model editing (SME), which aims to rectify errors continuously rather than treating them as a one-time task. During SME, the general capabilities of LLMs can be negatively affected due to the introduction of new parameters. In this paper, we propose a queue-based self-correction framework (QueueEDIT) that not only enhances SME performance by addressing long-sequence dependency but also mitigates the impact of parameter bias on the general capabilities of LLMs. Specifically, we first introduce a structural mapping editing loss to map the triplets to the knowledge-sensitive neurons within the Transformer layers of LLMs. We then store the located parameters for each piece of edited knowledge in a queue and dynamically align previously edited parameters. In each edit, we select queue parameters most relevant to the currently located parameters to determine whether previous knowledge needs realignment. Irrelevant parameters in the queue are frozen, and we update the parameters at the queue head to the LLM to ensure they do not harm general abilities. Experiments show that our framework significantly outperforms strong baselines across various SME settings and maintains competitiveness in single-turn editing. The resulting LLMs also preserve high capabilities in general NLP tasks throughout the SME process.
Article 418
Title@2025-06-22 (7): LLMs for Customized Marketing Content Generation and Evaluation at Scale
Title: LLMs for Customized Marketing Content Generation and Evaluation at Scale | LLMs für maßgeschneiderte Marketing Content Generierung und Evaluation auf Scale | 定制营销内容生成和评估规模评估LLM 2506.17863v1 |
Authors (4): Haoran Liu, Amir Tahmasbi, Ehtesham Sam Haque, Purak Jain
Offsite marketing is essential in e-commerce, enabling businesses to reach customers through external platforms and drive traffic to retail websites. However, most current offsite marketing content is overly generic, template-based, and poorly aligned with landing pages, limiting its effectiveness. To address these limitations, we propose MarketingFM, a retrieval-augmented system that integrates multiple data sources to generate keyword-specific ad copy with minimal human intervention. We validate MarketingFM via offline human and automated evaluations and large-scale online A/B tests. In one experiment, keyword-focused ad copy outperformed templates, achieving up to 9% higher CTR, 12% more impressions, and 0.38% lower CPC, demonstrating gains in ad ranking and cost efficiency. Despite these gains, human review of generated ads remains costly. To address this, we propose AutoEval-Main, an automated evaluation system that combines rule-based metrics with LLM-as-a-Judge techniques to ensure alignment with marketing principles. In experiments with large-scale human annotations, AutoEval-Main achieved 89.57% agreement with human reviewers. Building on this, we propose AutoEval-Update, a cost-efficient LLM-human collaborative framework to dynamically refine evaluation prompts and adapt to shifting criteria with minimal human input. By selectively sampling representative ads for human review and using a critic LLM to generate alignment reports, AutoEval-Update improves evaluation consistency while reducing manual effort. Experiments show the critic LLM suggests meaningful refinements, improving LLM-human agreement. Nonetheless, human oversight remains essential for setting thresholds and validating refinements before deployment.
Article 419
Title@2025-06-22 (7): Learning to Reason under Off-Policy Guidance
Title: Learning to Reason under Off-Policy Guidance | Unter außerpolitischer Anleitung zur Vernunft lernen | 根据非政策指导学习理由 2504.14945v5 |
Authors (8): Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently “on-policy”, limiting learning to a model’s own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
Article 420
Title@2025-06-21 (6): Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild
Title: Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild | Prototypische Mensch-AI-Kollaboration-Verhalten von LLM-Assisted Writing in the Wild | 野外LLM协助协助写作者的合作行为 2505.16023v3 |
Authors (4): Sheshera Mysore, Debarati Das, Hancheng Cao, Bahareh Sarrafzadeh
As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users’ intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.
Article 421
Title@2025-06-21 (6): THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction
Title: THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction | THCM-CAL: Zeitlich-hierarchische Kausalmodellierung mit konformer Kalibrierung für klinische Risikovorhersage | THCM-CAL: 临床风险预测与临床风险预测常规校准相结合的时高等级因果关系模型 2506.17844v1 |
Authors (5): Xin Zhang, Qiyu Wei, Yingjie Zhu, Fanyi Wu, Sophia Ananiadou
Automated clinical risk prediction from electronic health records (EHRs) demands modeling both structured diagnostic codes and unstructured narrative notes. However, most prior approaches either handle these modalities separately or rely on simplistic fusion strategies that ignore the directional, hierarchical causal interactions by which narrative observations precipitate diagnoses and propagate risk across admissions. In this paper, we propose THCM-CAL, a Temporal-Hierarchical Causal Model with Conformal Calibration. Our framework constructs a multimodal causal graph where nodes represent clinical entities from two modalities: textual propositions extracted from notes and ICD codes mapped to textual descriptions. Through hierarchical causal discovery, THCM-CAL infers three clinically grounded interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. To enhance prediction reliability, we extend conformal prediction to multi-label ICD coding, calibrating per-code confidence intervals under complex co-occurrences. Experimental results on MIMIC-III and MIMIC-IV demonstrate the superiority of THCM-CAL.
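As background for the calibration step, the following is a minimal sketch of standard split conformal prediction applied independently per code. It is not the paper's extension (which handles co-occurrences and is not detailed in the abstract); the nonconformity score `1 - p`, the calibration values, and the two ICD codes are assumptions chosen for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of nonconformity scores.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); when that rank
    exceeds n, no finite threshold gives the requested coverage.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]

def predict_codes(probs, thresholds):
    """Keep every code whose nonconformity (1 - p) is within its own threshold."""
    return sorted(code for code, p in probs.items()
                  if 1.0 - p <= thresholds[code])

# Hypothetical calibration scores (1 - predicted probability on true positives).
calibration = {
    "I10": [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90],
    "E11": [0.02, 0.05, 0.08, 0.10, 0.20, 0.25, 0.30, 0.35, 0.50],
}
thresholds = {code: conformal_threshold(scores, alpha=0.5)
              for code, scores in calibration.items()}
predicted = predict_codes({"I10": 0.75, "E11": 0.5}, thresholds)
```

Calibrating each code separately lets rare and frequent codes get different thresholds, at the cost of ignoring dependence between codes.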
Article 422
Title@2025-06-21 (6): Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
Title: Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach | Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz | 通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配 2506.17828v1 |
Authors (9): Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong
Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.
Article 423
Title@2025-06-21 (6): Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring
Title: Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight Task-Specific Adapters for Automatic Scoring | Effiziente Multi-Task-Inferenzierung mit einem geteilten Backbone und leichten Task-Spezifischen Adaptern für die automatische Bewertung | 与共享的后骨和轻型任务特定适应器进行自动 Scorting 2412.21065v2 |
Authors (2): Ehsan Latif, Xiaoming Zhai
The integration of Artificial Intelligence (AI) in education requires scalable and efficient frameworks that balance performance, adaptability, and cost. This paper addresses these needs by proposing a shared backbone model architecture enhanced with lightweight LoRA adapters for task-specific fine-tuning, targeting the automated scoring of student responses across 27 mutually exclusive tasks. By achieving competitive performance (average QWK of 0.848 compared to 0.888 for fully fine-tuned models) while reducing GPU memory consumption by 60% and inference latency by 40%, the framework demonstrates significant efficiency gains. This approach aligns with the workshop’s focus on improving language models for educational tasks, creating responsible innovations for cost-sensitive deployment, and supporting educators by streamlining assessment workflows. The findings underscore the potential of scalable AI to enhance learning outcomes while maintaining fairness and transparency in automated scoring systems.
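For readers unfamiliar with the reported metric, here is a minimal sketch of quadratic weighted kappa (QWK) for integer scores in `0..num_labels-1`. The example ratings are hypothetical; the function assumes both raters use at least two distinct labels overall, otherwise the expected-disagreement denominator is zero.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Quadratic weighted kappa between two lists of integer labels."""
    n = len(rater_a)
    # Observed co-occurrence counts of (label from A, label from B).
    observed = [[0.0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels))
              for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # counts under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical human scores vs. model scores on a 0-2 rubric.
human = [0, 1, 2, 1]
model_same = [0, 1, 2, 1]
model_reversed = [2, 1, 0, 1]
```

Because the penalty grows with the squared distance between scores, near-misses cost far less than large disagreements, which suits ordinal rubrics.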
Article 424
Title@2025-06-21 (6): Evaluating LLMs with Multiple Problems at once
Title: Evaluating LLMs with Multiple Problems at once | Bewertung von LLMs mit mehreren Problemen auf einmal | 立即评价有多重问题的LLMs 2406.10786v3 |
Authors (3): Zhengxiang Wang, Jordan Kodner, Owen Rambow
This paper shows the benefits of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all these problems in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling capabilities in LLMs. We release our corpus and code to facilitate future research.
Article 425
Title@2025-06-21 (6): Bayesian Social Deduction with Graph-Informed Language Models
Title: Bayesian Social Deduction with Graph-Informed Language Models | Bayesische soziale Deduktion mit Graphen-informierten Sprachmodellen | 采用图形化语言模型的巴伊斯社会衰退 2506.17788v1 |
Authors (7): Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell
Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
Article 426
Title@2025-06-21 (6): Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models
Title: Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models | Über die Instruktionskonditionierung hinaus, MoTE: Mischung von Task-Experten für Multi-Task-Einbettungsmodelle | 超越教学-调控,MOTE:多任务嵌入模型任务专家混合 2506.17781v1 |
Authors (5): Miguel Romero, Shuoyang Ding, Corey D. Barret, Georgiana Dinu, George Karypis
Dense embeddings are fundamental to modern machine learning systems, powering Retrieval-Augmented Generation (RAG), information retrieval, and representation learning. While instruction-conditioning has become the dominant approach for embedding specialization, its direct application to low-capacity models imposes fundamental representational constraints that limit the performance gains derived from specialization. In this paper, we analyze these limitations and introduce the Mixture of Task Experts (MoTE) transformer block, which leverages task-specialized parameters trained with Task-Aware Contrastive Learning (TACL) to enhance the model’s ability to generate specialized embeddings. Empirical results show that MoTE achieves 64% higher performance gains in retrieval datasets (+3.27 → +5.21) and 43% higher performance gains across all datasets (+1.81 → +2.60). Critically, these gains are achieved without altering instructions, training data, inference time, or number of active parameters.
Article 427
Title@2025-06-21 (6): Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5
Title: Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | Benchmarking und Aufbau eines Null-Shot Hindi Retrieval Modells mit Hindi-BEIR und NLLB-E5 | 与印地语-BEIR和NLLB-E5一起制定基准和构建零热印地山脉回收模型 2409.05401v3 |
Authors (4): Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.
Article 428
Title@2025-06-21 (6): DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
Title: DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training | DUMP: Automatisiertes Lehrplanlernen auf Verteilungsebene für RL-basiertes LLM-Post-Training | DDMP: 以LLLLM为基础的LLM培训后课程自动分发级别课程学习 2504.09710v2 |
Authors (5): Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao
Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
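The UCB idea described above can be sketched as follows: each distribution gets a score equal to its mean absolute advantage plus an exploration bonus that shrinks with its sample count, and scores are turned into sampling probabilities. The exact bonus form, the smoothing by `+1`, the softmax with a temperature, and all numbers are assumptions for illustration, not the released implementation.

```python
import math

def ucb_scores(mean_abs_advantage, counts, c=1.0):
    """Exploitation term (mean |advantage|) plus a count-based exploration bonus."""
    total = sum(counts)
    return [
        mean + c * math.sqrt(math.log(total + 1) / (n + 1))
        for mean, n in zip(mean_abs_advantage, counts)
    ]

def sampling_probs(scores, temperature=0.1):
    """Softmax over scores; lower temperature concentrates on the top score."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical statistics for three data distributions:
# 0: useful and well sampled, 1: useful but rarely sampled, 2: mostly learned.
mean_abs_advantage = [0.5, 0.5, 0.1]
counts = [100, 5, 100]

scores = ucb_scores(mean_abs_advantage, counts)
probs = sampling_probs(scores)
```

Here the rarely sampled distribution receives the largest probability because its exploration bonus dominates, and the mostly learned one receives the smallest.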
Article 429
Title@2025-06-21 (6): HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations
Title: HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations | HIDE und Suche: Halluzinationen in Sprachmodellen über entkoppelte Repräsentationen erkennen | HIDE & Sear:通过拆分代表方式检测语言模型中的幻觉 2506.17748v1 |
Authors (3): Anwoy Chatterjee, Yash Goel, Tanmoy Chakraborty
Contemporary Language Models (LMs), while impressively fluent, often generate content that is factually incorrect or unfaithful to the input context - a critical issue commonly referred to as ‘hallucination’. This tendency of LMs to generate hallucinated content undermines their reliability, especially because these fabrications are often highly convincing and therefore difficult to detect. While several existing methods attempt to detect hallucinations, most rely on analyzing multiple generations per input, leading to increased computational cost and latency. To address this, we propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our approach leverages the hypothesis that hallucinations result from a statistical decoupling between an LM’s internal representations of input context and its generated output. We quantify this decoupling using the Hilbert-Schmidt Independence Criterion (HSIC) applied to hidden-state representations extracted while generating the output sequence. We conduct extensive experiments on four diverse question answering datasets, evaluating both faithfulness and factuality hallucinations across six open-source LMs of varying scales and properties. Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings, achieving an average relative improvement of ~29% in AUC-ROC over the best-performing single-pass strategy across various models and datasets. Additionally, HIDE shows competitive and often superior performance with multi-pass state-of-the-art methods, obtaining an average relative improvement of ~3% in AUC-ROC while consuming ~51% less computation time. Our findings highlight the effectiveness of exploiting internal representation decoupling in LMs for efficient and practical hallucination detection.
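A minimal sketch of the dependence measure named above: the biased empirical HSIC estimator, trace(K H L H) / (n - 1)^2, with RBF kernels. The toy vectors stand in for hidden states of input and output tokens; the kernel choice, bandwidth, and data are assumptions, and how HIDE selects layers or tokens is not covered here.

```python
import math

def rbf_kernel(points, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel over a list of vectors."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2)) for v in points]
            for u in points]

def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    col_mean = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    k_centered = center(rbf_kernel(xs, sigma))
    l_gram = rbf_kernel(ys, sigma)
    total = sum(k_centered[i][j] * l_gram[i][j]
                for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

# Hypothetical 1-D "hidden states": outputs that track the inputs vs. outputs
# that carry no information about them.
inputs = [[0.0], [1.0], [2.0], [3.0]]
coupled_outputs = [[0.0], [1.0], [2.0], [3.0]]
decoupled_outputs = [[5.0], [5.0], [5.0], [5.0]]

hsic_coupled = hsic(inputs, coupled_outputs)
hsic_decoupled = hsic(inputs, decoupled_outputs)
```

Under the abstract's hypothesis, a low score between input-context and output representations would be the signal associated with hallucination.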
Article 430
Title@2025-06-21 (6): Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages
Title: Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages | Enthüllungsfaktoren für ein verbessertes POS-Tagging: Eine Studie über ressourcenarme mittelalterliche romanische Sprachen | 强化POS贴标签的难解因素:低资源中世纪罗姆语言研究 2506.17715v1 |
Authors (7): Matthias Schöffel, Esteban Garces Arias, Marinus Wiedner, Paula Ruppert, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs’ ability to process historical language variations and non-standardized spelling and promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.
Article 431
Title@2025-06-21 (6): Aged to Perfection: Machine-Learning Maps of Age in Conversational English
Title: Aged to Perfection: Machine-Learning Maps of Age in Conversational English | Gealtert bis zur Perfektion: Machine-Learning Alterskarten im Conversational English | 成熟至完美:计算机学习时代图,英语 2506.17708v1 |
Authors (1): MingZe Tang
The study uses the British National Corpus 2014, a large sample of contemporary spoken British English, to investigate how language patterns vary across age groups, exploring the connection between speaker demographics and linguistic factors such as utterance duration, lexical diversity, and word choice. By merging computational language analysis and machine learning methodologies, we attempt to uncover distinctive linguistic markers characteristic of multiple generations and create prediction models that can consistently estimate the speaker’s age group from various aspects. This work contributes to our knowledge of sociolinguistic diversity throughout the life of modern British speech.
Article 432
Title@2025-06-21 (6): The Evolution of Natural Language Processing: How Prompt Optimization and Language Models are Shaping the Future
Title: The Evolution of Natural Language Processing: How Prompt Optimization and Language Models are Shaping the Future | Die Evolution der natürlichen Sprachverarbeitung: Wie schnell Optimierung und Sprachmodelle die Zukunft gestalten | 《自然语言处理过程的演变:如何迅速优化和语言模式正在塑造未来》 2506.17700v1 |
Authors (4): Summra Saleem, Muhammad Nabeel Asim, Shaista Zulfiqar, Andreas Dengel
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by automating traditional labor-intensive tasks and consequently accelerated the development of computer-aided applications. As researchers continue to advance this field with the introduction of novel language models and more efficient training/finetuning methodologies, the idea of prompt engineering and subsequent optimization strategies with LLMs has emerged as a particularly impactful trend to yield a substantial performance boost across diverse NLP tasks. To the best of our knowledge, numerous review articles have explored prompt engineering; however, a critical gap exists in comprehensive analyses of prompt optimization strategies. To bridge this gap, this paper provides unique and comprehensive insights about the potential of diverse prompt optimization strategies. It analyzes their underlying working paradigms and, based on these principles, categorizes them into 11 distinct classes. Moreover, the paper provides details about various NLP tasks where these prompt optimization strategies have been employed, along with details of different LLMs and benchmark datasets used for evaluation. This comprehensive compilation lays a robust foundation for future comparative studies and enables rigorous assessment of prompt optimization and LLM-based predictive pipelines under consistent experimental settings: a critical need in the current landscape. Ultimately, this research will centralize diverse strategic knowledge to facilitate the adaptation of existing prompt optimization strategies for the development of innovative predictors across unexplored tasks.
Article 433
Title@2025-06-21 (6): Zero-Shot Conversational Stance Detection: Dataset and Approaches
Title: Zero-Shot Conversational Stance Detection: Dataset and Approaches | Zero-Shot Conversational Stance Detection: Datensatz und Ansätze | 零热对调调检测:数据集和方法 2506.17693v1 |
Authors (8): Yuzhe Ding, Kang He, Bobo Li, Li Zheng, Haijun He, Fei Li, Chong Teng, Donghong Ji
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the increasing number of online debates among social media users, conversational stance detection has become a crucial research area. However, existing conversational stance detection datasets are restricted to a limited set of specific targets, which constrains the effectiveness of stance detection models when encountering a large number of unseen targets in real-world applications. To bridge this gap, we manually curate a large-scale, high-quality zero-shot conversational stance detection dataset, named ZS-CSD, comprising 280 targets across two distinct target types. Leveraging the ZS-CSD dataset, we propose SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, and establish the benchmark performance in the zero-shot setting. Experimental results demonstrate that our proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection. Notably, the SITPCL model attains only an F1-macro score of 43.81%, highlighting the persistent challenges in zero-shot conversational stance detection.
Article 434
Title@2025-06-21 (6): Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering
Title: Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering | Ressourcenfreundliche dynamische Verbesserungskette für Multi-Hop-Fragebeantwortung | 多种问题解答资源友好型动态增强链 2506.17692v1 |
Authors (9): Binquan Ji, Haibo Luo, Yifei Lu, Lei Hei, Jiaqi Wang, Tingjing Liao, Lingyu Wang, Shichao Wang, Feiliang Ren
Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges, such as hallucinations and semantic drift, for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments.
Article 435
Title@2025-06-21 (6): Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models
Title: Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models | Verbesserung des wenigen Schlagwörters Spotting Performance durch vorgefertigte selbstüberwachte Sprachmodelle | 通过培训前自我监督的演讲模式,加强 “ 通过培训前自我监督的演讲模式 “ 微小的 “ 关键词 “ 突出成绩 “ 2506.17686v1 |
Authors (4): Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar
Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0, is trained using Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard lightweight ResNet15 student model. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets. Notably, the proposed training method improves the 10-shot classification accuracy from 33.4% to 74.1% on 11 classes at 1% false alarm accuracy on the GSC dataset, thus making it significantly better suited for a real use case scenario.
Article 436
Title@2025-06-21 (6): Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
Title: Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference | Vernunftschaltungen in Sprachmodellen: Eine mechanistische Interpretation der syllogistischen Inferenz | 语言模型中说明理由的电路:对音频推断的机械解释 2408.08590v3 |
Authors (3): Geonhee Kim, Marco Valentino, André Freitas
Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To understand and uncover the mechanisms adopted for formal reasoning in LMs, this paper presents a mechanistic interpretation of syllogistic inference. Specifically, we present a methodology for circuit discovery aimed at interpreting content-independent and formal reasoning mechanisms. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic inference, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes, model sizes and architectures. The identified circuit is sufficient and necessary for syllogistic schemes on which the models achieve high accuracy (>60%), with compatible activation patterns across models of different families. Overall, our findings suggest that LMs learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalizable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.
Article 437
Title@2025-06-21 (6): Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Title: Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization | Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung | 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v2 |
Authors (4): Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state of the art for robust unlearning.
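The sign-agreement rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `disruption_mask`, the toy gradient values, and the convention that both gradients are expressed in the same update direction are assumptions.

```python
import numpy as np

def disruption_mask(unlearn_grad, retain_grad):
    """Keep an unlearning-gradient entry only where its sign matches the retain gradient.

    Entries with disagreeing signs are zeroed, so the corresponding weights
    receive no update (hypothetical helper, for illustration only).
    """
    same_sign = np.sign(unlearn_grad) == np.sign(retain_grad)
    return np.where(same_sign, unlearn_grad, 0.0)

# Toy per-weight gradients (made-up values).
unlearn_grad = np.array([0.5, -0.2, 0.3, -0.4])
retain_grad = np.array([0.1, 0.3, -0.2, -0.6])

masked = disruption_mask(unlearn_grad, retain_grad)
# Signs agree at positions 0 and 3 only, so masked is [0.5, 0.0, 0.0, -0.4].
```

The normalization and meta-learning components mentioned in the abstract are not shown here.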
Article 438
Title@2025-06-21 (6): FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
Title: FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies | FaithfulSAE: Auf dem Weg zur Erfassung treuer Funktionen mit Sparse Autoencodern ohne externe Datensatzabhängigkeiten | 忠实的SAE:在没有外部数据集依赖性的情况下, 与粗略自动解析器一起获取忠实的特征 2506.17673v1 |
Authors (6): Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed
Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model’s generalisation capabilities. This can result in hallucinated SAE features, which we term “Fake Features”, that misrepresent the model’s internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model’s own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
Article 439
Title@2025-06-21 (6): TPTT: Transforming Pretrained Transformer into Titans
Title: TPTT: Transforming Pretrained Transformer into Titans | TPTT: Transformieren des vortrainierten Transformers in Titanen | TPTT: 将预训练变形器转换成巨人 2506.17671v1 |
Authors (1): Fabien Furfaro
Recent advances in large language models (LLMs) have led to remarkable progress in natural language processing, but their computational and memory demands remain a significant challenge, particularly for long-context inference. We introduce TPTT (Transforming Pretrained Transformer into Titans), a novel framework for enhancing pretrained Transformer models with efficient linearized attention mechanisms and advanced memory management. TPTT employs techniques such as Memory as Gate (MaG) and mixed linearized attention (LiZA). It is fully compatible with the Hugging Face Transformers library, enabling seamless adaptation of any causal LLM through parameter-efficient fine-tuning (LoRA) without full retraining. We show the effectiveness of TPTT on the MMLU benchmark with models of approximately 1 billion parameters, observing substantial improvements in both efficiency and accuracy. For instance, Titans-Llama-3.2-1B achieves a 20% increase in Exact Match (EM) over its baseline. Statistical analyses and comparisons with recent state-of-the-art methods confirm the practical scalability and robustness of TPTT. Code is available at https://github.com/fabienfrfr/tptt and the Python package at https://pypi.org/project/tptt/.
Article 440
Title@2025-06-21 (6): Stop Overvaluing Multi-Agent Debate – We Must Rethink Evaluation and Embrace Model Heterogeneity
Title: Stop Overvaluing Multi-Agent Debate – We Must Rethink Evaluation and Embrace Model Heterogeneity | Mehr-Agenten-Debatte stoppen – Wir müssen Bewertung neu denken und Modell Heterogenität umarmen | 停止高估多机构辩论 – – 我们必须重新思考评价和拥抱模型多样性 2502.08788v3 |
Authors (8): Hangfan Zhang, Zhiyao Cui, Jianhao Chen, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu
Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices, including limited benchmark coverage, weak baseline comparisons, and inconsistent setups. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models. Surprisingly, our findings reveal that MAD often fails to outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, even when consuming significantly more inference-time computation. To advance MAD research, we further explore the role of model heterogeneity and find that it acts as a universal antidote that consistently improves current MAD frameworks. Based on our findings, we argue that the field must stop overvaluing MAD in its current form; for true advancement, we must critically rethink evaluation paradigms and actively embrace model heterogeneity as a core design principle.
Article 441
Title@2025-06-21 (6): How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs
Title: How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs | Wie numerische Präzision die Fähigkeit von LLMs zur Arithmetik beeinflusst | 数字精确度如何影响LLM 的理理原因能力 2410.13857v2 |
Authors (9): Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, Liwei Wang
Despite the remarkable success of Transformer-based large language models (LLMs) across various domains, understanding and enhancing their mathematical capabilities remains a significant challenge. In this paper, we conduct a rigorous theoretical analysis of LLMs’ mathematical abilities, with a specific focus on their arithmetic performance. We identify numerical precision as a key factor that influences their effectiveness in arithmetical tasks. Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length. In contrast, Transformers with standard numerical precision can efficiently handle these tasks with significantly smaller model sizes. We further support our theoretical findings through empirical experiments that explore the impact of varying numerical precision on arithmetic tasks, providing valuable insights for improving the mathematical reasoning capabilities of LLMs.
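A loose, hardware-level analogy for why precision matters in iterated addition is shown below. It is not the paper's theoretical setting, which concerns the precision available inside a Transformer; the helper `iterated_add` and the choice of `float16` versus `float64` are assumptions made purely for illustration.

```python
import numpy as np

def iterated_add(values, dtype):
    """Sum values one at a time, rounding the running total to dtype after each step."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)

ones = [1.0] * 4096
low = iterated_add(ones, np.float16)    # half precision: 11-bit significand
high = iterated_add(ones, np.float64)   # double precision

# In float16 the running sum stalls at 2048: 2049 is not representable and
# 2048 + 1 rounds back to 2048, so low == 2048.0 while high == 4096.0.
```

The point of the sketch is only that a fixed, small number of bits can silently lose information as the number of operands grows.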
Article 442
Title@2025-06-21 (6): Comba: Improving Bilinear RNNs with Closed-loop Control
Title: Comba: Improving Bilinear RNNs with Closed-loop Control | Comba: Bilineare RNNs mit Closed-Loop-Steuerung verbessern | Comba: 改进有闭环控制的双线区域网网 2506.02475v3 |
Authors (8): Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun
Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling.
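For background, the classic delta learning rule for an associative memory can be written as a single recurrent step, sketched below. This is the generic textbook rule, not Comba's scalar-plus-low-rank transition or its feedback corrections; the function name, shapes, and `beta` value are assumptions.

```python
import numpy as np

def delta_rule_step(state, key, value, beta):
    """One delta-rule update: move the memory's prediction for `key` toward `value`.

    state: (d_v, d_k) memory matrix; key: (d_k,); value: (d_v,); beta: step size.
    The product state @ key is the state-key interaction that makes the
    recurrence bilinear in the state and the input.
    """
    prediction = state @ key
    return state + beta * np.outer(value - prediction, key)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
state = np.zeros((d_v, d_k))
key = rng.normal(size=d_k)
key /= np.linalg.norm(key)          # unit-norm key (assumption)
value = rng.normal(size=d_v)

state = delta_rule_step(state, key, value, beta=1.0)
recalled = state @ key              # with a unit key and beta = 1, this recovers value
```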
Article 443
Title@2025-06-21 (6): Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation
Title: Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation | Schritt-Opt: Steigerung der Optimierungsmodellierung in LLMs durch iterative Datensynthese und strukturierte Validierung | 通过迭代数据合成和结构化校验,促进通过迭代数据合成和结构化校验,在LLMs中建立优化优化模型模型 2506.17637v1 |
Authors (6): Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng
Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.
Article 444
Title@2025-06-21 (6): Self-Preference Bias in LLM-as-a-Judge
Title: Self-Preference Bias in LLM-as-a-Judge | Selbstpräferenz Bias in LLM-as-a-Richter | 以LLM-as-a-jujuds为主的自选比亚斯语 2410.21819v2 |
Authors (3): Koki Wataoka, Tsubasa Takahashi, Ryokan Ri
Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators do, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
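Since the argument hinges on perplexity as a proxy for familiarity, a minimal sketch of the standard definition may help: perplexity is the exponential of the negative mean token log-probability. The token probabilities below are made up, and the abstract does not say how the authors obtained per-token log-probabilities.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical outputs: one the model finds likely, one it finds unlikely.
familiar = perplexity([math.log(0.5)] * 4)     # every token has probability 0.5
unfamiliar = perplexity([math.log(0.1)] * 4)   # every token has probability 0.1

# familiar is close to 2 and unfamiliar is close to 10: lower perplexity
# means the text is more predictable to the model.
```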
Article 445
Title@2025-06-21 (6): Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs
Title: Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs | Antwort-Centric oder Reasoning-Driven? Enthüllen des Latent Memory Ankers in LLMs | 解答中心或理由驱动? 解开 LLMS 中隐藏的内存点锚 2506.17630v1 |
Authors (8): Yang Wu, Yifan Zhang, Yiwei Wang, Yujun Cai, Yurong Wu, Yuran Wang, Ning Xu, Jian Cheng
While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Experiments across state-of-the-art LLMs reveal a strong and consistent reliance on explicit answers. The performance drops by 26.90% when answer cues are masked, even with complete reasoning chains. These findings suggest that much of the reasoning exhibited by LLMs may reflect post-hoc rationalization rather than true inference, calling into question their inferential depth. Our study uncovers the answer-anchoring phenomenon with rigorous empirical validation and underscores the need for a more nuanced understanding of what constitutes reasoning in LLMs.
Article 446
Title@2025-06-21 (6): UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Title: UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation | UniMoT: Unified Molecule-Text Language Model mit diskreter Token-Darstellung | UniMoT: 具有分立调制调制解析器表示式的统一分子文字语言模式 2408.00863v2 |
Authors (6): Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, Quanming Yao
The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
Article 447
Title@2025-06-21 (6): CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Title: CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning | CLiViS: Unleashing Kognitive Karte durch Linguistisch-Visuelle Synergie für eingedickte visuelle Vernunft | CLiVVIS:通过视觉机能理性的语言-视觉协同法解析认知图 2506.17629v1 |
Authors (6): Kailing Li, Qi’ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
Article 448
Title@2025-06-21 (6): Dual Debiasing for Noisy In-Context Learning for Text Generation
Title: Dual Debiasing for Noisy In-Context Learning for Text Generation | Dual Debiasing für lautes In-Context-Lernen für Textgenerierung | 为产生文本进行有噪音的内文学习双向偏差 2506.00418v2 |
Authors (4): Siqi Liang, Sumyeong Ahn, Paramveer S. Dhillon, Jiayu Zhou
In-context learning (ICL) relies heavily on high-quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity-based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain-specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method’s superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
Article 449
Title@2025-06-21 (6): A Closer Look into Mixture-of-Experts in Large Language Models
Title: A Closer Look into Mixture-of-Experts in Large Language Models | Ein genauerer Blick in Mixture-of-Experts in großen Sprachmodellen | 更密切地研究大语言模型混合专家 2406.18219v3 |
Authors (5): Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three popular MoE-based models and reveal some intriguing observations, including 1) Neurons act like fine-grained experts; 2) The router of MoE usually selects experts with larger output norms; 3) The expert diversity increases as the layer increases, while the last layer is an outlier, which is further validated by an initial experiment. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
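To make "sparsely activating a subset of parameters for each token" concrete, here is a minimal top-k routing sketch. It is a generic illustration under stated assumptions, not the architecture of any of the three models studied: experts are plain linear maps, the gate is a softmax over the selected logits only, and all names and shapes are hypothetical.

```python
import numpy as np

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts per token and softmax-normalize their logits."""
    top_idx = np.argsort(router_logits, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Route each token to k experts and mix their outputs by the gate weights."""
    logits = tokens @ router_w                      # (n_tokens, n_experts)
    idx, weights = top_k_route(logits, k)
    out = np.zeros((tokens.shape[0], expert_ws.shape[2]))
    for t in range(tokens.shape[0]):
        for slot in range(k):
            expert = idx[t, slot]
            # Only the selected experts' parameters are touched for this token.
            out[t] += weights[t, slot] * (tokens[t] @ expert_ws[expert])
    return out, idx, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))          # 5 tokens, hidden size 8
router_w = rng.normal(size=(8, 4))        # router over 4 experts
expert_ws = rng.normal(size=(4, 8, 8))    # one linear map per expert
out, idx, weights = moe_forward(tokens, router_w, expert_ws, k=2)
```

With k = 2 of 4 experts, each token uses half of the expert parameters, which is the sense in which model size can grow without a matching growth in per-token compute.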
Article 450
Title@2025-06-21 (6): Anthropocentric bias in language model evaluation
Title: Anthropocentric bias in language model evaluation | Anthropozentrische Voreingenommenheit in der Sprachmodellbewertung | 语言模式评价中的人文中心偏见 2407.03859v2 |
Authors (2): Raphaël Millière, Charles Rathkopf
Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (“auxiliary oversight”), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (“mechanistic chauvinism”). Mitigating these biases necessitates an empirically-driven, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, which can be done by supplementing carefully designed behavioral experiments with mechanistic studies.
Article 451
Title@2025-06-21 (6): A Dual-Directional Context-Aware Test-Time Learning for Text Classification
Title: A Dual-Directional Context-Aware Test-Time Learning for Text Classification | Ein Dual-Directional Context-Aware Test-Time Learning für die Textklassifikation | 文本分类双调背景-软件-测试-时间学习 2503.15469v5 |
Authors (5): Dong Xu, Mengyao Liao, Zhenglin Lai, Xueliang Li, Junkai Ji
Text classification assigns text to predefined categories. Traditional methods struggle with complex structures and long-range dependencies. Deep learning with recurrent neural networks and Transformer models has improved feature extraction and context awareness. However, these models still trade off interpretability, efficiency and contextual range. We propose the Dynamic Bidirectional Elman Attention Network (DBEAN). DBEAN combines bidirectional temporal modeling and self-attention. It dynamically weights critical input segments and preserves computational efficiency.
Article 452
Title@2025-06-21 (6): OpusLM: A Family of Open Unified Speech Language Models
Title: OpusLM: A Family of Open Unified Speech Language Models | OpusLM: Eine Familie von offenen, einheitlichen Sprachmodellen | OpusLM: “ 开放统一语言模式 “ 家庭 2506.17611v1 |
Authors (12): Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate that our OpusLMs achieve performance comparable (or even superior) to that of existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
Article 453
Title@2025-06-21 (6): TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting
Title: TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting | TyphoFormer: Sprachgesteigerter Transformer für präzise Typhoon-Track-Prognose | 台风前台风:用于准确预报台风轨道的语文增强变换器 2506.17609v1 |
Authors (6): Lincan Li, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong
Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use a Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments on the HURDAT2 benchmark show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
Article 454
Title@2025-06-21 (6): Mind the Gap: Assessing Wiktionary’s Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages
Title: Mind the Gap: Assessing Wiktionary’s Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages | Mind the Gap: Bewertung von Wiktionarys massenhaft linguistischem Wissen über morphologische Lücken in zwei verwandten Sprachen | 认识差距:评估维基多里人对两种相关语言的病理差距的密集语言知识 2506.17603v1 |
Authors (2): Jonathan Sakunkoo, Annabella Sakunkoo
Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
Article 455
Title@2025-06-21 (6): Steering LLMs for Formal Theorem Proving
Title: Steering LLMs for Formal Theorem Proving | Lenkung LLMs für formale Theorem Proving | 正式理论证明指导LLMs 2502.15507v4 |
Authors (2): Shashank Kirtania, Arun Iyer
Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in proofs, leading practitioners to use different sampling techniques to improve LLMs' capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLMs' responses and improve generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.
Article 456
Title@2025-06-21 (6): Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Title: Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models | Cite Pretrain: Retrieval-freie Wissenszuweisung für große Sprachmodelle | Cite Prettrain: 大语言模型的检索-无知识归属 2506.17585v1 |
Authors (5): Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
Trustworthy language models should provide both correct and verifiable answers. While language models can sometimes attribute their outputs to pretraining data, their citations are often unreliable due to hallucination. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during (continual) pretraining–without test-time retrieval–by revising the training process. To evaluate this, we release CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short-form (single fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to bind facts to persistent document identifiers, and (2) instruction tuning to elicit citation behavior. We find that simple Passive Indexing, which appends an identifier to each document, helps memorize verbatim text but fails on paraphrased or compositional facts. Instead, we propose Active Indexing, which continually pretrains on synthetic QA pairs that (1) restate each fact in diverse compositional forms, and (2) require bidirectional source-to-fact and fact-to-source generation, jointly teaching the model to generate content from a cited source and to attribute its own answers. Experiments with Qwen2.5-7B and 3B show that Active Indexing consistently outperforms Passive Indexing across all tasks and models, with citation precision gains up to 30.2 percent. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16 times the original token count.
Article 457
Title@2025-06-21 (6): AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition
Title: AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition | AgriCHN: Eine umfassende Cross-Domain-Ressource für die Anerkennung der chinesischen Landwirtschaftseinheit | AGRICHN:中国农业命名实体认证综合跨部门资源 2506.17578v1 |
Authors (10): Lingxiao Zeng, Yiqi Tong, Wei Guo, Huarui Wu, Lihao Ge, Yijun Ye, Fuzhen Zhuang, Deqing Wang, Wei Guo, Cheng Chen
Agricultural named entity recognition is a specialized task focusing on identifying distinct agricultural entities within vast bodies of text, including crops, diseases, pests, and fertilizers. It plays a crucial role in enhancing information extraction from extensive agricultural text resources. However, the scarcity of high-quality agricultural datasets, particularly in Chinese, has resulted in suboptimal performance when employing mainstream methods for this purpose. Most earlier works focus only on annotating agricultural entities while overlooking the profound correlation of agriculture with hydrology and meteorology. To fill this gap, we present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The AgriCHN dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions spanning 27 diverse entity categories. Furthermore, it encompasses entities from hydrology to meteorology, thereby enriching the diversity of entities considered. Data validation reveals that, compared with relevant resources, AgriCHN demonstrates outstanding data quality, attributable to its richer agricultural entity types and more fine-grained entity divisions. A benchmark task has also been constructed using several state-of-the-art neural NER models. Extensive experimental results highlight the significant challenge posed by AgriCHN and its potential for further research.
Article 458
Title@2025-06-21 (6): Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions
Title: Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions | Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen | 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v3 |
Authors (6): Minwoo Kang, Suhong Moon, Seung Hyeong Lee, Ayush Raj, Joseph Suh, David M. Chan
Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses to various surveys and polls. However, the questions in these surveys usually reflect socially understood attitudes: the patterns of attitudes of old/young, liberal/conservative, as understood by both members and non-members of those groups. It is not clear whether the LLM binding is “deep”, meaning the LLM answers as a member of a particular in-group would, or “shallow”, meaning the LLM responds as an out-group member believes an in-group member would. To explore this difference, we use questions that expose known in-group/out-group biases. This level of fidelity is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user “backstories” generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies of in-group/out-group biases. Altogether, our work extends the applicability of LLMs beyond estimating socially understood responses, enabling their use in a broader range of human studies.
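As a rough illustration of how a Wasserstein distance compares two survey response distributions, the sketch below computes the 1-Wasserstein distance over ordered answer options with unit spacing. The exact variant used in the paper is not stated in the abstract, so this formulation is an assumption, and the `human` and `persona` numbers are made up for illustration.

```python
from itertools import accumulate

def wasserstein_1d(p, q):
    """1-Wasserstein distance between two distributions over the same
    ordered options, assuming unit spacing between adjacent options.
    For 1-D distributions this equals the summed absolute CDF difference."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same options")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# Illustrative (made-up) response shares over a 4-point scale.
human = [0.10, 0.20, 0.40, 0.30]
persona = [0.05, 0.25, 0.40, 0.30]
distance = wasserstein_1d(human, persona)
```

A smaller distance means the simulated responses sit closer to the human ones; moving all mass from the first option to the last on a 3-point scale gives the maximum distance of 2.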
Article 459
Title@2025-06-21 (6): SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Title: SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | SRPO: Verbesserung der multimodalen LLM-Reasoning durch Reflection-Aware-Verstärkung | SRPO: 通过反射-软件强化学习,加强多式LLM 2506.01713v2 |
Authors (14): Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
Article 460
Title@2025-06-21 (6): LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning
Title: LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning | LLM-getriebene medizinische Report Generierung über kommunikationseffizientes Heterogenes Federated Learning | LLM 驱动的通过通信效率高的异质联邦学习编写医学报告 2506.17562v1 |
Authors (6): Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen
LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
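The communication saving from low-rank factorization can be made concrete with simple arithmetic. The sketch below assumes a weight update of shape d_out x d_in is sent as two factors of shapes d_out x r and r x d_in; the abstract does not specify FedMRG's exact factorization, and the layer size and rank here are hypothetical.

```python
def full_update_size(d_out, d_in):
    """Number of values sent when the dense update matrix is transmitted."""
    return d_out * d_in

def low_rank_update_size(d_out, d_in, rank):
    """Number of values sent when the update is factorized as
    (d_out x rank) @ (rank x d_in) and only the two factors are transmitted."""
    return rank * (d_out + d_in)

# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
dense = full_update_size(d_out, d_in)
factored = low_rank_update_size(d_out, d_in, rank)
compression_ratio = dense / factored
```

For these illustrative sizes each client sends 256 times fewer values per layer, which is the kind of reduction that makes tuning feasible under limited bandwidth.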
Article 461
Title@2025-06-21 (6): ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation
Title: ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation | Parammute: Unterdrückende wissenskritische FFNs für treue retrieval-erweiterte Generation | 分量:制止知识-关键FFFF,以用于忠实检索-养殖一代 2502.15543v3 |
Authors (11): Pengcheng Huang, Zhenghao Liu, Yukun Yan, Haiyan Zhao, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All code is available at https://github.com/OpenBMB/ParamMute.
Article 462
Title@2025-06-21 (6): Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception
Title: Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception | Probing for Phonology in Self-Supervised Speech Representations: Eine Fallstudie zur beschleunigten Wahrnehmung | 自我监督演讲代表中的声学研究:关于接受程度的案例研究 2506.17542v1 |
Authors (3): Nitin Venkateswaran, Kevin Tang, Ratree Wayland
Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgments by native speakers of American English. Probing analyses show that accent strength is best predicted by a subset of the segment’s pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.
Article 463
Title@2025-06-21 (6): DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
Title: DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning | DuaShepherd: Integration von Schrittweiser Korrektheit und potenziellen Belohnungen für mathematische Vernunft | DuaShepherd: 整合数学理由的逐步纠正和潜在奖励 2506.17533v1 |
Authors (5): Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu
In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
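One way to picture combining the two signals is sketched below. It assumes the compound probability is the product of the per-step correctness and potential probabilities, and that a solution is scored by its weakest step; both choices are assumptions made for illustration, since the abstract does not give the exact formula. The candidate scores are made up.

```python
def compound_score(p_correct, p_potential):
    """Combine the two step-level signals into one probability.
    The product is an assumption, not the paper's stated formula."""
    return p_correct * p_potential

def score_solution(steps):
    """Score a solution by its weakest step (min aggregation is an assumption).
    `steps` is a list of (p_correct, p_potential) pairs, one per step."""
    return min(compound_score(c, p) for c, p in steps)

# Made-up step scores for two candidate solutions.
candidates = {
    "A": [(0.90, 0.80), (0.95, 0.70)],
    "B": [(0.99, 0.30), (0.90, 0.40)],
}
best = max(candidates, key=lambda name: score_solution(candidates[name]))
```

Candidate B has steps that look locally correct but are unlikely to lead to the right answer, so the combined score ranks A higher even though a correctness-only score would not separate them as clearly.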
Article 464
Title@2025-06-21 (6): Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning
Title: Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning | Datenqualitätsfragen in mehrsprachigen Sprachdatensätzen: Der Bedarf an soziolinguistischer Sensibilisierung und proaktiver Sprachplanung | 多语言语言数据集的数据质量问题:社会语言意识和前瞻性语言规划的必要性 2506.17525v1 |
Authors (6): Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, Pavel Golik
Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and VoxPopuli - shows that in some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make these datasets more useful as training and evaluation sets, and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.
Article 465
Title@2025-06-20 (5): Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Title: Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards | Agent-RLVR: Training Software Engineering Agents über Beratung und Umwelt Belohnungen | RLVR: 通过指导和环境奖励培训软件工程代理 2506.11425v2 |
Authors (6): Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx
Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent’s errors and environmental interactions, emulate a teacher’s guidance, enabling the agent to navigate difficult solution spaces and promoting active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.
Article 466
Title@2025-06-20 (5): $L^*LM$: Learning Automata from Examples using Natural Language Oracles
Title: $L^*LM$: Learning Automata from Examples using Natural Language Oracles | $L^*LM$: Automata lernen aus Beispielen mit natürlichen Sprach-Orakeln | $LLM$:从使用自然语言甲骨文的例子中学习自动地图 2402.07051v2 |
Authors (5): Marcell Vazquez-Chanlatte, Karim Elmaaroufi, Stefan J. Witwicki, Matei Zaharia, Sanjit A. Seshia
Expert demonstrations have proven an easy way to indirectly specify complex tasks. Recent algorithms even support extracting unambiguous formal specifications, e.g. deterministic finite automata (DFA), from demonstrations. Unfortunately, these techniques are generally not sample efficient. In this work, we introduce $L^*LM$, an algorithm for learning DFAs from both demonstrations and natural language. Due to the expressivity of natural language, we observe a significant improvement in the data efficiency of learning DFAs from expert demonstrations. Technically, $L^*LM$ leverages large language models to answer membership queries about the underlying task. This is then combined with recent techniques for transforming learning from demonstrations into a sequence of labeled example learning problems. In our experiments, we observe the two modalities complement each other, yielding a powerful few-shot learner.
Article 467
Title@2025-06-20 (5): VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
Title: VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM | VeriLocc: End-to-End Cross-Architektur Register Zuordnung über LLM | VeriLocc:通过LLM进行端至端跨建筑登记册分配 2506.17506v1 |
Authors (4): Lesheng Jin, Zhenyuan Ruan, Haohui Mai, Jingbo Shang
Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization and a verifier-guided regeneration loop to ensure correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
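A verifier-guided regeneration loop of the kind mentioned in this abstract can be sketched generically. Everything below is a toy under stated assumptions: the generator is a canned list of proposals standing in for an LLM, the verifier only checks that interfering virtual registers do not share a physical register, and the names `regenerate_until_valid` and `verify` are hypothetical rather than VeriLocc's API.

```python
def verify(assignment, interference):
    """Toy check: two virtual registers that are live at the same time
    must not be mapped to the same physical register."""
    for a, b in interference:
        if assignment[a] == assignment[b]:
            return False, f"{a} and {b} interfere but both use {assignment[a]}"
    return True, None

def regenerate_until_valid(generate, check, max_attempts):
    """Ask the generator for candidates until one passes the checker.
    The checker's error message is fed back into the next generation call."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, attempt
    return None, max_attempts

# Canned proposals standing in for model outputs (hypothetical).
proposals = iter([
    {"v0": "r0", "v1": "r0", "v2": "r1"},  # invalid: v0 and v1 clash
    {"v0": "r0", "v1": "r1", "v2": "r0"},  # valid
])
interference = [("v0", "v1"), ("v1", "v2")]

result, attempts = regenerate_until_valid(
    generate=lambda feedback: next(proposals),
    check=lambda a: verify(a, interference),
    max_attempts=5,
)
```

Because every returned assignment has passed the checker, correctness does not depend on the generator being right on the first try; the attempt budget plays the role of the k in pass@k.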
Article 468
Title@2025-06-20 (5): When can isotropy help adapt LLMs’ next word prediction to numerical domains?
Title: When can isotropy help adapt LLMs’ next word prediction to numerical domains? | Wann kann Isotropie helfen, die nächste Wortvorhersage von LLMs an numerische Domänen anzupassen? | 何时才能帮助LLMS的下一个字词预测适应数字域? 2505.17135v4 |
Authors (4): Rashed Shelim, Shengzhe Xu, Walid Saad, Naren Ramakrishnan
Vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains such as time series forecasting. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black box behind the LLM and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numerical downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, a log-linear model for LLMs is considered in which numerical data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). For this model, it is demonstrated that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. By formulating a gradient structure of self-attention in pre-trained models, it is shown how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numerical data and model architectures have different impacts on isotropy, and this variability directly affects performance.
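The shift-invariance of softmax referred to above is easy to check numerically: adding the same constant to every logit leaves the output distribution unchanged, so the logits are only identified up to an additive shift. The sketch below demonstrates just that property; the logit values are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

base = softmax([1.0, 2.0, 3.0])
shifted = softmax([101.0, 102.0, 103.0])  # every logit shifted by +100

# The two distributions agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(base, shifted))
```

This is why a model that predicts through a softmax head cannot, on its own, distinguish hidden representations that differ only by a direction producing a constant logit shift.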
Article 469
Title@2025-06-20 (5): Language Models Grow Less Humanlike beyond Phase Transition
Title: Language Models Grow Less Humanlike beyond Phase Transition | Sprachenmodelle wachsen weniger menschlich jenseits des Phasenübergangs | 超越阶段过渡后,语言模式人文化程度逐渐降低 2502.18802v2 |
Authors (2): Tatsuya Aoyama, Ethan Wilcox
LMs’ alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs’ pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.
Article 470
Title@2025-06-20 (5): Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems
Title: Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems | Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems | 理解大语言模型对书写和信息生态系统的影响的计算方法 2506.17467v1 |
Authors (1): Weixin Liang
Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. Finally, I investigate LLMs’ capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.
Article 471
Title@2025-06-20 (5): Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages
Title: Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages | Überbrückung des Transkriptions-Bottlenecks: Feinabstimmungs-ASR-Modelle für extrem ressourcenarme Feldarbeitssprachen | 打破剪裁瓶颈:极低资源外地工作语言的微调 ASR 模型 2506.17459v1 |
Authors (2): Siyu Liang, Gina-Anne Levow
Automatic Speech Recognition (ASR) has reached impressive accuracy for high-resource languages, yet its utility in linguistic fieldwork remains limited. Recordings collected in fieldwork contexts present unique challenges, including spontaneous speech, environmental noise, and severely constrained datasets from under-documented languages. In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages while controlling for training data duration. Our findings show that MMS is best suited when extremely small amounts of training data are available, whereas XLS-R reaches parity once training data exceed one hour. We provide linguistically grounded analysis to offer further insights towards practical guidelines for field linguists, highlighting reproducible ASR adaptation approaches to mitigate the transcription bottleneck in language documentation.
Article 472
Title@2025-06-20 (5): Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
Title: Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Richtgradientenprojektion für robuste Feinsteuerung von Fundamentmodellen | 基金会模型硬性精美调整方向梯度预测 2502.15895v2 |
Authors (5): Chengyue Huang, Junjiao Tian, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting the current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we generalize this area to the multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks and further categorize ten out-of-distribution (OOD) VQA datasets by distribution shift types and degree (i.e. near versus far OOD). Experimental results show that DiGraP consistently outperforms existing baselines across Image Classification and VQA tasks with discriminative and generative backbones, improving both in-distribution (ID) generalization and OOD robustness.
Article 473
Title@2025-06-20 (5): FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Title: FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | FRAMES-VQA: Benchmarking Fine-Tuning Robustheit über Multi-Modal Shifts in der visuellen Fragestellung | FRAMES-VQA:确定视觉问题解答中多模式变化的精确调整强度基准 2505.21755v2 |
Authors (4): Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
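The abstract quantifies distribution shift with the Mahalanobis distance between embeddings. The sketch below shows the distance itself on a two-dimensional toy example in pure Python; the reference points, the function names, and the restriction to 2D are assumptions made for illustration, and the benchmark's actual embeddings and pipeline are not reproduced here.

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and (unbiased) covariance of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """sqrt((x - mean)^T Sigma^{-1} (x - mean)) for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical "in-distribution" reference embeddings (toy values).
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(reference)

near = mahalanobis_2d((1.5, 1.0), mean, cov)   # close to the reference mean
far = mahalanobis_2d((3.0, 1.0), mean, cov)    # further from the reference mean
print(near < far)
```

A larger distance from the reference distribution corresponds to a larger shift, which is one way to separate near from far OOD data. For real, high-dimensional embeddings a linear algebra library would replace the hand-written 2x2 inverse.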
Article 474
Title@2025-06-20 (5): Beyond the Link: Assessing LLMs’ ability to Classify Political Content across Global Media
Title: Beyond the Link: Assessing LLMs’ ability to Classify Political Content across Global Media | Beyond the Link: Bewertung der Fähigkeit von LLMs, politische Inhalte in globalen Medien zu klassifizieren | 超越链接:评估LLMs在全球媒体对政治内容进行分类的能力 2506.17435v1 |
Authors (4): Alberto Martinez-Serra, Alejandro De La Fuente, Nienke Viescher, Ana S. Cardenal
The use of large language models (LLMs) is becoming common in the context of political science, particularly in studies that analyse individuals’ use of digital media. However, while previous research has demonstrated LLMs’ ability at labelling tasks, the effectiveness of using LLMs to classify political content (PC) from just URLs is not yet well explored. The work presented in this article bridges this gap by evaluating whether LLMs can accurately identify PC vs. non-PC from both the article text and the URLs from five countries (France, Germany, Spain, the UK, and the US) and different languages. Using cutting-edge LLMs like GPT, Llama, Mistral, Deepseek, Qwen and Gemma, we measure model performance to assess whether URL-level analysis can be a good approximation for full-text analysis of PC, even across different linguistic and national contexts. Model outputs are compared with human-labelled articles, as well as traditional supervised machine learning techniques, to set a baseline of performance. Overall, our findings suggest the capacity of URLs to embed most of the news content, providing a vital perspective on accuracy-cost balancing. We also account for contextual limitations and suggest methodological recommendations for using LLMs within political science studies.
Article 475
Title@2025-06-20 (5): UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
Title: UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making | UProp: Untersuchung der Unsicherheitsausbreitung von LLMs in mehrstufiger agentischer Entscheidungsfindung | UPROP:调查多级制剂决策中LLMs的不确定性传播情况 2506.17419v1 |
Authors (6): Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Code will be available at https://github.com/jinhaoduan/UProp.
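The abstract relies on the relation between Mutual Information and Pointwise Mutual Information: MI is the expectation of PMI under the joint distribution. The sketch below illustrates only that identity on toy discrete distributions. It is not the UProp estimator; the function names and the joint tables are assumptions made for illustration.

```python
import math

def pmi(joint, x, y):
    """Pointwise mutual information log(p(x, y) / (p(x) p(y))) in nats."""
    px = sum(p for (a, _), p in joint.items() if a == x)
    py = sum(p for (_, b), p in joint.items() if b == y)
    return math.log(joint[(x, y)] / (px * py))

def mutual_information(joint):
    """MI as the expectation of PMI under the joint distribution."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)

# Toy joint distributions over (previous decision, current decision).
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

When the current decision is independent of the preceding one, the MI is zero and nothing is inherited; when it is fully determined by the preceding one, the MI reaches its maximum for this toy case, log 2.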
Article 476
Title@2025-06-20 (5): Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study
Title: Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study | LLMs zur Bewertung von Tutorenbewegungen in Real-Life-Dialogen zu nutzen: Eine Machbarkeitsstudie | 利用LLMs来评估现实生活对话中的导师动向:可行性研究 2506.17410v1 |
Authors (9): Danielle R. Thomas, Conrad Borchers, Jionghao Lin, Sanjit Kakarla, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Ralph Abboud, Kenneth R. Koedinger
Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. The present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors’ application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, tutors providing praise to students (94-98% accuracy) and a student making a math error (82-88% accuracy), and effectively evaluated the tutors’ adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.
Article 477
Title@2025-06-20 (5): Exploring the Potential of Encoder-free Architectures in 3D LMMs
Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs | Erforschung des Potenzials von encoderfreien Architekturen in 3D-LMMs | 探索3DLMMs中无编码器建筑的潜力 2502.09620v3 |
Authors (11): Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
Article 478
Title@2025-06-20 (5): AQA-Bench: An Interactive Benchmark for Evaluating LLMs’ Sequential Reasoning Ability
Title: AQA-Bench: An Interactive Benchmark for Evaluating LLMs’ Sequential Reasoning Ability | AQA-Bench: Ein interaktiver Benchmark für die Bewertung der sequenziellen Begründungsfähigkeit von LLMs | AQA- “ AQA-区 “ :评估LLLMs按顺序推理能力的互动基准 2402.09404v2 |
Authors (3): Siwei Yang, Bingchen Zhao, Cihang Xie
This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of our evaluation benchmark lies in its interactive evaluation protocol - for example, in DFS, the availability of each node’s connected edge is contingent upon the model’s traversal to that node, thereby necessitating the LLM’s ability to effectively remember visited nodes and strategize subsequent moves considering the possible environmental feedback in the future steps. We comprehensively build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search, and evaluate the sequential reasoning ability of 14 different LLMs. Our investigations reveal several interesting findings: (1) Closed-source models like GPT-4 and Gemini generally show much stronger sequential reasoning ability, significantly outperforming open-source LLMs. (2) Naively providing in-context examples may inadvertently hurt few-shot performance in an interactive environment due to over-fitting to examples. (3) Instead of using optimal steps from another test case as the in-context example, a very limited number of predecessor steps in the current test case following the optimal policy can substantially boost small models’ performance. (4) The performance gap between weak models and strong models is largely due to the inability of weak models to start well. (5) The scaling correlation between performance and model size is not always significant, sometimes even showcasing an inverse trend. We hope our study can catalyze future work on advancing the understanding and enhancement of LLMs’ capabilities in sequential reasoning. The code is available at https://github.com/UCSC-VLAA/AQA-Bench.
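The interactive protocol described above, where information is revealed only in response to the model's own moves, can be illustrated with the binary search setting. The sketch below is a hypothetical environment, not the AQA-Bench code: the class `GuessEnv`, its feedback strings, and the reference policy are assumptions made for illustration.

```python
class GuessEnv:
    """Hypothetical interactive environment: the hidden target is only
    revealed through feedback on each guess."""

    def __init__(self, target, low, high):
        self.target, self.low, self.high = target, low, high
        self.steps = 0  # number of guesses made so far

    def step(self, guess):
        self.steps += 1
        if guess == self.target:
            return "correct"
        return "higher" if guess < self.target else "lower"

def binary_search_policy(env):
    """Optimal reference policy: halve the remaining range on every step."""
    low, high = env.low, env.high
    while low <= high:
        mid = (low + high) // 2
        feedback = env.step(mid)
        if feedback == "correct":
            return mid
        if feedback == "higher":
            low = mid + 1
        else:
            high = mid - 1
    raise ValueError("target outside the declared range")

env = GuessEnv(target=37, low=1, high=100)
found = binary_search_policy(env)
print(found, env.steps)
```

An evaluated model would take the place of `binary_search_policy`, and its guesses could then be compared step by step against the optimal policy, since each piece of feedback depends on the guesses that came before it.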
Article 479
Title@2025-06-20 (5): Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
Title: Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency | Feintuning senkt die Sicherheit und beeinträchtigt die Bewertungskonsistenz | 安全性和干扰性评价一致性 2506.17209v1 |
Authors (4): Kathleen C. Fraser, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko
Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the “attack”. Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental procedure, and the stochastic nature of LLMs. Our initial experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup. Our observations have serious implications for how researchers in this field should report results to enable meaningful comparisons in the future.
Article 480
Title@2025-06-20 (5): Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
Title: Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems | Desecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems | SWE-区领导板拆解:LLM-和代理修理系统的分析提交者和结构 2506.17208v1 |
Authors (2): Matias Martinez, Xavier Franch
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
Article 481
Title@2025-06-20 (5): High-Dimensional Interlingual Representations of Large Language Models
Title: High-Dimensional Interlingual Representations of Large Language Models | Hochdimensionale interlinguale Darstellungen großer Sprachmodelle | 大语言模式的多种语文间高语言代表 2503.11280v4 |
Authors (4): Bryan Wilie, Samuel Cahyawijaya, Junxian He, Pascale Fung
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs: a shared subspace in the representation space. However, evidence regarding this phenomenon is mixed, leaving it unclear whether these models truly develop unified interlingual representations or present only partially aligned constructs. We explore 31 diverse languages varying in their resource levels, typologies, and geographical regions, and find that multilingual LLMs exhibit inconsistent cross-lingual alignments. To address this, we propose an interlingual representation framework identifying both the shared interlingual semantic subspace and fragmented components that exist due to representational limitations. We introduce the Interlingual Local Overlap (ILO) score to quantify interlingual alignment by comparing the local neighborhood structures of high-dimensional representations. We utilize ILO to investigate the impact of single-language fine-tuning on the interlingual representations in multilingual LLMs. Our results indicate that training exclusively on a single language disrupts the alignment in early layers, while freezing these layers preserves the alignment of interlingual representations, leading to improved cross-lingual generalization. These results validate our framework and metric for evaluating interlingual representation, and further underscore that interlingual alignment is crucial for scalable multilingual learning.
Article 482
Title@2025-06-20 (5): Towards AI Search Paradigm
Title: Towards AI Search Paradigm | Auf dem Weg zur KI-Suche Paradigma | 走向 AI 搜索范式 2506.17188v1 |
Authors (21): Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen, Jun Yang, Haojie Zhang, Jiayi Li, Jiayi Wu, Yiqun Chen, Changle Qu, Keyi Kong, Wenwen Ye, Lixin Su, Xinyu Ma, Long Xia, Daiting Shi, Jiashu Zhao, Haoyi Xiong, Shuaiqiang Wang, Dawei Yin
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.
Article 483
Title@2025-06-20 (5): CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models
Title: CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models | CLEAR-3K: Bewertung von Kausalerklärbarkeiten in Sprachmodellen | CLEAR-3K:评估语文模式中的原因解释能力 2506.17180v1 |
Authors (3): Naiming Liu, Richard Baraniuk, Shashank Sonkar
We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. Hence, CLEAR-3K provides a crucial benchmark for developing and evaluating genuine causal reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
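The Matthews Correlation Coefficient reported above is computed from the four cells of a binary confusion matrix. The sketch below implements the standard formula; the confusion counts are made up for illustration and are not taken from the paper, and returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for binary classification."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention: undefined cases are scored as 0
    return (tp * tn - fp * fn) / denom

# Illustrative counts on a balanced set of 100 questions (not from the paper).
perfect = mcc(tp=50, tn=50, fp=0, fn=0)      # every answer correct
permissive = mcc(tp=50, tn=0, fp=50, fn=0)   # accepts every causal claim
skeptical = mcc(tp=0, tn=50, fp=0, fn=50)    # rejects every causal claim
print(perfect, permissive, skeptical)
```

A model that accepts every claim and a model that rejects every claim both score 0 under this convention, which is why the metric suits the skeptical-to-permissive shift the abstract describes: moving from one extreme to the other does not by itself raise the score.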
Article 484
Title@2025-06-20 (5): TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Title: TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models | TALE: Ein Tool-Augmented Framework zur referenzfreien Bewertung großer Sprachmodelle | TALE:无参考资料大语言模式评估工具增强框架 2504.07385v2 |
Authors (3): Sher Badshah, Ali Emami, Hassan Sajjad
As Large Language Models (LLMs) become increasingly integrated into real-world, autonomous applications, relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare to fixed references or depend solely on LLM-as-a-judge knowledge, TALE employs an agent with tool-access capabilities that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By shifting away from static references, TALE aligns with free-form question-answering tasks common in real-world scenarios. Experimental results on multiple free-form QA benchmarks show that TALE not only outperforms standard reference-based metrics for measuring response accuracy but also achieves substantial to near-perfect agreement with human evaluations. TALE enhances the reliability of LLM evaluations in real-world, dynamic scenarios without relying on static references.
Article 485
Title@2025-06-20 (5): LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning
Title: LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning | LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning | LaRS: 研究连锁理据的后备理据技能 2312.04684v4 |
Authors (6): Zifan Xu, Haozhu Wang, Dmitriy Bespalov, Xian Wu, Peter Stone, Yanjun Qi
Chain-of-thought (CoT) prompting is a popular in-context learning (ICL) approach for large language models (LLMs), especially when tackling complex reasoning tasks. Traditional ICL approaches construct prompts using examples that contain questions similar to the input question. However, CoT prompting, which includes crucial intermediate reasoning steps (rationales) within its examples, necessitates selecting examples based on these rationales rather than the questions themselves. Existing methods require human experts or pre-trained LLMs to describe the skill, a high-level abstraction of rationales, to guide the selection. These methods, however, are often costly and difficult to scale. Instead, this paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales, with a latent variable called a reasoning skill. Concurrently, LaRS learns a reasoning policy to determine the required reasoning skill for a given question. Then the ICL examples are selected by aligning the reasoning skills between past examples and the question. This approach is theoretically grounded and compute-efficient, eliminating the need for auxiliary LLM inference or manual prompt design. Empirical results demonstrate that LaRS consistently outperforms SOTA skill-based selection methods, processing example banks four times faster, reducing LLM inferences during the selection stage by half, and showing greater robustness to sub-optimal example banks.
Article 486
Title@2025-06-20 (5): Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Title: Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM | Zuschauen und hören: Audio-Visual-Sprechmomente mit multimodaler LLM verstehen | 观看和收听: 了解多式LM的视听话语道 2505.18110v2 |
Authors (8): Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke
Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like “A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding” requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense’s multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
Article 487
Title@2025-06-20 (5): Watermarking Language Models through Language Models
Title: Watermarking Language Models through Language Models | Wasserzeichen von Sprachmodellen durch Sprachmodelle | 通过语言模型建立语言模型 2411.05091v2 |
Authors (3): Agnibh Dasgupta, Abdullah Tanvir, Xin Zhong
Watermarking the outputs of large language models (LLMs) is critical for provenance tracing, content regulation, and model accountability. Existing approaches often rely on access to model internals or are constrained by static rules and token-level perturbations. Moreover, the idea of steering generative behavior via prompt-based instruction control remains largely underexplored. We introduce a prompt-guided watermarking framework that operates entirely at the input level and requires no access to model parameters or decoding logits. The framework comprises three cooperating components: a Prompting LM that synthesizes watermarking instructions from user prompts, a Marking LM that generates watermarked outputs conditioned on these instructions, and a Detecting LM trained to classify whether a response carries an embedded watermark. This modular design enables dynamic watermarking that adapts to individual prompts while remaining compatible with diverse LLM architectures, including both proprietary and open-weight models. We evaluate the framework over 25 combinations of Prompting and Marking LMs, such as GPT-4o, Mistral, LLaMA3, and DeepSeek. Experimental results show that watermark signals generalize across architectures and remain robust under fine-tuning, model distillation, and prompt-based adversarial attacks, demonstrating the effectiveness and robustness of the proposed approach.
Article 488
Title@2025-06-20 (5): Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement
Title: Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Kalibrierung vortrainierter Sprachklassifikatoren auf LLM-generierten Noisy-Labels über iterative Veredelung | 通过迭代精炼校准LLM产生的噪音标签上的训练前语言分类校准 2505.19675v2 |
Authors (4): Liqin Ye, Agam Shah, Chao Zhang, Sudheer Chava
The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model’s generalization is likely to be harmed as it is prone to overfitting to the label noise. While previous studies in learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP (Simplex Label Diffusion with Dynamic Prior) to calibrate the classifier’s prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on GitHub.
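The candidate-retrieval step mentioned in the abstract (label candidates from the neighborhood label distribution in embedding space) might look roughly like the sketch below. This is an illustration only, not the authors' code: the Euclidean distance, the values of `k` and `top`, and the toy embeddings and labels are all assumptions, and the diffusion-based refinement step is not shown.

```python
import math
from collections import Counter

def label_candidates(embeddings, noisy_labels, index, k=3, top=2):
    """Return up to `top` (label, share) pairs from the k nearest neighbors.

    Assumption: neighbors are found by Euclidean distance; the abstract
    does not say which distance or which k is used.
    """
    query = embeddings[index]
    others = [i for i in range(len(embeddings)) if i != index]
    others.sort(key=lambda i: math.dist(query, embeddings[i]))
    counts = Counter(noisy_labels[i] for i in others[:k])
    total = sum(counts.values())
    return [(label, n / total) for label, n in counts.most_common(top)]

# Hypothetical toy data: sample 3 sits in the "pos" cluster but carries a noisy "neg" label.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
toy_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
candidates = label_candidates(toy_embeddings, toy_labels, index=3)
```

In this toy case the three nearest neighbors of sample 3 all carry the label "pos", so "pos" is proposed as the likely true label.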
Article 489
Title@2025-06-20 (5): Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Title: Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? | Cache mich, wenn Sie können: Wie viele KVs benötigen Sie für effektive Lang-Kontext LMs? | 如果可以的话, 缓存我 : 您需要多少 KV 才能有效长文本 LM ? 2506.17121v1 |
Authors (5): Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen
Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the KV footprint as a unified metric, which accounts for both the number of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods – post-fill eviction – has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to recency eviction methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
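One plausible reading of a metric that counts both how many KV entries are stored and how long they stay in memory is sketched below. The normalization against a full, never-evicted cache and the toy step counts are assumptions made for illustration; the paper's exact definition may differ.

```python
def kv_footprint(kept_per_step, full_per_step):
    """Entry-steps held in memory, relative to a cache that never evicts.

    Assumption: each list gives the number of KV entries resident at each
    processing step (pre-filling and decoding alike).
    """
    if len(kept_per_step) != len(full_per_step):
        raise ValueError("both traces must cover the same steps")
    full_total = sum(full_per_step)
    if full_total == 0:
        raise ValueError("full cache trace is empty")
    return sum(kept_per_step) / full_total

def peak_fraction(kept_per_step, full_per_step):
    """Peak resident entries relative to the peak of the full cache."""
    return max(kept_per_step) / max(full_per_step)

# Hypothetical 8-step trace: 6 pre-fill steps followed by 2 decoding steps.
full = list(range(1, 9))                   # full cache grows by one entry per step
post_fill = [1, 2, 3, 4, 5, 6, 2, 2]       # evict down to 2 entries only after pre-filling
prefill_evict = [1, 2, 2, 2, 2, 2, 2, 2]   # keep a budget of 2 entries during pre-filling too
```

Under these toy numbers, evicting only after pre-filling still holds 25 of 36 entry-steps and peaks at 6 of 8 entries, while evicting during pre-filling holds 15 of 36 and peaks at 2 of 8, which matches the abstract's point about the high footprint of post-fill eviction.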
Article 490
Title@2025-06-20 (5): MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Title: MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation | MEXA: Auf dem Weg zu einer allgemeinen multimodalen Vernunft mit dynamischer Multi-Expert-Aggregation | MEXA:争取采用动态多专家聚合的通用多模式理由 2506.17113v1 |
Authors (5): Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
Article 491
Title@2025-06-20 (5): Are Bias Evaluation Methods Biased ?
Title: Are Bias Evaluation Methods Biased ? | Sind Bias Evaluation Methoden Biased ? | 是否对评估方法有偏见? 2506.17111v1 |
Authors (4): Lina Berrayana, Sean Rooney, Luis Garcés-Erice, Ioana Giurgiu
The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared for different aspects of safety such as toxicity, bias, and harmful behavior. Independent benchmarks adopt different approaches with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community in the usage of such benchmarks.
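To make "how similar are the overall rankings" concrete, one common choice is a rank correlation such as Kendall's tau. The abstract does not say which similarity measure the authors use, so the sketch below is an assumption, and the model names and rankings are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau (tau-a) between two rankings given as {model: rank} dicts.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Assumes both dicts rank the same set of at least two models.
    """
    models = list(rank_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical rankings of four models under two bias benchmarks (1 = least biased).
benchmark_a = {"model_1": 1, "model_2": 2, "model_3": 3, "model_4": 4}
benchmark_b = {"model_1": 4, "model_2": 3, "model_3": 2, "model_4": 1}
agreement = kendall_tau(benchmark_a, benchmark_b)
```

A value near 1 would mean the two benchmarks order the models alike; values near 0 or below would indicate the kind of disparity the abstract reports.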
Article 492
Title@2025-06-20 (5): Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
Title: Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving | Auf dem Weg zu einer fortgeschrittenen mathematischen Begründung für LLMs über Logic Theorem-Proving erster Ordnung | 争取通过一阶逻辑理论验证,为LLMs提供高级数学理由 2506.17104v1 |
Authors (10): Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B’s low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs’ generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs’ mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
Article 493
Title@2025-06-20 (5): Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
Title: Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation | Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung | 引导大语言模型中传译锥体:经验评价 2506.17088v1 |
Authors (8): Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.
Article 494
Title@2025-06-20 (5): ScholarSearch: Benchmarking Scholar Searching Ability of LLMs
Title: ScholarSearch: Benchmarking Scholar Searching Ability of LLMs | ScholarSearch: Benchmarking Scholar Suche Fähigkeit von LLMs | 搜索学者:参照基准,学者搜索LLMs的能力 2506.13784v2 |
Authors (8): Junting Zhou, Wang Li, Yiyan Liao, Nengyuan Zhang, Tingjia Miao, Zhihui Qi, Yuhan Wu, Tong Yang
The search capabilities of Large Language Models (LLMs) have garnered significant attention. Existing benchmarks, such as OpenAI’s BrowseComp, primarily focus on general search scenarios and fail to adequately address the specific demands of academic search. These demands include deeper literature tracing and organization, professional support for academic databases, the ability to navigate long-tail academic knowledge, and ensuring academic rigor. Here, we propose ScholarSearch, the first dataset specifically designed to evaluate the complex information retrieval capabilities of LLMs in academic research. ScholarSearch possesses the following key characteristics: Academic Practicality, where question content closely mirrors real academic learning and research environments, avoiding deliberately misleading models; High Difficulty, with answers that are challenging for single models (e.g., Grok DeepSearch or Gemini Deep Research) to provide directly, often requiring at least three deep searches to derive; Concise Evaluation, where limiting conditions ensure answers are as unique as possible, accompanied by clear sources and brief solution explanations, greatly facilitating subsequent audit and verification and addressing the current lack of analyzed search datasets both domestically and internationally; and Broad Coverage, as the dataset spans at least 15 different academic disciplines. Through ScholarSearch, we expect to more precisely measure and promote the performance improvement of LLMs in complex academic information retrieval tasks. The data is available at: https://huggingface.co/datasets/PKU-DS-LAB/ScholarSearch
Article 495
Title@2025-06-20 (5): Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
Title: Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs | Tower+: Überbrückung Allgemeinheit und Übersetzung Spezialisierung in mehrsprachigen LLMs | 塔+:在多语种LMM中连接通俗和翻译专业 2506.17080v1 |
Authors (7): Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
Article 496
Title@2025-06-20 (5): Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025
Title: Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 | Simultane Übersetzung mit Offline-Speech- und LLM-Modellen in CUNI-Einreichung bei IWSLT 2025 | 同时翻译与CUNI提交2025年IWSLT的离线演讲和LLM模型的LLM模型同步翻译 2506.17077v1 |
Authors (2): Dominik Macháček, Peter Polák
This paper describes the Charles University submission to the Simultaneous Speech Translation Task of IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers’ baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. We also propose a new enhanced measure of speech recognition latency.
Article 497
Title@2025-06-20 (5): Contextual modulation of language comprehension in a dynamic neural model of lexical meaning
Title: Contextual modulation of language comprehension in a dynamic neural model of lexical meaning | Kontextuelle Modulation des Sprachverständnisses in einem dynamischen neuronalen Modell der lexikalischen Bedeutung | 以动态的神经模式对语言理解进行上下文调整 2407.14701v2 |
Authors (2): Michael C. Stern, Maria M. Piñango
We propose and computationally implement a dynamic neural model of lexical meaning, and experimentally test its behavioral predictions. We demonstrate the architecture and behavior of the model using as a test case the English lexical item ‘have’, focusing on its polysemous use. In the model, ‘have’ maps to a semantic space defined by two continuous conceptual dimensions, connectedness and control asymmetry, previously proposed to parameterize the conceptual system for language. The mapping is modeled as coupling between a neural node representing the lexical item and neural fields representing the conceptual dimensions. While lexical knowledge is modeled as a stable coupling pattern, real-time lexical meaning retrieval is modeled as the motion of neural activation patterns between metastable states corresponding to semantic interpretations or readings. Model simulations capture two previously reported empirical observations: (1) contextual modulation of lexical semantic interpretation, and (2) individual variation in the magnitude of this modulation. Simulations also generate a novel prediction that the by-trial relationship between sentence reading time and acceptability should be contextually modulated. An experiment combining self-paced reading and acceptability judgments replicates previous results and confirms the new model prediction. Altogether, results support a novel perspective on lexical polysemy: that the many related meanings of a word are metastable neural activation states that arise from the nonlinear dynamics of neural populations governing interpretation on continuous semantic dimensions.
Article 498
Title@2025-06-20 (5): From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Title: From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers | Von Konzepten zu Komponenten: Konzept-agnostische Aufmerksamkeit Modul Entdeckung in Transformatoren | 从概念到组成部分:在变异器中发现概念 – – 不可接受注意模块 2506.17052v1 |
Authors (3): Jingtong Su, Julia Kempe, Karen Ullrich
Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer models. We accomplish this by representing each concept as a vector, calculating its cosine similarity with each attention head, and selecting the TopK-scoring heads to construct the concept-associated attention module. We then propose Scalar Attention Module Intervention (SAMI), a simple strategy to diminish or amplify the effects of a concept by adjusting the attention module using only a single scalar parameter. Empirically, we demonstrate SAMD on concepts of varying complexity, and visualize the locations of their corresponding modules. Our results demonstrate that module locations remain stable before and after LLM post-training, and confirm prior work on the mechanics of LLM multilingualism. Through SAMI, we facilitate jailbreaking on HarmBench (+72.7%) by diminishing “safety” and improve performance on the GSM8K benchmark (+1.6%) by amplifying “reasoning”. Lastly, we highlight the domain-agnostic nature of our approach by suppressing the image classification accuracy of vision transformers on ImageNet.
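The two steps described in the abstract (score each attention head by cosine similarity with a concept vector and keep the top-K heads, then rescale the selected heads by one scalar) can be sketched as follows. This is a toy illustration, not the authors' implementation: how a head is summarized as a vector is not stated in the abstract, so the per-head vectors, the `(layer, head)` keys, and all numbers here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def discover_module(concept, head_vectors, k):
    """Return the k head ids whose vectors are most similar to the concept vector."""
    scores = {head: cosine(concept, vec) for head, vec in head_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def intervene(head_outputs, module, scale):
    """Multiply the outputs of heads in `module` by `scale`; leave the rest unchanged.

    scale < 1 diminishes the concept, scale > 1 amplifies it.
    """
    return {
        head: ([scale * x for x in out] if head in module else list(out))
        for head, out in head_outputs.items()
    }

# Hypothetical (layer, head) -> summary vector mapping and concept vector.
concept_vector = [1.0, 0.0]
heads = {(0, 0): [0.9, 0.1], (0, 1): [0.0, 1.0], (1, 0): [1.0, 1.0]}
module = discover_module(concept_vector, heads, k=2)
suppressed = intervene(heads, module, scale=0.0)
```

Here the heads `(0, 0)` and `(1, 0)` are selected because their vectors point closest to the concept direction, and setting `scale=0.0` zeroes their outputs while leaving `(0, 1)` untouched.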
Article 499
Title@2025-06-20 (5): Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models
Title: Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models | Geopolitische Voreingenommenheiten in LLMs: Was sind die “guten” und die “schlechten” Länder nach zeitgenössischen Sprachmodellen | LLMM中的地缘政治偏见:根据当代语言模式,什么是“好”和“坏”国家? 2506.06751v2 |
Authors (10): Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Natalia Loukachevitch, Alexander Panchenko, Elena Tutubalina
This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models’ sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.
Article 500
Title@2025-06-20 (5): MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
Title: MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models | MUCAR: Multilinguale Cross-Modal Ambiguity Auflösung für multimodale große Sprachmodelle Benchmarking | MUCAR:为多模式大语言模型制定多语言跨模式的多语种和多模式模糊分辨率基准 2506.17046v1 |
Authors (11): Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models, encompassing both open-source and proprietary architectures, reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
Article 501
Title@2025-06-20 (5): COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework
Title: COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework | COS-DPO: Bedingtes eins-shot Multi-Objective Fine-Tuning Framework | COS-DPO: 有条件的单片多目标微调框架 2410.08316v3 |
Authors (5): Yinuo Ren, Tesi Xiao, Michael Shavlovsky, Lexing Ying, Holakou Rahmanian
In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e., fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose a Conditioned One-Shot fine-tuning framework (COS-DPO) that extends the Direct Preference Optimization technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By direct conditioning on the weight across auxiliary objectives, our Weight-COS-DPO method enjoys an efficient one-shot training process for profiling the Pareto front and is capable of achieving comprehensive trade-off solutions even in the post-training stage. Based on our theoretical findings on the linear transformation properties of the loss function, we further propose the Temperature-COS-DPO method that augments the temperature parameter to the model input, enhancing the flexibility of post-training control over the trade-offs between the main and auxiliary objectives. We demonstrate the effectiveness and efficiency of the COS-DPO framework through its applications to various tasks, including the Learning-to-Rank (LTR) and LLM alignment tasks, highlighting its viability for large-scale ML deployments.
Article 502
Title@2025-06-20 (5): Principles of semantic and functional efficiency in grammatical patterning
Title: Principles of semantic and functional efficiency in grammatical patterning | Grundsätze der semantischen und funktionalen Effizienz bei der grammatischen Musterung | 语义和功能效率原则 2410.15865v2 |
Authors (2): Emily Cheng, Francesca Franzon
Grammatical features such as number and gender serve two central functions in human languages. While they encode salient semantic attributes like numerosity and animacy, they also offload sentence processing cost by predictably linking words together via grammatical agreement. Grammars exhibit consistent organizational patterns across diverse languages, invariably rooted in a semantic foundation, a widely confirmed but still theoretically unexplained phenomenon. To explain the basis of universal grammatical patterns, we unify two fundamental properties of grammar, semantic encoding and agreement-based predictability, into a single information-theoretic objective under cognitive constraints, accounting for variable communicative need. Our analyses reveal that grammatical organization provably inherits from perceptual attributes, and our measurements on a diverse language sample show that grammars prioritize functional goals, promoting efficient language processing over semantic encoding.
Article 503
Title@2025-06-20 (5): Cash or Comfort? How LLMs Value Your Inconvenience
Title: Cash or Comfort? How LLMs Value Your Inconvenience | Bargeld oder Komfort? Wie LLMs Wert Ihre Unannehmlichkeit | 现金还是安慰? 2506.17367v1 |
Authors (6): Mateusz Cedro, Timour Ichmoukhamedov, Sofie Goethals, Yifan He, James Hinns, David Martens
Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behaviour of AI assistants in scenarios where financial rewards are at odds with user comfort has not yet been thoroughly explored. In this paper, we tackle this problem by quantifying the prices assigned by multiple LLMs to a series of user discomforts: additional walking, waiting, hunger and pain. We uncover several key concerns that strongly question the prospect of using current LLMs as decision-making assistants: (1) a large variance in responses between LLMs, (2) within a single LLM, responses show fragility to minor variations in prompt phrasing (e.g., reformulating the question in the first person can considerably alter the decision), (3) LLMs can accept unreasonably low rewards for major inconveniences (e.g., 1 Euro to wait 10 hours), and (4) LLMs can reject monetary gains where no discomfort is imposed (e.g., 1,000 Euro to wait 0 minutes). These findings emphasize the need for scrutiny of how LLMs value human inconvenience, particularly as we move toward applications where such cash-versus-comfort trade-offs are made on users’ behalf.
Article 504
Title@2025-06-20 (5): Incivility and Rigidity: The Risks of Fine-Tuning LLMs for Political Argumentation
Title: Incivility and Rigidity: The Risks of Fine-Tuning LLMs for Political Argumentation | Beweglichkeit und Starrheit: Die Risiken von Feinsteuerungs-LLMs für politische Argumentation | 不文明和僵硬:政治辩论的微调LMLMs 2411.16813v3 |
Authors (2): Svetlana Churina, Kokil Jaidka
The incivility prevalent on platforms like Twitter (now X) and Reddit poses a challenge for developing AI systems that can support productive and rhetorically sound political argumentation. In this study, we report experiments with GPT-3.5 Turbo, fine-tuned on two contrasting datasets of political discussions: high-variance, high-incivility Twitter replies to U.S. Congress, and low-variance, low-incivility posts from Reddit’s r/ChangeMyView. We systematically evaluate how these data sources and prompting strategies shape the rhetorical framing and deliberative quality of model-generated arguments. Our results show that Reddit-finetuned models produce safer but rhetorically rigid arguments, while cross-platform fine-tuning amplifies toxicity. Prompting reduces specific toxic behaviors, such as personal attacks, but fails to fully mitigate the influence of high-incivility training data. We introduce and validate a rhetorical evaluation rubric and provide practical guidelines for deploying LLMs in content authoring, moderation, and deliberation support.
Article 505
Title@2025-06-20 (5): ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Title: ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization | ErsetzenMe: Netzwerkvereinfachung durch Tiefenkorrektur und Transformer Block Linearisierung | 替换Me:通过深度推进和变换器块条线化简化网络 2505.02819v3 |
Authors (7): Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
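As a rough illustration of estimating a linear map from a small calibration set, the sketch below fits a matrix `T` so that `x @ T` approximates the output of the blocks to be removed. Ordinary least squares via the normal equations is an assumption here, since the abstract does not state the objective, and the calibration activations are made-up toy values.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def solve(a, b):
    """Solve a @ x = b for square a by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; more calibration samples are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_linear_replacement(x, y):
    """Least-squares T minimizing ||x @ T - y||^2 (assumed objective)."""
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, y))

# Hypothetical calibration data: x = activations entering the pruned blocks,
# y = activations leaving them (here generated by an exactly linear map).
x_calib = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_calib = [[2.0, 0.0], [1.0, 3.0], [3.0, 3.0]]
t_hat = fit_linear_replacement(x_calib, y_calib)
```

Because the fitted `T` is just a matrix, it can in principle be multiplied into an adjacent linear layer's weights, which is consistent with the abstract's claim that no extra parameters are needed.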
Article 506
Title@2025-06-20 (5): Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning
Title: Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning | Instituto de Telecomunicações auf der IWSLT 2025: Ausrichtung von Sprach- und Sprachmodellen für das Sprach-zu-Text-Lernen | IWSLT 2025年国际电信电信研究所:调和小型话语和语音到文字学习语言模式 2506.17019v1 |
Authors (4): Giuseppe Attanasio, Sonal Sannigrahi, Ben Peters, André F. T. Martins
This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones (< 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.
Article 507
Title@2025-06-20 (5): Can Large Language Models Replace Human Subjects? A Large-Scale Replication of Scenario-Based Experiments in Psychology and Management
Title: Can Large Language Models Replace Human Subjects? A Large-Scale Replication of Scenario-Based Experiments in Psychology and Management | Können große Sprachmodelle menschliche Subjekte ersetzen? Eine groß angelegte Replikation von szenariobasierten Experimenten in Psychologie und Management | 大语言模型能够取代人类课题吗?在心理学和管理中大规模重复基于设想的实验 2409.00128v3 |
Authors (3): Ziyan Cui, Ning Li, Huaikang Zhou
Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) have shown promise in replicating human-like responses in various psychological experiments. We conducted a large-scale study replicating 156 psychological experiments from top social science journals using three state-of-the-art LLMs (GPT-4, Claude 3.5 Sonnet, and DeepSeek v3). Our results reveal that while LLMs demonstrate high replication rates for main effects (73-81%) and moderate to strong success with interaction effects (46-63%), they consistently produce larger effect sizes than human studies, with Fisher Z values approximately 2-3 times higher than human studies. Notably, LLMs show significantly lower replication rates for studies involving socially sensitive topics such as race, gender and ethics. When original studies reported null findings, LLMs produced significant results at remarkably high rates (68-83%) - while this could reflect cleaner data with less noise, as evidenced by narrower confidence intervals, it also suggests potential risks of effect size overestimation. Our results demonstrate both the promise and challenges of LLMs in psychological research, offering efficient tools for pilot testing and rapid hypothesis validation while enriching rather than replacing traditional human subject studies, yet requiring more nuanced interpretation and human validation for complex social phenomena and culturally sensitive research questions.
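For readers unfamiliar with the Fisher Z scale used in the effect-size comparison, the standard transform of a correlation r is z = atanh(r). The sketch below only illustrates that transform; the two correlation values are made up and are not results from the study.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient r, with -1 < r < 1."""
    return math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))


# Hypothetical effect sizes for one experiment (illustrative values only).
r_human = 0.2
r_llm = 0.5
ratio = fisher_z(r_llm) / fisher_z(r_human)
```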
Article 508
Title@2025-06-20 (5): LLM-Generated Feedback Supports Learning If Learners Choose to Use It
Title: LLM-Generated Feedback Supports Learning If Learners Choose to Use It | LLM-generated Feedback unterstützt Lernen, wenn Lernende wählen, es zu verwenden | 如果学习者选择使用LLM-创用LLM反馈支持学习 2506.17006v1 |
Authors (6): Danielle R. Thomas, Conrad Borchers, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Kenneth R. Koedinger
Large language models (LLMs) are increasingly used to generate feedback, yet their impact on learning remains underexplored, especially compared to existing feedback methods. This study investigates how on-demand LLM-generated explanatory feedback influences learning in seven scenario-based tutor training lessons. Analyzing over 2,600 lesson completions from 885 tutor learners, we compare posttest performance among learners across three groups: learners who received feedback generated by gpt-3.5-turbo, those who declined it, and those without access. All groups received non-LLM corrective feedback. To address potential selection bias (where higher-performing learners may be more inclined to use LLM feedback), we applied propensity scoring. Learners with a higher predicted likelihood of engaging with LLM feedback scored significantly higher at posttest than those with lower propensity. After adjusting for this effect, two out of seven lessons showed statistically significant learning benefits from LLM feedback with standardized effect sizes of 0.28 and 0.33. These moderate effects suggest that the effectiveness of LLM feedback depends on the learners’ tendency to seek support. Importantly, LLM feedback did not significantly increase completion time, and learners overwhelmingly rated it as helpful. These findings highlight LLM feedback’s potential as a low-cost and scalable way to improve learning on open-ended tasks, particularly in existing systems already providing feedback without LLMs. This work contributes open datasets, LLM prompts, and rubrics to support reproducibility.
Article 509
Title@2025-06-20 (5): PersonalAI: Towards digital twins in the graph form
Title: PersonalAI: Towards digital twins in the graph form | PersonalAI: Auf dem Weg zu digitalen Zwillingen in der Grafikform | 个人AAI:走向图示形式的数字双胞胎 2506.17001v1 |
Authors (9): Mikhail Menschikov, Dmitry Evseev, Ruslan Kostoev, Ilya Perepechkin, Ilnaz Salimov, Victoria Dochkina, Petr Anokhin, Evgeny Burnaev, Nikita Semenov
The challenge of personalizing language models, specifically the ability to account for a user’s history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon the ideas of the AriGraph architecture and, for the first time, introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust. Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture’s ability to maintain and utilize temporal dependencies.
Article 510
Title@2025-06-20 (5): Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
Title: Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling | Think&Cite: Verbesserung der zugeschriebenen Textgenerierung durch selbst geführte Baumsuche und Fortschrittsprämienmodellierung | Think&Cite: 改进自导树搜索和进步奖励模型的属性文本生成 2412.14860v2 |
Authors (2): Junyi Li, Hwee Tou Ng
Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search. Specifically, we propose Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the self-reflection capability of LLMs to reason about the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Modeling to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.
Article 511
Title@2025-06-20 (5): TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
Title: TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs | TeXpert: Ein Multi-Level-Benchmark zur Bewertung der LaTeX-Code-Generierung durch LLMs | TeXpert:由LLMs评估LaTeX代码生成的多层次基准 2506.16990v1 |
Authors (2): Sahil Kale, Vijaykant Nadadur
LaTeX’s precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.
Article 512
Title@2025-06-20 (5): SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments
Title: SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments | SHAKTI: Ein 2,5 Milliarden Parameter kleines Sprachmodell optimiert für Edge-KI und Low-Resource-Umgebungen | SHAKTI:为边缘AI和低资源环境优化的2.5亿亿亿分数小语言模型 2410.11331v2 |
Authors (3): Syed Abdul Gaffar Shakhadri, Kruthika KR, Rakshit Aralimatti
We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.
Article 513
Title@2025-06-20 (5): Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
Title: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation | Knapsack Optimization-based Schema Linking für die LLM-basierte Text-zu-SQL-Generierung | Knapsack 基于LLM的基于LLM的文本到SQL生成的基于优化的气相连接 2502.12911v2 |
Authors (7): Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Qing Li, Xiao Huang
Generating SQL from user queries is a long-standing challenge, where the accuracy of initial schema linking significantly impacts subsequent SQL generation performance. However, current schema linking models still struggle with missing relevant schema elements or an excess of redundant ones. A crucial reason for this is that the commonly used metrics, recall and precision, fail to capture missing relevant elements and thus cannot reflect actual schema linking performance. Motivated by this, we propose enhanced schema linking metrics by introducing a restricted missing indicator. Accordingly, we introduce the Knapsack optimization-based Schema Linking Approach (KaSLA), a plug-in schema linking method designed to prevent the omission of relevant schema elements while minimizing the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy that first identifies the optimal table linking and subsequently links columns within the selected table to reduce the linking candidate space. In each linking process, it utilizes a knapsack optimization approach to link potentially relevant elements while accounting for a limited tolerance of potentially redundant ones. With this optimization, KaSLA-1.6B achieves superior schema linking results compared to large-scale LLMs, including deepseek-v3 with the state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider and BIRD benchmarks verify that KaSLA can significantly improve the SQL generation performance of SOTA Text2SQL models by substituting their schema linking processes.
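The knapsack framing can be illustrated with a textbook 0/1 knapsack. This is a sketch under assumptions, not KaSLA's actual formulation: here each candidate schema element gets a hypothetical relevance score (the value) and an integer redundancy cost (the weight), and `capacity` stands in for the limited tolerance of redundant elements. The element names and numbers are invented.

```python
def knapsack_select(candidates, capacity):
    """Pick elements maximizing total relevance with total cost <= capacity.

    candidates: list of (name, relevance, redundancy_cost), costs are non-negative ints.
    Returns (chosen_names, best_total_relevance).
    """
    n = len(candidates)
    # best[i][c] = best relevance using the first i candidates within cost budget c.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, value, cost) in enumerate(candidates, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if cost <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - cost] + value)
    # Walk the table backwards to recover which elements were selected.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, _, cost = candidates[i - 1]
            chosen.append(name)
            c -= cost
    return chosen[::-1], best[n][capacity]


# Hypothetical column candidates: (name, relevance, redundancy cost).
columns = [
    ("orders.id", 0.9, 1),
    ("orders.total", 0.8, 2),
    ("users.name", 0.3, 3),
    ("users.email", 0.6, 2),
]
linked, score = knapsack_select(columns, capacity=4)
```

In a hierarchical setup like the one the abstract describes, the same routine could be run once over tables and then again over the columns of the selected tables.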
Article 514
Title@2025-06-20 (5): Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond
Title: Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond | Sprachengpässe-Modelle: Ein Rahmen für interpretierbares Wissen auf Tracing und darüber hinaus | 语言瓶颈模式:可解释知识追踪框架及以后 2506.16982v1 |
Authors (2): Antonin Berthon, Mihaela van der Schaar
Accurately assessing student knowledge is critical for effective education, yet traditional Knowledge Tracing (KT) methods rely on opaque latent embeddings, limiting interpretability. Even LLM-based approaches generate direct predictions or summaries that may hallucinate without any accuracy guarantees. We recast KT as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. By constraining all predictive information to pass through a short natural-language bottleneck, LBMs ensure that the summary contains accurate information while remaining human-interpretable. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories. We demonstrate that training the encoder with group-relative policy optimization, using downstream decoding accuracy as a reward signal, effectively improves summary quality.
Article 515
Title@2025-06-20 (5): Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Title: Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework | Polysemantik mit PRISM erfassen: Ein Multi-Konzept-Feature Beschreibung Framework | 利用PRISM获得多边性能:多概念特征描述框架 2506.15538v2 |
Authors (7): Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M. -C. Höhne, Oliver Eberle
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
Article 516
Title@2025-06-20 (5): Latent Concept Disentanglement in Transformer-based Language Models
Title: Latent Concept Disentanglement in Transformer-based Language Models | Latent Concept Disentanglement in Transformer-basierten Sprachmodellen | 以变换器为基础的语言模型中的边端概念分解 2506.16975v1 |
Authors (6): Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy
When large language models (LLMs) use in-context learning (ICL) to solve a new task, they seem to grasp not only the goal of the task but also core, latent concepts in the demonstration examples. This begs the question of whether transformers represent latent structures as part of their computation or whether they take shortcuts to solve the problem. Prior mechanistic work on ICL does not address this question because it does not sufficiently examine the relationship between the learned representation and the latent concept, and the considered problem settings often involve only single-step reasoning. In this work, we examine how transformers disentangle and use latent concepts. We show that in 2-hop reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. In tasks parameterized by a continuous latent concept, we find low-dimensional subspaces in the representation space where the geometry mimics the underlying parameterization. Together, these results refine our understanding of ICL and the representation of transformers, and they provide evidence for highly localized structures in the model that disentangle latent concepts in ICL tasks.
Article 517
Title@2025-06-20 (5): PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Title: PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval | PromptDSI: Prompt-basiert Probefrei Instance-wise Incremental Learning for Document Retrieval | 快速DSI:为文件检索进行基于即时的无排练-不重复式递增学习 2406.12593v3 |
Authors (8): Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Yinwei Wei, Trung Le, Dragan Gasevic, Yuan-Fang Li, Thanh-Toan Do
Differentiable Search Index (DSI) utilizes pre-trained language models to perform indexing and document retrieval via end-to-end learning without relying on external indexes. However, DSI requires full re-training to index new documents, causing significant computational inefficiencies. Continual learning (CL) offers a solution by enabling the model to incrementally update without full re-training. Existing CL solutions in document retrieval rely on memory buffers or generative models for rehearsal, which is infeasible when accessing previous training data is restricted due to privacy concerns. To this end, we introduce PromptDSI, a prompt-based, rehearsal-free continual learning approach for document retrieval. PromptDSI follows the Prompt-based Continual Learning (PCL) framework, using learnable prompts to efficiently index new documents without accessing previous documents or queries. To improve retrieval latency, we remove the initial forward pass of PCL, which otherwise greatly increases training and inference time, with a negligible trade-off in performance. Additionally, we introduce a novel topic-aware prompt pool that employs neural topic embeddings as fixed keys, eliminating the instability of prompt key optimization while maintaining competitive performance with existing PCL prompt pools. In a challenging rehearsal-free continual learning setup, we demonstrate that PromptDSI variants outperform rehearsal-based baselines, match the strong cache-based baseline in mitigating forgetting, and significantly improve retrieval performance on new corpora.
Article 518
Title@2025-06-20 (5): Coreference as an indicator of context scope in multimodal narrative
Title: Coreference as an indicator of context scope in multimodal narrative | Koreferenz als Indikator für den Kontextumfang in multimodaler Erzählung | 共同参照作为多式联运说明中背景范围的一项指标 2503.05298v2 |
Authors (4): Nikolai Ilinykh, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality. Materials, metrics, and code for our study are available at https://github.com/GU-CLASP/coreference-context-scope.
Article 519
Title@2025-06-20 (5): LogProber: Disentangling confidence from contamination in LLM responses
Title: LogProber: Disentangling confidence from contamination in LLM responses | LogProber: Entwirren des Vertrauens in LLM-Antworten | 日志Prober:解除对LLM反应中污染的信心 2408.14352v3 |
Authors (3): Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri
In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. To date, only a few recent studies have attempted to address the issue of quantifying and detecting contamination in short text sequences, such as those commonly found in benchmarks. However, these methods have limitations that can sometimes render them impractical. In the present paper, we introduce LogProber, a novel, efficient algorithm that we show can detect contamination in a black-box setting and that tackles some of these drawbacks by focusing on familiarity with the question rather than the answer. Here, we explore the properties of the proposed method in comparison with concurrent approaches, identify its advantages and limitations, and illustrate how different forms of contamination can go undetected depending on the design of the detection algorithm.
Article 520
Title@2025-06-20 (5): Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Title: Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs | Schritt-für-Schritt-Verbesserung und überprüfbare medizinische Vernunft in MLLMs | 加强微低LLMs的逐步和可核实医疗理由 2506.16962v1 |
Authors (9): Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang
Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Code is available on GitHub at manglu097/Chiron-o1.
Article 521
Title@2025-06-20 (5): From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
Title: From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts | Von Daten zu Wissen: Bewertung, wie effizient Sprachmodelle Fakten lernen | 从数据到知识:评价如何高效语言模式学习事实 2506.16912v1 |
Authors (4): Daniel Christoph, Max Ploner, Patrick Haller, Alan Akbik
Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.
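The frequency-stratified analysis described above can be sketched as a simple bucketing step. This is an illustration only: the order-of-magnitude buckets, the function name `accuracy_by_frequency`, and the records are assumptions, not the authors' protocol or data.

```python
import math
from collections import defaultdict


def accuracy_by_frequency(records):
    """Group facts by order of magnitude of corpus frequency and report recall accuracy.

    records: iterable of (frequency, is_correct) with frequency >= 1.
    Returns {bucket: accuracy}, where bucket = floor(log10(frequency)).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, seen]
    for freq, correct in records:
        bucket = int(math.floor(math.log10(freq)))
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: hits / n for b, (hits, n) in sorted(totals.items())}


# Hypothetical probe results: (fact frequency in the training corpus, model answered correctly).
probe = [(3, False), (5, True), (40, True), (2500, True), (7000, True)]
per_bucket = accuracy_by_frequency(probe)
```

Comparing such per-bucket curves across models would show whether they diverge mainly in the low-frequency buckets, as the abstract reports.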
Article 522
Title@2025-06-20 (5): On Almost Surely Safe Alignment of Large Language Models at Inference-Time
Title: On Almost Surely Safe Alignment of Large Language Models at Inference-Time | Zur fast sicher sicheren Ausrichtung großer Sprachmodelle bei Inferenz-Time | 在推断时几乎可以安全地统一大语言模型 2502.01208v3 |
Authors (6): Xiaotong Ji, Shyam Sundhar Ramesh, Matthieu Zimmer, Ilija Bogunovic, Jun Wang, Haitham Bou Ammar
We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM’s latent space. We augment a safety state that tracks the evolution of safety constraints and dynamically penalize unsafe generations to ensure the generation of safe responses. Consequently, we demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses. Our findings contribute to the advancement of safer LLM deployment through alignment at inference-time, thus presenting a promising alternative to resource-intensive, overfitting-prone alignment techniques like RLHF.
Article 523
Title@2025-06-20 (5): Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
Title: Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models | Dynamische Wissensintegration für evidenzgetriebene Counter-Argument-Generation mit großen Sprachmodellen | 具有大语言模型的有说服力的知识集成,用大语言模型进行证据驱动的反造反投标书生成 2503.05328v2 |
Authors (3): Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri
This paper investigates the role of dynamic external knowledge integration in improving counter-argument generation using Large Language Models (LLMs). While LLMs have shown promise in argumentative tasks, their tendency to generate lengthy, potentially unfactual responses highlights the need for more controlled and evidence-based approaches. We introduce a new manually curated dataset of argument and counter-argument pairs specifically designed to balance argumentative complexity with evaluative feasibility. We also propose a new LLM-as-a-Judge evaluation methodology that shows a stronger correlation with human judgments compared to traditional reference-based metrics. Our experimental results demonstrate that integrating dynamic external knowledge from the web significantly improves the quality of generated counter-arguments, particularly in terms of relatedness, persuasiveness, and factuality. The findings suggest that combining LLMs with real-time external knowledge retrieval offers a promising direction for developing more effective and reliable counter-argumentation systems.
Article 524
Title@2025-06-20 (5): Deep Learning based Visually Rich Document Content Understanding: A Survey
Title: Deep Learning based Visually Rich Document Content Understanding: A Survey | Deep Learning based Visually Rich Document Content Understanding: Eine Umfrage | 基于深层学习的视觉丰富文件内容理解:调查 2408.01287v2 |
Authors (4): Yihao Ding, Soyeon Caren Han, Jean Lee, Eduard Hovy
Visually Rich Documents (VRDs) play a vital role in domains such as academia, finance, healthcare, and marketing, as they convey information through a combination of text, layout, and visual elements. Traditional approaches to extracting information from VRDs rely heavily on expert knowledge and manual annotation, making them labor-intensive and inefficient. Recent advances in deep learning have transformed this landscape by enabling multimodal models that integrate vision, language, and layout features through pretraining, significantly improving information extraction performance. This survey presents a comprehensive overview of deep learning-based frameworks for VRD Content Understanding (VRD-CU). We categorize existing methods based on their modeling strategies and downstream tasks, and provide a comparative analysis of key components, including feature representation, fusion techniques, model architectures, and pretraining objectives. Additionally, we highlight the strengths and limitations of each approach and discuss their suitability for different applications. The paper concludes with a discussion of current challenges and emerging trends, offering guidance for future research and practical deployment in real-world scenarios.
Article 525
Title@2025-06-20 (5): Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Title: Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation | Anpassung beim Lernen: LLMs für wissenschaftliche Probleme mit intelligenter Werkzeugverwendung anpassen | 在学习期间适应适应:利用智能工具适应科学问题定位LMS 2411.00412v4 |
Authors (6): Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu
Large Language Models (LLMs) demonstrate promising capabilities in solving scientific problems but often suffer from the issue of hallucination. While integrating LLMs with tools can mitigate this issue, models fine-tuned on tool usage become overreliant on them and incur unnecessary costs. Inspired by how human experts assess problem complexity before selecting solutions, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we categorize problems as easy or hard based on the model’s accuracy, and train it to maintain direct reasoning for easy problems while switching to tools for hard ones. We validate our method on six scientific benchmark datasets across climate science, epidemiology, physics, and other domains. Compared to the original instruct model (8B), models post-trained with AWL achieve 29.11% higher answer accuracy and 12.72% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on four custom-created datasets. Our code is open-source at https://github.com/Rose-STL-Lab/Adapting-While-Learning.
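The easy/hard split in Tool Usage Adaptation can be pictured as a thresholding step over the model's own accuracy. The sketch below is a guess at the general shape only: the 0.5 threshold, the function name `assign_training_mode`, and the per-question accuracies are hypothetical and do not come from the paper or its repository.

```python
def assign_training_mode(direct_accuracy, threshold=0.5):
    """Map each question to the behavior the model should be trained to show.

    direct_accuracy: {question_id: accuracy of the model answering without tools}.
    Questions at or above the threshold count as easy (answer directly);
    the rest count as hard (switch to a tool).
    """
    return {
        qid: "direct" if acc >= threshold else "tool"
        for qid, acc in direct_accuracy.items()
    }


# Hypothetical accuracies measured over several sampled direct answers per question.
measured = {"climate-01": 0.9, "epi-07": 0.2, "physics-13": 0.5}
modes = assign_training_mode(measured)
```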
Article 526
Title@2025-06-20 (5): More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Title: More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models | Mehr denken, weniger sehen? Bewertung verstärkter Halluzinationen in multimodalen Vernunftmodellen | 更多思考,少见? 评估多模式理由模型中放大的幻觉 2505.21523v3 |
Authors (8): Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model’s perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
Article 527
Title@2025-06-20 (5): Cost-effective Instruction Learning for Pathology Vision and Language Analysis
Title: Cost-effective Instruction Learning for Pathology Vision and Language Analysis | Kostengünstiges Instruktionslernen für Pathologie Vision und Sprachanalyse | 具有成本效益的病理学愿景和语言分析教学 2407.17734v2 |
Authors (11): Kaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, Shaoting Zhang
The advent of vision-language models fosters interactive conversations between AI-enabled models and humans. Yet applying these models in clinics must contend with daunting challenges around large-scale training data and financial and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology named CLOVER. CLOVER trains only a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. On two benchmark datasets, our findings reveal the strength of hybrid-form instructions for visual question answering in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated from GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot learning on an external clinical dataset. These findings demonstrate that the cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in digital pathology.
Article 528
Title@2025-06-20 (5): Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs’ Multi-turn Instruction-Following Ability
Title: Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs’ Multi-turn Instruction-Following Ability | Fragen, Scheitern, Wiederholen: Meeseeks, ein iterativer Feedback-Benchmark für die Multiturn-Instruction-Following-Fähigkeit von LLMs | 问题,失败,重复:Meeseeks,LLLM女士多功能指示-执行能力的一个循环反馈基准 2504.21625v4 |
Authors (7): Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, Xunliang Cai
The ability to follow instructions accurately is fundamental for Large Language Models (LLMs) to serve as reliable agents in real-world applications. For complex instructions, LLMs often struggle to fulfill all requirements in a single attempt. In practice, users typically provide iterative feedback until the LLM generates a response that meets all requirements. However, existing instruction-following benchmarks are either single-turn or introduce new requirements in each turn without allowing self-correction. To address this gap, we propose Meeseeks. Meeseeks simulates realistic human-LLM interactions through an iterative feedback framework, which enables models to self-correct based on specific requirement failures in each turn, better reflecting real-world user-end usage patterns. Meanwhile, the benchmark implements a comprehensive evaluation system with 38 capability tags organized across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks provides valuable insights into LLMs’ instruction-following capabilities in multi-turn scenarios.
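The iterative feedback idea can be sketched as a small loop: the model answers, each requirement is checked, and the names of the failed requirements are fed back on the next turn. This is only an illustration of the general pattern. `run_feedback_loop`, the `respond` callable, and the toy checks are hypothetical and do not reflect the benchmark's actual code or its 38 capability tags.

```python
def run_feedback_loop(respond, checks, max_turns=3):
    """Ask, check, and retry until every requirement passes or turns run out.

    respond: callable taking the list of failed requirement names and
             returning a new response string (stands in for the LLM).
    checks:  dict mapping a requirement name to a predicate on the response.
    Returns (turn_number, response), or (None, last_response) on failure.
    """
    feedback = []
    response = None
    for turn in range(1, max_turns + 1):
        response = respond(feedback)
        # Collect the requirements that this response still violates.
        feedback = [name for name, check in checks.items() if not check(response)]
        if not feedback:
            return turn, response
    return None, response


def toy_model(feedback):
    # Hypothetical stand-in: fixes its answer only after being told what failed.
    return "HELLO WORLD" if feedback else "hello"


toy_checks = {
    "uppercase": str.isupper,
    "two_words": lambda r: len(r.split()) == 2,
}
print(run_feedback_loop(toy_model, toy_checks))
```

Returning the turn number makes it possible to report how many rounds of feedback a model needs, which is the kind of multi-turn signal a single-turn benchmark cannot capture.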
Article 529
Title@2025-06-20 (5): MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Title: MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning | MIST: Jailbreaking Black-Box Large Language Models über iterative Semantic Tuning | MIST: 通过迭代性语义图纸的黑盒黑盒大语言模型 2506.16792v1 |
Authors (5): Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han
Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks, methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and a limited query budget. To address these issues, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search and its advanced version, order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.
Article 530
Title@2025-06-20 (5): Reimagining Urban Science: Scaling Causal Inference with Large Language Models
Title: Reimagining Urban Science: Scaling Causal Inference with Large Language Models | Reimagining Urban Science: Skalierung von Kausalität mit großen Sprachmodellen | 重新想象城市科学:与大语言模型的大规模因果推断 2504.12345v3 |
Authors (11): Yutong Xia, Ao Qu, Yunhan Zheng, Yihong Tang, Dingyi Zhuang, Yuxuan Liang, Shenhao Wang, Cathy Wu, Lijun Sun, Roger Zimmermann, Jinhua Zhao
Urban causal research is essential for understanding the complex, dynamic processes that shape cities and for informing evidence-based policies. However, current practices are often constrained by inefficient and biased hypothesis formulation, challenges in integrating multimodal data, and fragile experimental methodologies. Imagine a system that automatically estimates the causal impact of congestion pricing on commute times by income group or measures how new green spaces affect asthma rates across neighborhoods using satellite imagery and health reports, and then generates comprehensive, policy-ready outputs, including causal estimates, subgroup analyses, and actionable recommendations. In this Perspective, we propose UrbanCIA, an LLM-driven conceptual framework composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy insights. We begin by examining the current landscape of urban causal research through a structured taxonomy of research topics, data sources, and methodological approaches, revealing systemic limitations across the workflow. Next, we introduce the design principles and technological roadmap for the four modules in the proposed framework. We also propose evaluation criteria to assess the rigor and transparency of these AI-augmented processes. Finally, we reflect on the broader implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces LLM-driven tools as catalysts for more scalable, reproducible, and inclusive urban research.
Article 531
Title@2025-06-20 (5): Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Title: Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry | Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry | Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v2 |
Authors (10): Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia
Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match latency or improve it by 10-30%.
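To make the split-then-re-aggregate problem concrete, here is a minimal sketch in which every fragment carries a tuple recording where it came from, so that out-of-order partial outputs can be regrouped later. This is a loose illustration of ancestry-style metadata in general, not the framework's API. The `(doc_id, sentence_idx)` tuple and the `regroup` helper are assumptions made up for this example.

```python
from collections import defaultdict


def regroup(fragments):
    """Re-aggregate sentence-level fragments into their parent documents.

    fragments: iterable of ((doc_id, sentence_idx), text) pairs that may
    arrive in any order, as they would from parallel workers.
    Returns a dict mapping doc_id to its sentences in original order.
    """
    groups = defaultdict(list)
    for (doc_id, sentence_idx), text in fragments:
        groups[doc_id].append((sentence_idx, text))
    # Sorting by sentence index restores the order lost during streaming.
    return {doc_id: [text for _, text in sorted(parts)]
            for doc_id, parts in groups.items()}


arrived = [((0, 1), "second"), ((1, 0), "other doc"), ((0, 0), "first")]
print(regroup(arrived))
```

A real system would also need to know when a group is complete (for example, by carrying the expected fragment count in the metadata), which this sketch leaves out.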
Article 532
Title@2025-06-20 (5): DistillNote: LLM-based clinical note summaries improve heart failure diagnosis
Title: DistillNote: LLM-based clinical note summaries improve heart failure diagnosis | DistillNote: Zusammenfassungen auf LLM-Basis verbessern die Diagnose der Herzinsuffizienz | 蒸馏注:基于LLM的临床说明摘要改善心脏病诊断 2506.16777v1 |
Authors (5): Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto
Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present DistillNote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure, compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.
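A figure such as "79% text compression" can be read as the fraction of the original text that the summary removes. The sketch below assumes a character-based definition. The abstract does not state whether characters, words, or tokens are counted, so `compression` and its unit are assumptions, and the strings are placeholders rather than clinical text.

```python
def compression(original, summary):
    """Fraction of the original length removed by the summary (character-based)."""
    if not original:
        raise ValueError("original text must be non-empty")
    return 1.0 - len(summary) / len(original)


# Placeholder strings: a 100-character note reduced to 21 characters.
note = "x" * 100
short = "x" * 21
print(f"{compression(note, short):.0%}")
```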
Article 533
Title@2025-06-20 (5): SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Title: SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation | SSR-Zero: Einfaches Selbstveredelungslernen für maschinelle Übersetzung | 机械翻译简单自评强化学习 2505.16637v3 |
Authors (5): Wenjie Yang, Mao Zheng, Mingyang Song, Zheng Li, Sitong Wang
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Trained with SSR on 13K monolingual examples with Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.
Article 534
Title@2025-06-20 (5): Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Title: Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models | Cross-Modal Obfuskation für Jailbreak Attacken auf große Vision-Sprache Modelle | 对大型视觉-语言模型进行越狱袭击的跨模式阻断 2506.16760v1 |
Authors (7): Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu
Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs’ cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO’s effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
Article 535
Title@2025-06-20 (5): SocialSim: Towards Socialized Simulation of Emotional Support Conversation
Title: SocialSim: Towards Socialized Simulation of Emotional Support Conversation | SocialSim: Auf dem Weg zu einer sozialisierten Simulation emotionaler Unterstützungsgespräche | 社会观:社会化模拟情感支持对话 2506.16756v1 |
Authors (9): Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou, Xiyao Xiao, Si Chen, Hongning Wang, Minlie Huang
Emotional support conversation (ESC) helps reduce people’s psychological stress and provides emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass that of crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.
Article 536
Title@2025-06-20 (5): Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly
Title: Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly | Sprachinformierte Synthese von Rational Agent-Modellen für geerdete Theorie-von-Mind-Gründung On-The-Fly | 理论理论理论理论理论理论理论理论理论理论理论基础理论 理论理论理论基础理论理论模型的语言综合 2506.16755v1 |
Authors (9): Lance Ying, Ryan Truong, Katherine M. Collins, Cedegao E. Zhang, Megan Wei, Tyler Brooke-Wilson, Tan Zhi-Xuan, Lionel Wong, Joshua B. Tenenbaum
Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
Article 537
Title@2025-06-20 (5): A Structured Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy
Title: A Structured Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy | Ein strukturierter Datensatz von Krankheit-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit | 改善诊断准确性疾病 – – 症状协会结构数据集 2506.13610v2 |
Authors (4): Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan
Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered by analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only verified medical sources were included in the dataset, while non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases and the remaining columns represent symptoms. Each symptom cell contains a binary value (1 or 0), indicating whether a symptom is associated with a disease (1 for presence, 0 for absence). This structured representation makes the dataset useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there have been some advances in disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance.
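The tabular layout described above (first column for the disease, one binary column per symptom) can be read with the standard library alone. The sketch below is illustrative: the column header `disease`, the symptom names, and the two rows in `SAMPLE` are made-up toy values, not entries from the dataset.

```python
import csv
import io

# Toy table in the described layout; every value here is invented for illustration.
SAMPLE = "disease,fever,cough,rash\nflu,1,1,0\nmeasles,1,0,1\n"


def load_table(text, disease_column="disease"):
    """Map each disease to the set of symptoms whose cell is 1."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row.pop(disease_column)
        # A cell of "1" marks presence; "0" marks absence.
        table[name] = {symptom for symptom, value in row.items() if value == "1"}
    return table


table = load_table(SAMPLE)
print(sorted(table["flu"]))
```

The same rows could be turned into binary feature vectors for a classifier by keeping the 0/1 cells instead of collapsing them into sets.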
Article 538
Title@2025-06-20 (5): Group-Level Data Selection for Efficient Pretraining
Title: Group-Level Data Selection for Efficient Pretraining | Gruppen-Level-Datenauswahl für effizientes Vortraining | 高效预科培训的集团一级数据选择 2502.14709v2 |
Authors (6): Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4% relative performance gains over random selection across 22 downstream tasks, nearly doubling the improvements achieved by state-of-the-art individual data selection baselines. Furthermore, Group-MATES reduces the number of tokens required to reach a certain downstream performance by up to 1.75x, substantially elevating the speed-quality frontier. Further analyses highlight the critical role of relationship weights in the relational data influence model and the effectiveness of our cluster-based inference. Our code is open-sourced at https://github.com/facebookresearch/Group-MATES.
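The cluster-then-select step can be sketched as follows. Note the substitution: the abstract selects within each cluster using a relational data influence model, whereas this sketch swaps in a plain top-k over a precomputed score per item, purely to show why partitioning makes selection independent across clusters. `select_per_cluster`, the tuple layout, and `keep_ratio` are hypothetical.

```python
from collections import defaultdict


def select_per_cluster(items, keep_ratio=0.5):
    """Keep the highest-scoring fraction of items inside each cluster.

    items: list of (item_id, cluster_id, score) tuples. The score stands in
    for an influence estimate; how it is computed is outside this sketch.
    """
    clusters = defaultdict(list)
    for item_id, cluster_id, score in items:
        clusters[cluster_id].append((score, item_id))

    selected = []
    for cluster_id in sorted(clusters):
        # Each cluster is ranked on its own, so clusters could run in parallel.
        members = sorted(clusters[cluster_id], reverse=True)
        keep = max(1, round(len(members) * keep_ratio))
        selected.extend(item_id for _, item_id in members[:keep])
    return selected


data = [("a", 0, 0.9), ("b", 0, 0.1), ("c", 1, 0.5), ("d", 1, 0.7)]
print(select_per_cluster(data))
```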
Article 539
Title@2025-06-20 (5): LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
Title: LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | LM-SPT: LM-basierte semantische Destillation für Sprachtokenisierung | LM-SPT: LM 统一语法的语义蒸馏 2506.16738v1 |
Authors (4): Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim
With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
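For reference, the rigid average pooling that the abstract names as the standard way to lower the frame rate looks like the sketch below. This illustrates the baseline technique only, not the proposed distillation. The function name and the list-of-lists feature layout are assumptions, and a pooling factor of 2 would correspond to halving the rate, for example from 25Hz to 12.5Hz.

```python
def average_pool(frames, factor):
    """Average every `factor` consecutive feature frames into one frame.

    frames: list of equal-length feature vectors (lists of floats).
    Trailing frames that do not fill a complete window are dropped.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        # Average each feature dimension across the frames in the window.
        pooled.append([sum(column) / factor for column in zip(*window)])
    return pooled


features = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0]]
print(average_pool(features, 2))
```

Because every window is averaged with fixed boundaries, distinct neighbouring units can be blended together, which is the dilution of semantic structure that motivates the reconstruction-based supervision described above.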
Article 540
Title@2025-06-20 (5): The Role of Model Confidence on Bias Effects in Measured Uncertainties
Title: The Role of Model Confidence on Bias Effects in Measured Uncertainties | Die Rolle des Modellvertrauens auf Bias-Effekte bei gemessenen Unsicherheiten | 信任模式在衡量不确定性方面对 beas 影响的影响的作用 2506.16724v1 |
Authors (3): Xinyi Liu, Weiguang Wang, Hangfeng He
With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model’s lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.
Article 541
Title@2025-06-20 (5): ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models
Title: ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models | ReasonGRM: Generative Reward-Modelle durch große Reasoning-Modelle verbessern | 理由GRM:通过大理由模型加强奖励奖励模式 2506.16712v1 |
Authors (6): Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao
Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.
Article 542
Title@2025-06-20 (5): Techniques for supercharging academic writing with generative AI
Title: Techniques for supercharging academic writing with generative AI | Techniken zur Aufladung akademischer Schriften mit generativer KI | 具有传宗传宗传宗接代的超级奖学金学术写作技术 2310.17143v4 |
Authors (1): Zhicheng Lin
Academic writing is an indispensable yet laborious part of the research enterprise. This Perspective maps out principles and methods for using generative artificial intelligence (AI), specifically large language models (LLMs), to elevate the quality and efficiency of academic writing. We introduce a human-AI collaborative framework that delineates the rationale (why), process (how), and nature (what) of AI engagement in writing. The framework pinpoints both short-term and long-term reasons for engagement and their underlying mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals the role of AI throughout the writing process, conceptualized through a two-stage model for human-AI collaborative writing, and the nature of AI assistance in writing, represented through a model of writing-assistance types and levels. Building on this framework, we describe effective prompting techniques for incorporating AI into the writing routine (outlining, drafting, and editing) as well as strategies for maintaining rigorous scholarship, adhering to varied journal policies, and avoiding overreliance on AI. Ultimately, the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.
Article 543
Title@2025-06-20 (5): MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Title: MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | MaPPER: Multimodaler vorgeführter Parameter Effizientes Tuning für die Referenzierung von Expression-Verständnis | MaPPER: 参考表达式理解的多式前制导参数效率计图 2409.13609v4 |
Authors (6): Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin
Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, fully fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the domain-specific abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters. Our code is available at https://github.com/liuting20/MaPPER.
Article 544
Title@2025-06-20 (5): Large Language Models as Psychological Simulators: A Methodological Guide
Title: Large Language Models as Psychological Simulators: A Methodological Guide | Große Sprachmodelle als Psychologische Simulatoren: Ein methodischer Leitfaden | 《作为心理模拟器的大型语言模式:方法指南》 2506.16702v1 |
Authors (1): Zhicheng Lin
Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. This article provides a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, we present methods for developing psychologically grounded personas that move beyond demographic categories, with strategies for validation against human data and use cases ranging from studying inaccessible populations to prototyping research instruments. For cognitive modeling, we synthesize emerging approaches for probing internal representations, methodological advances in causal interventions, and strategies for relating model behavior to human cognition. We address overarching challenges including prompt sensitivity, temporal limitations from training data cutoffs, and ethical considerations that extend beyond traditional human subjects review. Throughout, we emphasize the need for transparency about model capabilities and constraints. Together, this framework integrates emerging empirical evidence about LLM performance (including systematic biases, cultural limitations, and prompt brittleness) to help researchers wrangle these challenges and leverage the unique capabilities of LLMs in psychological research.
Article 545
Title@2025-06-20 (5): GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation
Title: GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation | GraphRAG-Bench: Herausfordernde Domain-spezifische Begründung für die Auswertung der Graph Retrieval-Augmented Generation | 图图RAG-Bench:评估图回收-提款一代的有挑战性域特定原因 2506.02404v3 |
Authors (8): Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian-wen Zhang, Di Yin, Xing Sun, Xiao Huang
Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key advantages: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
Article 546
Title@2025-06-20 (5): From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology
Title: From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology | Von Prompts zu Constructs: Ein Dual-Validity-Rahmenwerk für LLM-Forschung in der Psychologie | 从提示到构造:心理学法学硕士研究的双重价值框架 2506.16697v1 |
Authors (1): Zhicheng Lin
Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms: statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field’s foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output (endorsing “I am anxious”) requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.
Article 547
Title@2025-06-20 (5): LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions
Title: LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions | LLMs in der Krankheitsdiagnose: Eine vergleichende Studie von DeepSeek-R1 und O3 Mini Across Chronic Health Conditions | 疾病诊断中的LLMs:在整个慢性健康状况中深海Seek-R1和O3 Mini的比较研究 2503.10486v2 |
Authors (5): Gaurav Kumar Gupta, Pranal Pande, Nirajan Acharya, Aniket Kumar Singh, Suman Niroula
Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.
Article 548
Title@2025-06-20 (5): Theoretical Guarantees for Minimum Bayes Risk Decoding
Title: Theoretical Guarantees for Minimum Bayes Risk Decoding | Theoretische Garantien für die Risikodekodierung von Mindestbuchten | 最低比亚最低风险编码理论保障 2502.12685v3 |
Authors (5): Yuki Ichihara, Yuu Jinnai, Kaito Ariu, Tetsuro Morimura, Eiji Uchibe
Minimum Bayes Risk (MBR) decoding optimizes output selection by maximizing the expected utility value of an underlying human distribution. While prior work has shown the effectiveness of MBR decoding through empirical evaluation, few studies have analytically investigated why the method is effective. As a result of our analysis, we show that, given the size $n$ of the reference hypothesis set used in computation, MBR decoding approaches the optimal solution with high probability at a rate of $O\left(n^{-\frac{1}{2}}\right)$, under certain assumptions, even though the language space $Y$ is significantly larger ($|Y| \gg n$). This result helps to theoretically explain the strong performance observed in several prior empirical studies on MBR decoding. In addition, we provide the performance gap for maximum-a-posteriori (MAP) decoding and compare it to MBR decoding. The result of this paper indicates that MBR decoding tends to converge to the optimal solution faster than MAP decoding in several cases.
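As a rough illustration of the selection rule described in this abstract, the sketch below picks the candidate with the highest average utility against a set of $n$ reference hypotheses, which is a Monte Carlo estimate of expected utility. The function names and the token-overlap utility are hypothetical stand-ins chosen for illustration; the paper does not prescribe them, and practical systems typically use a learned or task-specific utility.

```python
from typing import Callable, Sequence


def mbr_decode(
    candidates: Sequence[str],
    references: Sequence[str],
    utility: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest mean utility over the references.

    The mean over `references` is a sample estimate of the expected utility;
    n = len(references) plays the role of the reference set size.
    """
    def expected_utility(candidate: str) -> float:
        return sum(utility(candidate, ref) for ref in references) / len(references)

    return max(candidates, key=expected_utility)


def token_overlap(a: str, b: str) -> float:
    """Toy utility (an assumption): Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


# Model samples serve as both candidates and references, a common setup.
samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
best = mbr_decode(samples, samples, token_overlap)
```

Here the output that agrees most with the other samples is chosen, rather than the single most probable one, which is the contrast with MAP decoding that the abstract draws.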
Article 549
Title@2025-06-20 (5): LegiGPT: Party Politics and Transport Policy with Large Language Model
Title: LegiGPT: Party Politics and Transport Policy with Large Language Model | LegiGPT: Parteipolitik und Verkehrspolitik mit großem Sprachmodell | 友好社:具有大语言模式的党政治和交通政策 2506.16692v1 |
Authors (2): Hyunsoo Yun, Eun Hak Lee
Given the significant influence of lawmakers’ political ideologies on legislative decision-making, understanding their impact on policymaking is critically important. We introduce a novel framework, LegiGPT, which integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. LegiGPT employs a multi-stage filtering and classification pipeline using zero-shot prompting with GPT-4. Using legislative data from South Korea’s 21st National Assembly, we identify key factors - including sponsor characteristics, political affiliations, and geographic variables - that significantly influence transportation policymaking. The LLM was used to classify transportation-related bill proposals through a stepwise filtering process based on keywords, phrases, and contextual relevance. XAI techniques were then applied to examine relationships between party affiliation and associated attributes. The results reveal that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, are critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initiating or supporting proposals. This integrated approach provides a valuable tool for understanding legislative dynamics and guiding future policy development, with broader implications for infrastructure planning and governance.
Article 550
Title@2025-06-20 (5): Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence | Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz | 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v2 |
Authors (10): Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code and websites are publicly available at our project page https://embodied-web-agent.github.io/.
Article 551
Title@2025-06-20 (5): Towards Safety Evaluations of Theory of Mind in Large Language Models
Title: Towards Safety Evaluations of Theory of Mind in Large Language Models | Zu Sicherheitsbewertungen der Geistestheorie in großen Sprachmodellen | 争取对大语言模式中思想理论进行安全评价 2506.17352v1 |
Authors (2): Tatsuhiro Aoshima, Mitsuaki Akiyama
As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs’ theory of mind, and discuss remaining challenges for future work.
Article 552
Title@2025-06-20 (5): Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations
Title: Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations | Mechanismen vs. Ergebnisse: Testen für Syntax kann die Leistung bei gezielten syntaktischen Bewertungen nicht erklären | 机制与结果:检验语法无法解释定向同步评估的绩效 2506.16678v1 |
Authors (4): Ananth Agarwal, Jasper Jian, Christopher D. Manning, Shikhar Murty
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations; however, no comprehensive study has yet established whether a model’s probing accuracy reliably predicts its downstream syntactic performance. Adopting a “mechanisms vs. outcomes” framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.
Article 553
Title@2025-06-20 (5): Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning
Title: Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning | Med-U1: Förderung der einheitlichen medizinischen Vernunft in LLMs durch großangelegtes Verstärkungslernen | Med-U1:通过大规模加强学习在LLMs中鼓励统一医疗理由 2506.12307v2 |
Authors (9): Xiaotian Zhang, Yuan Wang, Zhaopeng Feng, Ruizhe Chen, Zhijie Zhou, Yan Zhang, Hongxia Xu, Jian Wu, Zuozhu Liu
Medical Question-Answering (QA) encompasses a broad spectrum of tasks, including multiple choice questions (MCQ), open-ended text generation, and complex computational reasoning. Despite this variety, a unified framework for delivering high-quality medical QA has yet to emerge. Although recent progress in reasoning-augmented large language models (LLMs) has shown promise, their ability to achieve comprehensive medical understanding is still largely unexplored. In this paper, we present Med-U1, a unified framework for robust reasoning across medical QA tasks with diverse output formats, ranging from MCQs to complex generation and computation tasks. Med-U1 employs pure large-scale reinforcement learning with mixed rule-based binary reward functions, incorporating a length penalty to manage output verbosity. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks, surpassing even larger specialized and proprietary models. Furthermore, Med-U1 demonstrates robust generalization to out-of-distribution (OOD) tasks. Extensive analysis presents insights into training strategies, reasoning chain length control, and reward design for medical LLMs. Our code is available here.
Article 554
Title@2025-06-20 (5): Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM
Title: Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM | Zero-Shot Kognitive Impairment Erkennung von Sprache mit AudioLLM | 从使用音频LLM的演讲中检测出零热感知损伤 2506.17351v1 |
Authors (3): Mostafa Shahin, Beena Ahmed, Julien Epps
Cognitive impairment (CI) is of growing public health concern, and early detection is vital for effective intervention. Speech has gained attention as a non-invasive and easily collectible biomarker for assessing cognitive decline. Traditional CI detection methods typically rely on supervised models trained on acoustic and linguistic features extracted from speech, which often require manual annotation and may not generalise well across datasets and languages. In this work, we propose the first zero-shot speech-based CI detection method using the Qwen2-Audio AudioLLM, a model capable of processing both audio and text inputs. By designing prompt-based instructions, we guide the model in classifying speech samples as indicative of normal cognition or cognitive impairment. We evaluate our approach on two datasets: one in English and another multilingual, spanning different cognitive assessment tasks. Our results show that the zero-shot AudioLLM approach achieves performance comparable to supervised methods and exhibits promising generalizability and consistency across languages, tasks, and datasets.
Article 555
Title@2025-06-20 (5): Kinetics: Rethinking Test-Time Scaling Laws
Title: Kinetics: Rethinking Test-Time Scaling Laws | Kinetik: Überdenken von Test-Zeit-Skalierungsgesetzen | 动因:重新思考试验时间扩增法 2506.05333v3 |
Authors (6): Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important as more compute is invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation, and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
Article 556
Title@2025-06-20 (5): Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
Title: Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models | Adaptive Anleitung beschleunigt die Stärkung des Lernens von Vernunftmodellen | 适应性指导加速加速强化理性模型学习 2506.13923v2 |
Authors (6): Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean Hendryx
We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via “capability gain” in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B parameters on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive Guide – a new class of online training algorithms. Guide adaptively incorporates hints into the model’s context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the “off-policy” trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of Guide for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4% macro-average improvement across math benchmarks. We include careful ablations to analyze Guide’s components and theoretically analyze Guide’s learning efficiency.
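For readers unfamiliar with the pass@$k$ metric mentioned in this abstract, a minimal sketch follows. The abstract does not say how pass@$k$ is estimated; the code assumes the widely used unbiased estimator, which computes the chance that at least one of $k$ samples drawn without replacement from $n$ rollouts (of which $c$ are correct) is correct. The function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled rollouts, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k drawn rollouts are incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect rollouts: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With $k = 1$ this reduces to the fraction of correct rollouts, $c/n$, which is why raising pass@1 toward pass@$k$ can be read as concentrating probability on solutions the model could already reach.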
Article 557
Title@2025-06-19 (4): Arch-Router: Aligning LLM Routing with Human Preferences
Title: Arch-Router: Aligning LLM Routing with Human Preferences | Arch-Router: LLM-Routing mit menschlichen Präferenzen ausrichten | Arch- Router: 与人类首选对齐 LLM Routing 2506.16655v1 |
Authors (4): Co Tran, Salman Paracha, Adil Hafeez, Shuguang Chen
With the rapid proliferation of large language models (LLMs) – each optimized for different strengths, style, or latency/cost profile – routing has become an essential technique to operationalize the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. In this work, we propose a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) – offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Our approach also supports seamlessly adding new models for routing without requiring retraining or architectural modifications. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. Our approach captures subjective evaluation criteria and makes routing decisions more transparent and flexible. Our model is available at: https://huggingface.co/katanemo/Arch-Router-1.5B.
Article 558
Title@2025-06-19 (4): Learning to Route LLMs with Confidence Tokens
Title: Learning to Route LLMs with Confidence Tokens | Lernen, LLMs mit vertrauensvollen Token zu routen | 学习使用充满信心的LLMs路线 2410.13284v3 |
Authors (7): Yu-Neng Chuang, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu, Helen Zhou
Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-Reflection with Error-based Feedback (Self-REF), a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
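The routing decision described in this abstract can be sketched as a simple threshold rule. This is only an illustration under stated assumptions: `local_model`, `expert_model`, and the 0.5 threshold are hypothetical, and the sketch takes the confidence score as given rather than showing how Self-REF derives it from confidence tokens.

```python
from typing import Callable, Tuple

# Hypothetical interfaces: the local model returns (answer, confidence in [0, 1]);
# the expert is any fallback that maps a question to an answer.
LocalModel = Callable[[str], Tuple[str, float]]
ExpertModel = Callable[[str], str]


def route_with_confidence(
    question: str,
    local_model: LocalModel,
    expert_model: ExpertModel,
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Keep the local answer if its confidence clears the threshold; else defer."""
    answer, confidence = local_model(question)
    if confidence >= threshold:
        return answer, "local"
    return expert_model(question), "expert"


# Stub models for demonstration only.
def toy_local(question: str) -> Tuple[str, float]:
    return ("4", 0.9) if question == "2+2?" else ("unsure", 0.1)


def toy_expert(question: str) -> str:
    return "expert answer"


easy = route_with_confidence("2+2?", toy_local, toy_expert)
hard = route_with_confidence("hard question", toy_local, toy_expert)
```

Replacing the expert call with a fixed safe response would give the rejection setting the abstract also mentions; the threshold then trades off coverage against reliability.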
Article 559
Title@2025-06-19 (4): Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
Title: Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models | Layer-Wise Alignment: Prüfung der Sicherheitsausrichtung über Bild-Encoder-Ebenen in Vision-Sprachenmodellen | 图层对齐: 检查视觉语言模型中图像编码图层的安全对齐情况 2411.04291v2 |
Authors (9): Saketh Bachu, Erfan Shayegani, Rohit Lal, Trishna Chakraborty, Arindam Dutta, Chengyu Song, Yue Dong, Nael Abu-Ghazaleh, Amit K. Roy-Chowdhury
Vision-language models (VLMs) have improved significantly in their capabilities, but their complex architecture makes their safety alignment challenging. In this paper, we reveal an uneven distribution of harmful information across the intermediate layers of the image encoder and show that skipping a certain set of layers and exiting early can increase the chance of the VLM generating harmful responses. We call this the “Image enCoder Early-exiT” (ICET) vulnerability. Our experiments across three VLMs (LLaVA-1.5, LLaVA-NeXT, and Llama 3.2) show that performing early exits from the image encoder significantly increases the likelihood of generating harmful outputs. To tackle this, we propose a simple yet effective modification of the Clipped-Proximal Policy Optimization (Clip-PPO) algorithm for performing layer-wise multi-modal RLHF for VLMs. We term this Layer-Wise PPO (L-PPO). We evaluate our L-PPO algorithm across three multimodal datasets and show that it consistently reduces the harmfulness caused by early exits.
Article 560
Title@2025-06-19 (4): GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View
Title: GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View | GeoGuess: Multimodale Begründung auf Basis der Hierarchie visueller Informationen in der Straßenansicht | GeoGuess:基于街景视觉信息等级的多式联运理由 2506.16633v1 |
Authors (5): Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li
Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. In particular, the lack of reasoning over hierarchical visual clues at different levels of granularity (e.g., local details and global context) is little discussed, despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate them with vast geographic knowledge. Therefore, GeoGuess requires the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset, GeoExplain, which consists of panorama-geocoordinate-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense, which can make predictions and generate comprehensive explanations based on the hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.
Article 561
Title@2025-06-19 (4): Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System
Title: Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System | Erste Untersuchung der LLM-Assistenten Entwicklung eines regelbasierten klinischen NLP-Systems | 利用LLM协助开发有章可循的临床NLP系统的初步调查 2506.16628v1 |
Authors (2): Jianlin Shi, Brian T. Bucher
Despite advances in machine learning (ML) and large language models (LLMs), rule-based natural language processing (NLP) systems remain active in clinical settings due to their interpretability and operational efficiency. However, their manual development and maintenance are labor-intensive, particularly in tasks with large linguistic variability. To overcome these limitations, we propose a novel approach that employs LLMs solely during the development phase of rule-based systems. We conducted initial experiments focusing on the first two steps of developing a rule-based NLP pipeline: finding relevant snippets in the clinical note, and extracting informative keywords from the snippets for the rule-based named entity recognition (NER) component. Our experiments demonstrated exceptional recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and 1.0 in extracting key terms for NER. This study sheds light on a promising new direction for NLP development, enabling semi-automated or automated development of rule-based systems with significantly faster, more cost-effective, and transparent execution compared with deep learning model-based solutions.
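To make the second step concrete, the sketch below shows how a keyword list could drive a rule-based NER component once it has been extracted. It is a minimal illustration, not the authors' pipeline: the keyword list, the `PROBLEM` label, the sample note, and the helper names `build_keyword_pattern` and `tag_entities` are all hypothetical, and in the described approach the keywords would come from an LLM at development time rather than being typed in by hand.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive, whole-word regex from a keyword list."""
    # Longest keywords first, so multi-word terms win over their shorter parts.
    ordered = sorted(set(keywords), key=lambda k: (-len(k), k))
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def tag_entities(text, pattern, label):
    """Return (start, end, matched_text, label) for every keyword hit."""
    return [(m.start(), m.end(), m.group(0), label) for m in pattern.finditer(text)]


# Hypothetical keywords standing in for the LLM-extracted key terms.
keywords = ["pneumonia", "aspiration pneumonia", "fever"]
pattern = build_keyword_pattern(keywords)

note = "Patient has fever; imaging suggests aspiration pneumonia."
entities = tag_entities(note, pattern, "PROBLEM")
for start, end, text, label in entities:
    print(start, end, text, label)
```

Because the LLM is used only to produce the keyword list, the runtime component stays a plain, inspectable regular expression.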
Article 562
Title@2025-06-19 (4): Voice of a Continent: Mapping Africa’s Speech Technology Frontier
Title: Voice of a Continent: Mapping Africa’s Speech Technology Frontier | Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren | 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v2 |
Authors (6): AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed
Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.
Article 563
Title@2025-06-19 (4): Modeling Public Perceptions of Science in Media
Title: Modeling Public Perceptions of Science in Media | Modellierung öffentlicher Wahrnehmungen von Wissenschaft in Medien | 模拟公众对媒体科学的看法 2506.16622v1 |
Authors (4): Jiaxin Pei, Dustin Wright, Isabelle Augenstein, David Jurgens
Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals’ frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and a carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, a pattern that holds both across different scientific information and for the same science framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.
Article 564
Title@2025-06-19 (4): From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Title: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models | Vom RAG zum Speicher: Nicht parametrisches kontinuierliches Lernen für große Sprachmodelle | 从RAG到内存:为大语言模型进行非计量连续学习 2502.14802v2 |
Authors (5): Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
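The Personalized PageRank algorithm mentioned above can be sketched with a short power iteration. This is a generic textbook version under stated assumptions, not HippoRAG's implementation: the toy graph, the seed weights, the damping factor of 0.85, and the function name `personalized_pagerank` are illustrative, and the abstract does not say how the graph is built or how seeds are chosen.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power iteration for PageRank with restarts biased toward `seeds`.

    graph: dict mapping each node to a list of neighbours (every neighbour is a key).
    seeds: dict mapping seed nodes to positive restart weights.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    # Restart distribution: all restart mass goes to the seed nodes.
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if neighbours:
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical chain graph a - b - c - d, with the query linked to node "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = personalized_pagerank(graph, seeds={"a": 1.0})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Nodes close to the seed accumulate more score than distant ones, which is what makes the ranking query-specific rather than global.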
Article 565
Title@2025-06-19 (4): A Survey of Automatic Hallucination Evaluation on Natural Language Generation
Title: A Survey of Automatic Hallucination Evaluation on Natural Language Generation | Eine Übersicht über die automatische Halluzination der natürlichen Sprachgenerierung | 自然语言生成自动幻觉评价调查 2404.12041v3 |
Authors (4): Siya Qi, Lin Gui, Yulan He, Zheng Yuan
The proliferation of Large Language Models (LLMs) has introduced a critical challenge: accurate hallucination evaluation that ensures model reliability. While Automatic Hallucination Evaluation (AHE) has emerged as essential, the field suffers from methodological fragmentation, hindering both theoretical understanding and practical advancement. This survey addresses this critical gap through a comprehensive analysis of 74 evaluation methods, revealing that 74% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a unified evaluation pipeline encompassing datasets and benchmarks, evidence collection strategies, and comparison mechanisms, systematically documenting the evolution from pre-LLM to post-LLM methodologies. Beyond taxonomical organization, we identify fundamental limitations in current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and integration of application-specific evaluation criteria, ultimately providing a roadmap for developing more robust and practical hallucination evaluation systems.
Article 566
Title@2025-06-19 (4): Learning to Refine with Fine-Grained Natural Language Feedback
Title: Learning to Refine with Fine-Grained Natural Language Feedback | Lernen, mit feinkörnigen natürlichen Sprachfeedback zu verfeinern | 学习精细自然语言反馈 2407.02397v3 |
Authors (4): Manya Wadhwa, Xinyu Zhao, Junyi Jessy Li, Greg Durrett
Recent work has explored the capability of large language models (LLMs) to identify and correct errors in LLM-generated responses. These refinement approaches frequently evaluate what sizes of models are able to do refinement for what problems, but less attention is paid to what effective feedback for refinement looks like. In this work, we propose looking at refinement with feedback as a composition of three distinct LLM competencies: (1) detection of bad generations; (2) fine-grained natural language critique generation; (3) refining with fine-grained feedback. The first step can be implemented with a high-performing discriminative model, and steps 2 and 3 can be implemented via either prompted or fine-tuned LLMs. A key property of the proposed Detect, Critique, Refine (“DCR”) method is that the step 2 critique model can give fine-grained feedback about errors, made possible by offloading the discrimination to a separate model in step 1. We show that models of different capabilities benefit from refining with DCR on the task of improving factual consistency of document-grounded summaries. Overall, DCR consistently outperforms existing end-to-end refinement approaches and current trained models not fine-tuned for factuality critiquing.
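The three-step composition described above can be sketched as a small pipeline in which each step is a pluggable callable. This is only a structural illustration: the function `detect_critique_refine` and the toy stand-ins are hypothetical, the toy critique is a list of tokens rather than natural language, and in the paper the steps would be a discriminative model and prompted or fine-tuned LLMs.

```python
def detect_critique_refine(document, summary, detect, critique, refine):
    """Run the three DCR steps; return the summary unchanged if nothing is flagged."""
    # Step 1: a discriminative check decides whether refinement is needed at all.
    if not detect(document, summary):
        return summary
    # Step 2: produce a fine-grained critique of the flagged summary.
    feedback = critique(document, summary)
    # Step 3: rewrite the summary using that critique.
    return refine(document, summary, feedback)


# Toy stand-ins (hypothetical): a token is "unsupported" if absent from the document.
def unsupported_tokens(document, summary):
    doc_tokens = set(document.lower().split())
    return [tok for tok in summary.split() if tok.lower() not in doc_tokens]


def toy_detect(document, summary):
    return bool(unsupported_tokens(document, summary))


def toy_critique(document, summary):
    return unsupported_tokens(document, summary)


def toy_refine(document, summary, feedback):
    return " ".join(tok for tok in summary.split() if tok not in feedback)


document = "the cat sat on the mat"
refined = detect_critique_refine(
    document, "the cat sat on the red mat", toy_detect, toy_critique, toy_refine
)
unchanged = detect_critique_refine(
    document, "the cat sat", toy_detect, toy_critique, toy_refine
)
print(refined)
print(unchanged)
```

Keeping detection separate means the critique and refine steps only run on outputs that were actually flagged, which mirrors the offloading argument in the abstract.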
Article 567
Title@2025-06-19 (4): Using Natural Language Explanations to Rescale Human Judgments
Title: Using Natural Language Explanations to Rescale Human Judgments | Natürliche Spracherklärungen verwenden, um menschliche Urteile neu zu skalieren | 使用自然语言解释来调整人类判决书的规模 2305.14770v6 |
Authors (4): Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett
The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over human judgments. However, annotators’ judgments for subjective tasks can differ in many ways: they may reflect different qualitative judgments about an example, and they may be mapped to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators’ Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators’ underlying assessments of the example. The rubric can be designed or modified after annotation, and include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
Article 568
Title@2025-06-19 (4): A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications
Title: A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications | Eine Bewertung der synthetischen Datengenerierung für biomedizinische Forschung und Anwendungen | 生物医学研究和应用合成数据生成范围审查 2506.16594v1 |
Authors (6): Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang
Synthetic data generation, which mitigates data scarcity, privacy concerns, and data quality challenges in biomedical fields, has been facilitated by rapid advances in large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting LLMs (72.9%), fine-tuning LLMs (22.0%), and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.
Article 569
Title@2025-06-19 (4): Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework
Title: Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework | Messung eines (ausreichenden) Weltmodells in LLMs: Ein Rahmen für die Abweichungszersetzung | 计量(足够)LLMM世界模型:差异分解框架 2506.16584v1 |
Authors (2): Nadav Kunievsky, James A. Evans
Understanding whether large language models (LLMs) possess a world model, that is, a structured understanding of the world that supports generalization beyond surface-level patterns, is central to assessing their reliability, especially in high-stakes applications. We propose a formal framework for evaluating whether an LLM exhibits a sufficiently robust world model, defined as producing consistent outputs across semantically equivalent prompts while distinguishing between prompts that express different intents. We introduce a new evaluation approach that decomposes model response variability into three components: variability due to user purpose, user articulation, and model instability. An LLM with a strong world model should attribute most of the variability in its responses to changes in foundational purpose rather than superficial changes in articulation. This approach allows us to quantify how much of a model’s behavior is semantically grounded rather than driven by model instability or alternative wording. We apply this framework to evaluate LLMs across diverse domains. Our results show that larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. This improvement is not uniform, however: larger models do not consistently outperform smaller ones across all domains, and their advantage in robustness is often modest. These findings highlight the importance of moving beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model’s internal understanding of the world.
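One way to picture a three-way split of response variability is a nested application of the law of total variance. The sketch below is an illustration under assumptions, not the paper's estimator: it assumes each response has already been reduced to a single number, that the design is balanced (same number of articulations per purpose and of samples per cell), and the names `decompose_variance`, `book_flight`, and `cancel_flight` are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_variance(scores):
    """Split total variance into purpose, articulation, and instability parts.

    scores[purpose][articulation] is a list of repeated numeric responses.
    """
    cell_means = {
        p: {a: fmean(samples) for a, samples in arts.items()}
        for p, arts in scores.items()
    }
    purpose_means = {p: fmean(list(m.values())) for p, m in cell_means.items()}
    # Variance of purpose-level means: variability explained by user purpose.
    between_purpose = pvariance(list(purpose_means.values()))
    # Average variance of articulation means within a purpose: wording effects.
    between_articulation = fmean(
        [pvariance(list(m.values())) for m in cell_means.values()]
    )
    # Average variance across repeated samples of one prompt: model instability.
    within_cell = fmean(
        [pvariance(samples) for arts in scores.values() for samples in arts.values()]
    )
    return between_purpose, between_articulation, within_cell


# Hypothetical numeric responses: 2 purposes x 2 articulations x 2 samples.
scores = {
    "book_flight": {"v1": [1.0, 1.0], "v2": [1.0, 3.0]},
    "cancel_flight": {"v1": [5.0, 7.0], "v2": [6.0, 6.0]},
}
purpose, articulation, instability = decompose_variance(scores)
total = purpose + articulation + instability
print(f"purpose share: {purpose / total:.2f}")
print(f"articulation share: {articulation / total:.2f}")
print(f"instability share: {instability / total:.2f}")
```

With a balanced design the three parts add up exactly to the population variance of all responses, so the shares can be read as fractions of total variability.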
Article 570
Title@2025-06-19 (4): A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Title: A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning | A Impliziert B: Schaltungsanalyse in LLMs für propositionelle logische Vernunft | A Implies B: 用于推定逻辑理由的LLMLM的电路分析 2411.04105v4 |
Authors (6): Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy
Due to the size and complexity of modern large language models (LLMs), it has proven challenging to uncover the underlying mechanisms that models use to solve reasoning problems. For instance, is their reasoning for a specific problem localized to certain parts of the network? Do they break down the reasoning problem into modular components that are then executed as sequential steps as we go deeper in the model? To better understand the reasoning capability of LLMs, we study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs’ reasoning processes. Then, we offer fine-grained insights into the functions of attention heads in different layers. We not only find a sparse circuit that computes the answer, but we decompose it into sub-circuits that have four distinct and modular uses. Finally, we reveal that three distinct models – Mistral-7B, Gemma-2-9B and Gemma-2-27B – contain analogous but not identical mechanisms.
Article 571
Title@2025-06-19 (4): Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
Title: Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement | Streaming nicht-autoregressives Modell für beschleunigte Umwandlung und Aussprache Verbesserung | 流速转换和发音改进非自动递减模式 2506.16580v1 |
Authors (4): Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel
We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves performance comparable to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
Article 572
Title@2025-06-19 (4): Advancing Harmful Content Detection in Organizational Research: Integrating Large Language Models with Elo Rating System
Title: Advancing Harmful Content Detection in Organizational Research: Integrating Large Language Models with Elo Rating System | Förderung schädlicher Inhaltserkennung in der Organisationsforschung: Integration großer Sprachmodelle mit Elo-Bewertungssystem | 在组织研究中推动有害内容的探测:将大语言模型与Elo评分系统相结合 2506.16575v1 |
Authors (2): Mustafa Akben, Aaron Satko
Large language models (LLMs) offer promising opportunities for organizational research. However, their built-in moderation systems can create problems when researchers try to analyze harmful content, often refusing to follow certain instructions or producing overly cautious responses that undermine the validity of the results. This is particularly problematic when analyzing organizational conflicts such as microaggressions or hate speech. This paper introduces an Elo rating-based method that significantly improves LLM performance for harmful content analysis. In two datasets, one focused on microaggression detection and the other on hate speech, we find that our method outperforms traditional LLM prompting techniques and conventional machine learning models on key measures such as accuracy, precision, and F1 scores. Advantages include better reliability when analyzing harmful content, fewer false positives, and greater scalability for large-scale datasets. This approach supports organizational applications, including detecting workplace harassment, assessing toxic communication, and fostering safer and more inclusive work environments.
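For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update applied to pairwise comparisons between texts. The abstract does not say how the comparisons are produced or aggregated, so everything specific here is an assumption: the K-factor of 32, the starting rating of 1000, the text identifiers, and the idea that each pair is a (more harmful, less harmful) judgment returned by an LLM.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the sum of the two ratings is preserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


# Hypothetical pairwise judgments: (text judged more harmful, text judged less harmful).
ratings = {"text_1": 1000.0, "text_2": 1000.0, "text_3": 1000.0}
comparisons = [("text_1", "text_2"), ("text_1", "text_3"), ("text_2", "text_3")]
for winner, loser in comparisons:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(name, round(rating, 1))
```

A relative ranking of this kind asks the model only which of two texts is worse, which may be easier to elicit than an absolute harmfulness label.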
Article 573
Title@2025-06-19 (4): Weight Factorization and Centralization for Continual Learning in Speech Recognition
Title: Weight Factorization and Centralization for Continual Learning in Speech Recognition | Gewichtsfaktorisierung und Zentralisierung für kontinuierliches Lernen in der Spracherkennung | 语音识别中持续学习的加权因素化和集中化 2506.16574v1 |
Authors (3): Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel
Modern neural-network-based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications that use foundation models and have no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language-agnostic condition likely leads to catastrophic forgetting, in which a seemingly insignificant disruption to the weights can severely harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases, factorization and centralization, which learn and merge knowledge, respectively. Our experiments on a sequence of varied code-switching datasets showed that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattering low-rank adapters.
Article 574
Title@2025-06-19 (4): Capturing Visualization Design Rationale
Title: Capturing Visualization Design Rationale | Capturing Visualization Design Rationale | 模拟可视化设计 2506.16571v1 |
Authors (5): Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.
Article 575
Title@2025-06-19 (4): MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Title: MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation | MultiFinBen: Ein multilingualer, multimodaler und problemorientierter Benchmark für die finanzielle LLM-Bewertung | MultiFinBen: 财务LLM评价的多种语言、多种模式和困难软件基准 2506.14028v2 |
Authors (44): Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks: PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
Article 576
Title@2025-06-19 (4): AutoPresent: Designing Structured Visuals from Scratch
Title: AutoPresent: Designing Structured Visuals from Scratch | AutoPresent: Designing Structured Visuals from Scratch | 自动提交: 设计来自 Scratch 的结构化视觉 2501.00912v2 |
Authors (11): Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell
Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, to measure similarity to a target slide, and (ii) reference-free, to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k pairs of instructions and code for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked to self-refine its own output, and find that this process improves the slide’s quality. We hope that our work will provide a basis for future work on generating structured visuals.
Article 577
Title@2025-06-19 (4): Automatic Speech Recognition Biases in Newcastle English: an Error Analysis
Title: Automatic Speech Recognition Biases in Newcastle English: an Error Analysis | Automatische Spracherkennung in Newcastle English: eine Fehleranalyse | Newcastle英语的自动语音识别比数:错误分析 2506.16558v1 |
Authors (3): Dana Serditova, Kevin Tang, Jochen Steffens
Automatic Speech Recognition (ASR) systems struggle with regional dialects due to biased training which favours mainstream varieties. While previous research has identified racial, age, and gender biases in ASR, regional bias remains underexamined. This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. A two-stage analysis was conducted: first, a manual error analysis on a subsample identified key phonological, lexical, and morphosyntactic errors behind ASR misrecognitions; second, a case study focused on the systematic analysis of ASR recognition of the regional pronouns “yous” and “wor”. Results show that ASR errors directly correlate with regional dialectal features, while social factors play a lesser role in ASR mismatches. We advocate for greater dialectal diversity in ASR training data and highlight the value of sociolinguistic analysis in diagnosing and addressing regional biases.
Article 578
Title@2025-06-19 (4): Revela: Dense Retriever Learning via Language Modeling
Title: Revela: Dense Retriever Learning via Language Modeling | Revela: Dense Retriever Lernen über Sprachmodellierung | Revela:通过语言建模进行密集检索学习 2506.16552v1 |
Authors (8): Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly and hard to obtain in specialized domains such as code, motivating growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones. At a comparable parameter scale, Revela outperforms the previous best method with absolute improvements of 5.2% (18.3% relative) and 5.6% (14.4% relative) on NDCG@10, respectively, underscoring its effectiveness. Performance increases with model size, highlighting both the scalability of our approach and its promise for self-supervised retriever learning.
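Since the results above are reported in NDCG@10, a short sketch of that metric may help. This uses the common linear-gain definition with a log2 rank discount; it assumes `relevances` lists the graded relevance of every judged document for a query in the order the retriever ranked them. Benchmark toolkits may differ in details such as using an exponential gain, so this is an illustration rather than the evaluation code behind the reported numbers.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical rankings: 1 marks a relevant document, 0 an irrelevant one.
perfect = ndcg_at_k([1, 0, 0])
second_place = ndcg_at_k([0, 1])
print(perfect, round(second_place, 4))
```

A single relevant document ranked second scores 1 / log2(3), about 0.63, which shows how quickly the discount penalizes lower ranks.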
Article 579
Title@2025-06-19 (4): Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
Title: Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements | Feintuning große Audio-Sprachen-Modelle mit LoRA für die präzise zeitliche Lokalisierung von langanhaltenden Expositionstherapieelementen | 与LORA一道精细设计大型音频语言模型,用于长期接触治疗元素的精确时间定位 2506.09707v2 |
Authors (7): Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements – identifying their start and stop times – directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases – therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) – are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
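To show how normalized boundary offsets relate to a mean absolute error in seconds, here is a minimal sketch. It assumes, as a guess the abstract does not confirm, that each predicted offset is a fraction in [0, 1] of a 30-second window; the function names and sample values are hypothetical.

```python
def offsets_to_seconds(normalized_offsets, window_seconds=30.0):
    """Convert offsets expressed as a fraction of the window into seconds."""
    return [offset * window_seconds for offset in normalized_offsets]


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired predictions and references."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical normalized boundary predictions and labels for two windows.
predicted_s = offsets_to_seconds([0.5, 0.2])
reference_s = offsets_to_seconds([0.4, 0.3])
mae_seconds = mean_absolute_error(predicted_s, reference_s)
print(round(mae_seconds, 2))
```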
Article 580
Title@2025-06-19 (4): Essential-Web v1.0: 24T tokens of organized web data
Title: Essential-Web v1.0: 24T tokens of organized web data | Essential-Web v1.0: 24T Token von organisierten Web-Daten | 基本Web v1.0: 24个有组织网络数据标记 2506.14111v2 |
Authors (24): Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
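As a rough illustration of what an SQL-style filter over taxonomy labels might look like, the sketch below builds a tiny in-memory SQLite table and selects documents by label. The table name `docs`, the columns `topic`, `quality`, and `text`, and the label values are all hypothetical; the real dataset's schema may differ.

```python
import sqlite3

# Hypothetical schema: column names and label values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, topic TEXT, quality INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [
        (1, "math", 4, "Proof of the triangle inequality"),
        (2, "math", 1, "buy cheap calculators"),
        (3, "medical", 5, "Overview of hypertension treatment"),
    ],
)

def select_subset(conn, topic, min_quality):
    """Return ids of documents with the given topic label and at least min_quality."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE topic = ? AND quality >= ? ORDER BY id",
        (topic, min_quality),
    )
    return [r[0] for r in rows]

math_ids = select_subset(conn, "math", 3)
```

Here the low-quality math document is dropped by the `quality >= ?` condition, which is the kind of curation a label-based filter makes possible without training a new classifier.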
Article 581
Title@2025-06-19 (4): xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Title: xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | xGen-MM (BLIP-3): Eine Familie offener großer multimodaler Modelle | xGen-MM(BLIP-3): “ 开放型大型多式联运模型大家庭 “ 2408.08872v3 |
Authors (33): Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base model and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our resulting LMMs demonstrate competitive performance among open-source LMMs with similar model sizes, with the ability to comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.
Article 582
Title@2025-06-19 (4): Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples
Title: Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples | Relic: Verallgemeinerung des Prämienmodells für Low-Resource-Indic-Sprachen mit wenigen scharfen Beispielen | Relic:加强低资源印度语言的奖赏示范性概括化,只有很少的热实例 2506.16502v1 |
Authors (7): Soumya Suvra Ghosal, Vaibhav Singh, Akash Ghosh, Soumyabrata Pal, Subhadip Baidya, Sriparna Saha, Dinesh Manocha
Reward models are essential for aligning large language models (LLMs) with human preferences. However, most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages, resulting in unreliable reward signals for low-resource Indic languages. Collecting large-scale, high-quality preference data for these languages is prohibitively expensive, making preference-based training approaches impractical. To address this challenge, we propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages. RELIC trains a retriever with a pairwise ranking objective to select in-context examples from auxiliary high-resource languages that most effectively highlight the distinction between preferred and less-preferred responses. Extensive experiments on three preference datasets (PKU-SafeRLHF, WebGPT, and HH-RLHF) using state-of-the-art open-source reward models demonstrate that RELIC significantly improves reward model accuracy for low-resource Indic languages, consistently outperforming existing example selection methods. For example, on Bodo, a low-resource Indic language, using a LLaMA-3.2-3B reward model, RELIC achieves 12.81% and 10.13% improvements in accuracy over zero-shot prompting and state-of-the-art example selection methods, respectively.
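The abstract does not spell out the pairwise ranking objective, so the following is only a minimal sketch of one common form, the logistic pairwise loss, under the assumption that the retriever assigns a scalar score to each candidate in-context example. The function name and arguments are hypothetical.

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss: small when the more useful example scores higher."""
    return math.log(1.0 + math.exp(-(score_better - score_worse)))

# The loss shrinks as the margin between the two scores grows in the right direction.
well_ranked = pairwise_ranking_loss(2.0, 0.0)
badly_ranked = pairwise_ranking_loss(0.0, 2.0)
```

Minimizing a loss of this shape pushes the retriever to score helpful examples above unhelpful ones; the loss actually used by RELIC may differ.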
Article 583
Title@2025-06-19 (4): SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Title: SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development | SWE-Dev: Bewertung und Schulung autonomer Feature-getriebener Software-Entwicklung | SWE-Dev: 评估和培训自主开发地物-驱动软件开发 2505.16975v2 |
Authors (9): Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enables a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/DorothyDUUU/SWE-Dev.
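The abstract does not say how Pass@3 is computed. A widely used choice is the unbiased pass@k estimator, which for n sampled solutions of which c pass the unit tests gives 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; SWE-Dev's own evaluation script may differ.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed all unit tests."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Four samples, one of which passes, evaluated at k = 2.
example = pass_at_k(4, 1, 2)
```

Averaging this value over all test instances yields the benchmark-level score.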
Article 584
Title@2025-06-19 (4): QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
Title: QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation | QG-SMS: Verbesserung der Testobjektanalyse durch Studentenmodellierung und Simulation | QG-SMS:通过学生建模和模拟加强测试物品分析 2503.05888v2 |
Authors (5): Bang Nguyen, Tingting Du, Mengxia Yu, Lawrence Angrave, Meng Jiang
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Models for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
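Two of the dimensions named above, item difficulty and item discrimination, have standard classical-test-theory definitions that can be sketched in a few lines. The response matrix below is hard-coded for illustration; in QG-SMS the responses would presumably come from simulated student profiles, and the exact statistics the authors use are not given in the abstract.

```python
def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly."""
    return sum(r[item] for r in responses) / len(responses)

def item_discrimination(responses, item, fraction=0.5):
    """Upper-group minus lower-group proportion correct, grouping by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(r[item] for r in upper) / n
    p_lower = sum(r[item] for r in lower) / n
    return p_upper - p_lower

# Rows are students, columns are items; 1 means a correct answer.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
```

An item that strong students answer correctly and weak students miss has discrimination near 1, while an item everyone answers the same way has discrimination near 0.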
Article 585
Title@2025-06-19 (4): Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection
Title: Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection | Auf dem Weg zu allgemeingültigen allgemeinen schädlichen Sprachdatensätzen für Implizite Hass-Spracherkennung | 争取建立通用通用通用有害言论数据集,用于隐含仇恨言论探测 2506.16476v1 |
Authors (4): Saad Almohaimeed, Saleh Almohaimeed, Damla Turgut, Ladislau Bölöni
Implicit hate speech has recently emerged as a critical challenge for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and are often influenced by annotators’ subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.
Article 586
Title@2025-06-19 (4): Do We Talk to Robots Like Therapists, and Do They Respond Accordingly? Language Alignment in AI Emotional Support
Title: Do We Talk to Robots Like Therapists, and Do They Respond Accordingly? Language Alignment in AI Emotional Support | Sprechen wir mit Robotern wie Therapeuten, und reagieren sie entsprechend? | 我们是否和像治疗师一样的机器人交谈,他们是否做出相应的回应? AI 情感支持中的语言对齐 2506.16473v1 |
Authors (3): Sophie Chiang, Guy Laban, Hatice Gunes
As conversational agents increasingly engage in emotionally supportive dialogue, it is important to understand how closely their interactions resemble those in traditional therapy settings. This study investigates whether the concerns shared with a robot align with those shared in human-to-human (H2H) therapy sessions, and whether robot responses semantically mirror those of human therapists. We analyzed two datasets: one of interactions between users and professional therapists (Hugging Face’s NLP Mental Health Conversations), and another involving supportive conversations with a social robot (QTrobot from LuxAI) powered by a large language model (LLM, GPT-3.5). Using sentence embeddings and K-means clustering, we assessed cross-agent thematic alignment by applying a distance-based cluster-fitting method that evaluates whether responses from one agent type map to clusters derived from the other, and validated it using Euclidean distances. Results showed that 90.88% of robot conversation disclosures could be mapped to clusters from the human therapy dataset, suggesting shared topical structure. For matched clusters, we compared the subjects as well as therapist and robot responses using Transformer, Word2Vec, and BERT embeddings, revealing strong semantic overlap in subjects’ disclosures in both datasets, as well as in the responses given to similar human disclosure themes across agent types (robot vs. human therapist). These findings highlight both the parallels and boundaries of robot-led support conversations and their potential for augmenting mental health interventions.
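The distance-based cluster-fitting step can be pictured roughly as follows: take centroids fitted on one dataset, then count how many embeddings from the other dataset fall close enough to some centroid. The threshold rule and the parameter `max_dist` are assumptions made for this sketch; the abstract does not state the exact mapping criterion.

```python
import math

def nearest_cluster(vec, centroids):
    """Return (index, distance) of the closest centroid by Euclidean distance."""
    dists = [math.dist(vec, c) for c in centroids]
    best = min(range(len(centroids)), key=dists.__getitem__)
    return best, dists[best]

def mapped_fraction(vectors, centroids, max_dist):
    """Share of vectors lying within max_dist of at least one centroid."""
    hits = sum(1 for v in vectors if nearest_cluster(v, centroids)[1] <= max_dist)
    return hits / len(vectors)

# Toy 2-D stand-ins for sentence embeddings and K-means centroids.
human_centroids = [(0.0, 0.0), (10.0, 10.0)]
robot_vectors = [(0.5, 0.0), (9.0, 10.0), (5.0, 5.0)]
share = mapped_fraction(robot_vectors, human_centroids, max_dist=2.0)
```

In this toy case two of the three vectors map to a human-derived cluster, which is the kind of quantity the reported 90.88% summarizes.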
Article 587
Title@2025-06-19 (4): Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
Title: Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Sonde vor Ihnen sprechen: Auf dem Weg zur Black-Box Verteidigung gegen Hintertür Unausrichtung für große Sprachmodelle | 在你发言前的探贝之前,你先谈:争取防止大语言模型的后门不匹配的黑箱防御 2506.16447v1 |
Authors (7): Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only be interacted with through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. Specifically, BEAT identifies whether an input is triggered by measuring the degree of distortion in the output distribution of the probe before and after concatenation with the input. Our method addresses the challenges of sample-dependent targets from an opposite perspective. It captures the impact of the trigger on the refusal signal (which is sample-independent) instead of sample-specific successful attack behaviors. It overcomes black-box access limitations by using multiple sampling to approximate the output distribution. Extensive experiments are conducted on various backdoor attacks and LLMs (including the closed-source GPT-3.5-turbo), verifying the effectiveness and efficiency of our defense. We also provide preliminary evidence that BEAT can effectively defend against popular jailbreak attacks, as they can be regarded as ‘natural backdoors’.
Article 588
Title@2025-06-19 (4): StoryWriter: A Multi-Agent Framework for Long Story Generation
Title: StoryWriter: A Multi-Agent Framework for Long Story Generation | StoryWriter: Ein Multi-Agenten-Framework für lange Story-Generationen | 故事文字:长代代多方行为者框架 2506.16445v1 |
Authors (7): Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) an outline agent, which generates event-based outlines containing rich event plots, characters, and event-event relationships; (2) a planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story; and (3) a writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, LongStory, which contains about 6,000 high-quality long stories with an average length of 8,000 words. We train Llama3.1-8B and GLM4-9B using supervised fine-tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrate advanced performance in long story generation.
Article 589
Title@2025-06-19 (4): REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
Title: REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing | REIS: Ein leistungsstarkes und energieeffizientes Retrieval-System mit In-Storage-Verarbeitung | REIS:具有在系统内处理的高效能和节能检索系统 2506.16444v1 |
Authors (10): Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
Article 590
Title@2025-06-19 (4): Quantifying artificial intelligence through algorithmic generalization
Title: Quantifying artificial intelligence through algorithmic generalization | Quantifizierung künstlicher Intelligenz durch algorithmische Verallgemeinerung | 通过算法一般化对人工智能进行量化 2411.05943v2 |
Authors (5): Takuya Ito, Murray Campbell, Lior Horesh, Tim Klinger, Parikshit Ram
The rapid development of artificial intelligence (AI) systems has created an urgent need for their scientific quantification. While their fluency across a variety of domains is impressive, AI systems fall short on tests requiring algorithmic reasoning – a glaring limitation given the necessity for interpretable and reliable technology. Despite a surge of reasoning benchmarks emerging from the academic community, no theoretical framework exists to quantify algorithmic reasoning in AI systems. Here, we adopt a framework from computational complexity theory to quantify algorithmic generalization using algebraic expressions: algebraic circuit complexity. Algebraic circuit complexity theory – the study of algebraic expressions as circuit models – is a natural framework to study the complexity of algorithmic computation. Algebraic circuit complexity enables the study of generalization by defining benchmarks in terms of the computational requirements to solve a problem. Moreover, algebraic circuits are generic mathematical objects; an arbitrarily large number of samples can be generated for a specified circuit, making it an ideal experimental sandbox for the data-hungry models that are used today. In this Perspective, we adopt tools from algebraic circuit complexity, apply them to formalize a science of algorithmic generalization, and address key challenges for its successful application to AI science.
Article 591
Title@2025-06-19 (4): ALTA: Compiler-Based Analysis of Transformers
Title: ALTA: Compiler-Based Analysis of Transformers | ALTA: Compiler-basierte Analyse von Transformatoren | ALTA:以汇编者为基础对变形器的分析 2410.18077v2 |
Authors (6): Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova
We propose a new programming language called ALTA and a compiler that can map ALTA programs to Transformer weights. ALTA is inspired by RASP, a language proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler from RASP programs to Transformer weights. ALTA complements and extends this prior work, offering the ability to express loops and to compile programs to Universal Transformers, among other advantages. ALTA allows us to constructively show how Transformers can represent length-invariant algorithms for computing parity and addition, as well as a solution to the SCAN benchmark of compositional generalization tasks, without requiring intermediate scratchpad decoding steps. We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We make the ALTA framework – language specification, symbolic interpreter, and weight compiler – available to the community to enable further applications and insights.
Article 592
Title@2025-06-19 (4): Unpacking Generative AI in Education: Computational Modeling of Teacher and Student Perspectives in Social Media Discourse
Title: Unpacking Generative AI in Education: Computational Modeling of Teacher and Student Perspectives in Social Media Discourse | Entpacken generativer KI in der Bildung: Computational Modeling von Lehrer- und Studentenperspektiven im Social Media Diskurs | 《教育:在社会媒体讨论中教师和学生观点的计算模型》 2506.16412v1 |
Authors (7): Paulina DeVito, Akhil Vallala, Sean Mcmahon, Yaroslav Hinda, Benjamin Thaw, Hanqi Zhuang, Hari Kalva
Generative AI (GAI) technologies are quickly reshaping the educational landscape. As adoption accelerates, understanding how students and educators perceive these tools is essential. This study presents one of the most comprehensive analyses to date of stakeholder discourse dynamics on GAI in education using social media data. Our dataset includes 1,199 Reddit posts and 13,959 corresponding top-level comments. We apply sentiment analysis, topic modeling, and author classification. To support this, we propose and validate a modular framework that leverages prompt-based large language models (LLMs) for analysis of online social discourse, and we evaluate this framework against classical natural language processing (NLP) models. Our GPT-4o pipeline consistently outperforms prior approaches across all tasks. For example, it achieved 90.6% accuracy in sentiment analysis against gold-standard human annotations. Topic extraction uncovered 12 latent topics in the public discourse with varying sentiment and author distributions. Teachers and students convey optimism about GAI’s potential for personalized learning and productivity in higher education. However, key differences emerged: students often voice distress over false accusations of cheating by AI detectors, while teachers generally express concern about job security, academic integrity, and institutional pressures to adopt GAI tools. These contrasting perspectives highlight the tension between innovation and oversight in GAI-enabled learning environments. Our findings suggest a need for clearer institutional policies, more transparent GAI integration practices, and support mechanisms for both educators and students. More broadly, this study demonstrates the potential of LLM-based frameworks for modeling stakeholder discourse within online communities.
Article 593
Title@2025-06-19 (4): When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Title: When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework | Wann funktioniert Trennen und Erobern für den langen Kontext LLM? Ein Lärmzersetzungsrahmen | 何时分化和征服工作为长期LLM服务? 噪音分解框架 2506.16411v1 |
Authors (8): Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang
We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring superlinear model noise growth with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled framework for understanding long-context processing, and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
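The chunk-then-aggregate pattern analyzed here can be sketched generically. In the sketch below, `worker` stands in for a per-chunk model call and `aggregator` for the step that merges partial results; both names, and the keyword-counting task, are illustrative choices rather than anything taken from the paper.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def divide_and_conquer(tokens, size, worker, aggregator):
    """Apply `worker` to each chunk independently, then merge with `aggregator`."""
    return aggregator([worker(c) for c in chunk(tokens, size)])

# A task with no cross-chunk dependence: counting occurrences of a token.
tokens = "a b a c a b".split()
total = divide_and_conquer(tokens, 4, lambda c: c.count("a"), sum)
```

Counting is a case where task noise and aggregator noise are both zero, so chunking loses nothing; tasks whose answer depends on information spread across chunks are where the framework predicts chunking can hurt.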
Article 594
Title@2025-06-19 (4): On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Title: On Path to Multimodal Historical Reasoning: HistBench and HistAgent | Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent | 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v3 |
Authors (99): Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Shu Zhang, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Yao Xiao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI’s capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources, to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
Article 595
Title@2025-06-19 (4): IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Title: IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks | IS-Bench: Bewertung der interaktiven Sicherheit von VLM-getriebenen Körpermitteln bei täglichen Haushaltsaufgaben | IS-Bench:评估每日家务任务中VLM-Driven 充装代理人的互动安全 2506.16402v1 |
Authors (8): Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
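A process-oriented check of this kind can be sketched as an ordering test over an agent's action trace. The action names and the function below are hypothetical, and the sketch assumes each action appears at most once in the trace; IS-Bench's actual evaluator is not described in the abstract.

```python
def mitigation_order_ok(trace, risky_step, mitigation, when="before"):
    """Check that `mitigation` occurs before (or after) `risky_step` in the trace.

    If the risky step never happens there is nothing to violate; if it happens
    without the mitigation, the check fails.
    """
    if risky_step not in trace:
        return True
    if mitigation not in trace:
        return False
    risky_idx, mitigation_idx = trace.index(risky_step), trace.index(mitigation)
    return mitigation_idx < risky_idx if when == "before" else mitigation_idx > risky_idx

# Hypothetical household traces.
safe = mitigation_order_ok(["put_on_gloves", "pick_up_knife"], "pick_up_knife", "put_on_gloves")
unsafe = mitigation_order_ok(["pick_up_knife", "put_on_gloves"], "pick_up_knife", "put_on_gloves")
```

Unlike a check on the final state alone, this flags a trace in which the task succeeds but an unsafe intermediate step occurred.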
Article 596
Title@2025-06-19 (4): NepaliGPT: A Generative Language Model for the Nepali Language
Title: NepaliGPT: A Generative Language Model for the Nepali Language | NepaliGPT: Ein generatives Sprachmodell für die nepalesische Sprache | 尼泊尔语:尼泊尔语创作语言模式 2506.16399v1 |
Authors (6): Shushanta Pudasaini, Aman Shakya, Siddhartha Shrestha, Sahil Bhatta, Sunil Thapa, Sushmita Palikhe
Since the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity, and thousands of LLM variants have been released. However, there is no generative language model for the Nepali language, so downstream tasks, including fine-tuning, have not yet been explored. To fill this research gap in the Nepali NLP space, this research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset, comprising 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41%.
Article 597
Title@2025-06-19 (4): OJBench: A Competition Level Code Benchmark For Large Language Models
Title: OJBench: A Competition Level Code Benchmark For Large Language Models | OJBench: Ein Benchmark für Wettbewerbsebenencodes für große Sprachmodelle | OJBench:大语言模式竞争法基准 2506.16395v1 |
Authors (12): Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu
Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models’ reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.
Article 598
Title@2025-06-19 (4): From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling
Title: From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling | Von der LLM-Anation zum LLM-Orchester: Koordinieren kleiner Modelle für die Datenkennzeichnung | 从LLM认证到LLM主机:数据标签小型模型协调 2506.16393v1 |
Authors (6): Yao Lu, Zhaiyuan Ji, Jiawei Du, Yu Shanqing, Qi Xuan, Tianyi Zhou
Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design a fully automatic annotation framework, AutoAnnotator, based on it. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
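The lower-layer multi-model voting, with low-agreement samples routed to a second review, might look roughly like the sketch below. The function name, the agreement threshold, and the return shape are hypothetical; the abstract does not specify how AutoAnnotator decides that a sample is difficult.

```python
from collections import Counter

def vote(labels, min_agreement=0.6):
    """Majority vote over per-model labels.

    Returns (winning_label, agreement_ratio, needs_review).
    Assumption: a sample counts as 'difficult' when the winning label's
    share of votes falls below `min_agreement`.
    """
    if not labels:
        raise ValueError("need at least one label")
    label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return label, agreement, agreement < min_agreement
```

For example, `vote(["toxic", "toxic", "safe"])` settles on `"toxic"` without review, while a three-way split is flagged for the meta-controller.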
Article 599
Title@2025-06-19 (4): RiOT: Efficient Prompt Refinement with Residual Optimization Tree
Title: RiOT: Efficient Prompt Refinement with Residual Optimization Tree | RiOT: Effiziente Prompt-Verfeinerung mit Residual Optimization Tree | RiOT: 高效快速提炼剩余优化树 2506.16389v1 |
Authors (6): Chenyi Zhou, Zhengyan Shi, Yuan Yao, Lei Liang, Huajun Chen, Qiang Zhang
Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates a text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
Article 600
Title@2025-06-19 (4): Large Language Models in Argument Mining: A Survey
Title: Large Language Models in Argument Mining: A Survey | Große Sprachmodelle im Argumentbergbau: Eine Umfrage | 争议采矿大语言模型:调查 2506.16383v1 |
Authors (5): Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
Article 601
Title@2025-06-19 (4): InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Title: InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | InstructTTSEval: Benchmarking komplexe natursprachliche Anleitung im Anschluss an Text-zu-Sprach-Systeme | 指令TTSEval: 以文字到语音系统为基准的复杂自然语言教学 2506.16381v1 |
Authors (9): Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
In modern speech synthesis, paralinguistic information, such as a speaker’s vocal timbre, emotional state, and dynamic prosody, plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or an inserted speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.
Article 602
Title@2025-06-19 (4): Can structural correspondences ground real world representational content in Large Language Models?
Title: Can structural correspondences ground real world representational content in Large Language Models? | Können Strukturkorrespondenzen reale Repräsentationsinhalte in großen Sprachmodellen begründen? | 结构通信能否在大语言模型中建立真实的世界代表性内容? 2506.16370v1 |
Authors (1): Iwan Williams
Large Language Models (LLMs) such as GPT-4 produce compelling responses to a wide range of prompts. But their representational capacities are uncertain. Many LLMs have no direct contact with extra-linguistic reality: their inputs, outputs and training data consist solely of text, raising the questions (1) can LLMs represent anything and (2) if so, what? In this paper, I explore what it would take to answer these questions according to a structural-correspondence based account of representation, and make an initial survey of this evidence. I argue that the mere existence of structural correspondences between LLMs and worldly entities is insufficient to ground representation of those entities. However, if these structural correspondences play an appropriate role - they are exploited in a way that explains successful task performance - then they could ground real world contents. This requires overcoming a challenge: the text-boundedness of LLMs appears, on the face of it, to prevent them engaging in the right sorts of tasks.
Article 603
Title@2025-06-19 (4): DISCIE – Discriminative Closed Information Extraction
Title: DISCIE – Discriminative Closed Information Extraction | DISCIE – Diskriminative Closed Information Extraction | DISCIE - 质疑性封闭信息提取 2506.16348v1 |
Authors (2): Cedric Möller, Ricardo Usbeck
This paper introduces a novel method for closed information extraction. The method employs a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy, particularly benefiting long-tail relations. Notably, this method demonstrates superior performance compared to state-of-the-art end-to-end generative models. This is especially evident for the problem of large-scale closed information extraction where we are confronted with millions of entities and hundreds of relations. Furthermore, we emphasize the efficiency aspect by leveraging smaller models. In particular, the integration of type-information proves instrumental in achieving performance levels on par with or surpassing those of a larger generative model. This advancement holds promise for more accurate and efficient information extraction techniques.
Article 604
Title@2025-06-19 (4): Analyzing the Influence of Knowledge Graph Information on Relation Extraction
Title: Analyzing the Influence of Knowledge Graph Information on Relation Extraction | Analyse des Einflusses von Wissensgrapheninformationen auf die Beziehungsextraktion | 分析知识图表信息对采掘关系的影响 2506.16343v1 |
Authors (2): Cedric Möller, Ricardo Usbeck
We examine the impact of incorporating knowledge graph information on the performance of relation extraction models across a range of datasets. Our hypothesis is that the positions of entities within a knowledge graph provide important insights for relation extraction tasks. We conduct experiments on multiple datasets, each varying in the number of relations, training examples, and underlying knowledge graphs. Our results demonstrate that integrating knowledge graph information significantly enhances performance, especially when dealing with an imbalance in the number of training examples for each relation. We evaluate the contribution of knowledge graph-based features by combining established relation extraction methods with graph-aware Neural Bellman-Ford networks. These features are tested in both supervised and zero-shot settings, demonstrating consistent performance improvements across various datasets.
Article 605
Title@2025-06-19 (4): Generalizability of Media Frames: Corpus creation and analysis across countries
Title: Generalizability of Media Frames: Corpus creation and analysis across countries | Generalisierbarkeit von Medienrahmen: Corpus-Erstellung und -Analyse über Länder hinweg | 媒体框架的通用性:公司创建和各国的分析 2506.16337v1 |
Authors (4): Agnese Daffara, Sourabh Dattawad, Sebastian Padó, Tanise Ceron
Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people’s opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce FrameNews-PT, a dataset of Brazilian Portuguese news articles covering political and economic news and annotate it within the MFC framework. Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data. Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general ‘fall-back’ frames. We conclude that cross-cultural frame use requires careful consideration.
Article 606
Title@2025-06-19 (4): Explainable Rule Application via Structured Prompting: A Neural-Symbolic Approach
Title: Explainable Rule Application via Structured Prompting: A Neural-Symbolic Approach | Erklärbare Regel-Anwendung über strukturierte Prompting: Ein neural-symbolischer Ansatz | 通过结构化推动可解释的规则应用:神经-循环方法 2506.16335v1 |
Authors (2): Albert Sadowski, Jarosław A. Chudziak
Large Language Models (LLMs) excel in complex reasoning tasks but struggle with consistent rule application, exception handling, and explainability, particularly in domains like legal analysis that require both natural language understanding and precise logical inference. This paper introduces a structured prompting framework that decomposes reasoning into three verifiable steps: entity identification, property extraction, and symbolic rule application. By integrating neural and symbolic approaches, our method leverages LLMs’ interpretive flexibility while ensuring logical consistency through formal verification. The framework externalizes task definitions, enabling domain experts to refine logical structures without altering the architecture. Evaluated on the LegalBench hearsay determination task, our approach significantly outperformed baselines, with OpenAI o-family models showing substantial improvements - o1 achieving an F1 score of 0.929 and o3-mini reaching 0.867 using structured decomposition with complementary predicates, compared to their few-shot baselines of 0.714 and 0.74 respectively. This hybrid neural-symbolic system offers a promising pathway for transparent and consistent rule-based reasoning, suggesting potential for explainable AI applications in structured legal reasoning tasks.
Article 607
Title@2025-06-19 (4): PL-Guard: Benchmarking Language Model Safety for Polish
Title: PL-Guard: Benchmarking Language Model Safety for Polish | PL-Guard: Benchmarking Sprachmodellsicherheit für Polnisch | PL-Guard:波兰语言安全模式基准 2506.16322v1 |
Authors (4): Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
Article 608
Title@2025-06-19 (4): AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Title: AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation | AlignDistil: Token-Level-Sprachmodell Alignment als Adaptive Policy Destillation | Aligndistil: 作为适应性政策蒸馏的调整级语言模式模型对齐 2503.02832v2 |
Authors (6): Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
In modern large language models (LLMs), alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. Ignoring token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token-adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
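One way to picture a teacher distribution that linearly combines DPO and reference logits is the sketch below. The specific form `ref + alpha * (dpo - ref)` and the coefficient name `alpha` are assumptions for illustration only; the abstract says the combination is linear and that the extrapolation is token-adaptive, but gives no formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_distribution(dpo_logits, ref_logits, alpha=1.0):
    """Hypothetical teacher: softmax(ref + alpha * (dpo - ref)).

    alpha = 0 recovers the reference model, alpha = 1 the DPO model,
    and alpha > 1 extrapolates beyond the DPO model.
    """
    combined = [r + alpha * (d - r) for d, r in zip(dpo_logits, ref_logits)]
    return softmax(combined)
```

In a token-adaptive variant, `alpha` would be chosen separately at each position instead of being fixed for the whole response.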
Article 609
Title@2025-06-19 (4): LLM-Guided Indoor Navigation with Multimodal Map Understanding
Title: LLM-Guided Indoor Navigation with Multimodal Map Understanding | LLM-geführte Indoor-Navigation mit multimodalem Kartenverständnis | 具有多式地图理解的LLM-引导式室内导航 2503.11702v4 |
Authors (5): Alberto Coffrini, Paolo Barsocchi, Francesco Furfari, Antonino Crivello, Alessio Ferrari
Indoor navigation presents unique challenges due to complex layouts and the unavailability of GNSS signals. Existing solutions often struggle with contextual adaptation, and typically require dedicated hardware. In this work, we explore the potential of a Large Language Model (LLM), i.e., ChatGPT, to generate natural, context-aware navigation instructions from indoor map images. We design and evaluate test cases across different real-world environments, analyzing the effectiveness of LLMs in interpreting spatial layouts, handling user constraints, and planning efficient routes. Our findings demonstrate the potential of LLMs for supporting personalized indoor navigation, with an average of 86.59% correct indications and a maximum of 97.14%. The proposed system achieves high accuracy and reasoning performance. These results have key implications for AI-driven navigation and assistive technologies.
Article 610
Title@2025-06-19 (4): Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information
Title: Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information | Automatisiertes Sprechen fördern Hebelisierung von facettenreicher Relevanz und Grammatik-Informationen | 利用多方相关性和语法信息 2506.16285v1 |
Authors (5): Hao-Chien Lu, Jhen-Ke Lin, Hong-Yun Lin, Chung-Chun Wang, Berlin Chen
Current automated speaking assessment (ASA) systems for use in multi-aspect evaluations often fail to make full use of content relevance, overlooking image or exemplar cues, and employ superficial grammar analysis that lacks detailed error types. This paper ameliorates these deficiencies by introducing two novel enhancements to construct a hybrid scoring model. First, a multifaceted relevance module integrates the question and the associated image content, exemplar, and spoken response of an L2 speaker for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are derived using advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. Experiments and ablation studies demonstrate that these components significantly improve the evaluation of content relevance, language use, and overall ASA performance, highlighting the benefits of using richer, more nuanced feature sets for holistic speaking assessment.
Article 611
Title@2025-06-19 (4): Uncertainty Quantification in Retrieval Augmented Question Answering
Title: Uncertainty Quantification in Retrieval Augmented Question Answering | Unsicherheit Quantifizierung in Retrieval Augmented Question Answering | 检索增强回答问题时的不确定性定量 2502.18108v2 |
Authors (2): Laura Perez-Beltrachini, Mirella Lapata
Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful for answering correctly. In this work, we propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information-theoretic metrics can predict answer correctness to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
Article 612
Title@2025-06-19 (4): End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Title: End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data | End-to-End-Sprachübersetzung für Low-Resource-Sprachen mit schwach beschrifteten Daten | 使用微弱标签数据翻译低资源语言端对端语音 2506.16251v1 |
Authors (6): Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala
The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
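Bitext mining with sentence encoders, as mentioned above, amounts to matching source and target sentences whose embeddings are close. A minimal sketch follows, assuming the embeddings are already computed and using a plain cosine-similarity threshold; the function names and the threshold value are hypothetical, and the paper's actual scoring rule may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source sentence, keep its best target if similarity >= threshold.

    Returns a list of (src_index, tgt_index, score) tuples.
    """
    pairs = []
    if not tgt_vecs:
        return pairs
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs
```

Raising `threshold` yields fewer but cleaner pairs, and lowering it yields more but noisier ones, which is the quality-versus-quantity trade-off the abstract investigates.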
Article 613
Title@2025-06-19 (4): Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports
Title: Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports | Vergleichende Analyse der abstrakten Zusammenfassungsmodelle für klinische Radiologieberichte | 临床放射学报告摘要摘要摘要模型比较分析 2506.16247v1 |
Authors (4): Anindita Bhattacharya, Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay
The findings section of a radiology report is often detailed and lengthy, whereas the impression section is comparatively more compact and captures key diagnostic conclusions. This research explores the use of advanced abstractive summarization models to generate the concise impression from the findings section of a radiology report. We have used the publicly available MIMIC-CXR dataset. A comparative analysis is conducted on leading pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism. To ensure a thorough assessment, multiple evaluation metrics are employed, including ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. By analyzing the performance of these models, this study identifies their respective strengths and limitations in the summarization of medical text. The findings of this paper provide helpful information for medical professionals who need automated summarization solutions in the healthcare sector.
Article 614
Title@2025-06-19 (4): BEADs: Bias Evaluation Across Domains
Title: BEADs: Bias Evaluation Across Domains | BEADs: Bias-Evaluierung über Domains hinweg | BEADs: 跨领域偏见评价 2406.04220v5 |
Authors (3): Shaina Raza, Mizanur Rahman, Michael R. Zhang
Recent advancements in large language models (LLMs) have significantly improved natural language processing (NLP) applications. However, these models often inherit biases from their training data. While several datasets exist for bias detection, most are limited to one or two NLP tasks, typically classification or evaluation, and lack comprehensive coverage across a broader range of tasks. To address this gap, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide range of NLP tasks, including text classification, token classification, bias quantification, and benign language generation. A key contribution of this work is the gold-standard annotation provided by GPT-4 for scalability, with expert verification to ensure high reliability. BEADs can be used for both fine-tuning models (for classification and generation tasks) and evaluating LLM behavior. Our findings show that BEADs effectively surfaces various biases during model fine-tuning and helps reduce biases in language generation tasks while maintaining output quality. The dataset also highlights prevalent demographic biases in LLMs during evaluation. We release BEADs as a practical resource for detecting and mitigating bias across domains, supporting the development of responsible AI systems. Project: https://vectorinstitute.github.io/BEAD/ Data: https://huggingface.co/datasets/shainar/BEAD
Article 615
Title@2025-06-19 (4): Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts
Title: Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts | Multi-Preference-Optimierung: Verallgemeinern von DPO über Set-Level-Kontrast | 多优先优化:通过定点对比度普及残疾人组织 2412.04628v4 |
Authors (6): Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan
Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses per prompt, which are scored by a reward model to guide learning. In this setting, we propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. To further enhance learning, MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward, effectively inducing a self-paced curriculum. We theoretically prove that MPO reduces alignment bias at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{n}}\right)$ with respect to the number of responses per query. Empirically, MPO achieves state-of-the-art performance on the UltraFeedback benchmark and yields up to roughly 17.5% improvement over the state-of-the-art baseline in length-controlled win rate on AlpacaEval2, establishing a new baseline for preference-based alignment.
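Deviation-based weighting can be illustrated with a small sketch in which each response's weight is proportional to how far its reward lies from the mean. The normalization to sum to one and the uniform fallback for identical rewards are assumptions; the abstract does not give MPO's exact weighting formula.

```python
def deviation_weights(rewards):
    """Weight each response by |reward - mean reward|, normalized to sum to 1.

    Assumption: when all rewards are identical, fall back to uniform weights.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / len(rewards)
    deviations = [abs(r - mean) for r in rewards]
    total = sum(deviations)
    if total == 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [d / total for d in deviations]
```

With rewards `[0.0, 1.0, 2.0, 5.0]` the mean is 2.0, so the response scored 5.0 receives the largest weight and the one sitting exactly at the mean receives none.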
Article 616
Title@2025-06-19 (4): Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models
Title: Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models | Bewertung und Abmilderung medizinischer Kenntnisse Drift und Konflikte in großen Sprachmodellen | 评估和减少大语言模式中的医学知识疏漏和冲突 2505.07968v2 |
Authors (7): Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Lucas A. Salas, Jiang Gui
Large Language Models (LLMs) have great potential in the field of health care, yet they face great challenges in adapting to rapidly evolving medical knowledge. This can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios showed that the models had difficulty rejecting outdated recommendations and frequently endorsed conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice. The dataset is available at https://huggingface.co/datasets/RDBH/DriftMed.
Article 617
Title@2025-06-19 (4): Learning Dynamics in Continual Pre-Training for Large Language Models
Title: Learning Dynamics in Continual Pre-Training for Large Language Models | Dynamisches Lernen im kontinuierlichen Pre-Training für große Sprachmodelle | 大语言模式持续培训前培训中的学习动态 2505.07796v2 |
Authors (5): Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, and replay ratio. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
Article 618
Title@2025-06-19 (4): Web(er) of Hate: A Survey on How Hate Speech Is Typed
Title: Web(er) of Hate: A Survey on How Hate Speech Is Typed | Web(er) of Hate: Eine Umfrage über die Art und Weise, wie Hate Speech eingegeben wird | Web(er) “ 仇恨:关于仇恨言论如何打字的调查 “ 2506.16190v1 |
Authors (3): Luna Wang, Andrew Caines, Alice Hutchings
The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber’s notion of ideal types, we argue for a reflexive approach in dataset creation, urging researchers to acknowledge their own value judgments during dataset construction, fostering transparency and methodological rigour.
Article 619
Title@2025-06-19 (4): JETHICS: Japanese Ethics Understanding Evaluation Dataset
Title: JETHICS: Japanese Ethics Understanding Evaluation Dataset | JETHICS: Japanische Ethik verstehen Evaluierungsdatensatz | JETICS:日本道德理解评价数据集 2506.16187v1 |
Authors (2): Masashi Takeshita, Rafal Rzepka
In this work, we propose JETHICS, a Japanese dataset for evaluating ethics understanding of AI models. JETHICS contains 78K examples and is built by following the construction methods of the existing English ETHICS dataset. It includes four categories based on normative theories and concepts from ethics and political philosophy, and one representing commonsense morality. Our evaluation experiments on non-proprietary large language models (LLMs) and on GPT-4o reveal that even GPT-4o achieves only an average score of about 0.7, while the best-performing Japanese LLM attains around 0.5, indicating considerable room for improvement in current LLMs.
Article 620
Title@2025-06-19 (4): AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation
Title: AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation | AUTOLAW: Verbesserung der rechtlichen Compliance in großen Sprachmodellen durch Fallrechtgenerierung und Jury-inspirierte Beratung | AUTOAW:通过判例法的产生和陪审团的启发审议,加强大语言模式中的法律合规性 2505.14015v2 |
Authors (3): Tai D. Nguyen, Long H. Pham, Jun Sun
The rapid advancement of domain-specific large language models (LLMs) in fields like law necessitates frameworks that account for nuanced regional legal distinctions, which are critical for ensuring compliance and trustworthiness. Existing legal evaluation benchmarks often lack adaptability and fail to address diverse local contexts, limiting their utility in dynamically evolving regulatory landscapes. To address these gaps, we propose AutoLaw, a novel violation detection framework that combines adversarial data generation with a jury-inspired deliberation process to enhance legal compliance of LLMs. Unlike static approaches, AutoLaw dynamically synthesizes case law to reflect local regulations and employs a pool of LLM-based “jurors” to simulate judicial decision-making. Jurors are ranked and selected based on synthesized legal expertise, enabling a deliberation process that minimizes bias and improves detection accuracy. Evaluations across three benchmarks: Law-SG, Case-SG (legality), and Unfair-TOS (policy), demonstrate AutoLaw’s effectiveness: adversarial data generation improves LLM discrimination, while the jury-based voting strategy significantly boosts violation detection rates. Our results highlight the framework’s ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments, offering a scalable solution for evaluating and enhancing LLMs in legally sensitive applications.
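The juror selection and voting step described in this abstract can be pictured with a minimal sketch. It is not AutoLaw's implementation: the function name `jury_verdict`, the `(expertise_score, vote)` representation, the panel size `k`, and the simple majority rule are all assumptions made here for illustration.

```python
from collections import Counter

def jury_verdict(jurors, k=3):
    """Pick the k highest-ranked jurors and return their majority vote.

    jurors: list of (expertise_score, vote) pairs, where vote is a label such
    as "violation" or "compliant" (hypothetical labels). With two labels, an
    odd k avoids ties.
    """
    # Rank by the (assumed) expertise score, highest first, and keep the top k.
    selected = sorted(jurors, key=lambda j: j[0], reverse=True)[:k]
    votes = Counter(vote for _, vote in selected)
    return votes.most_common(1)[0][0]

# Hypothetical pool of five LLM "jurors" with made-up scores and votes.
pool = [
    (0.9, "violation"),
    (0.2, "compliant"),
    (0.7, "compliant"),
    (0.8, "violation"),
    (0.1, "compliant"),
]
verdict = jury_verdict(pool, k=3)
```

In this toy pool, an unranked vote over all five jurors would favour "compliant", while the ranked panel of three returns "violation", which illustrates why the ranking step can change the outcome.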
Article 621
Title@2025-06-19 (4): SGIC: A Self-Guided Iterative Calibration Framework for RAG
Title: SGIC: A Self-Guided Iterative Calibration Framework for RAG | SGIC: Ein selbstgesteuerter iterativer Kalibrierrahmen für RAG | SAGC: RAG自行指导的迭代校准框架 2506.16172v1 |
Authors (5): Guanhua Chen, Yutong Yao, Lidia S. Chao, Xuebo Liu, Derek F. Wong
Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, numerous methodologies frequently neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present SGIC, a new Self-Guided Iterative Calibration framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.
Article 622
Title@2025-06-19 (4): DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
Title: DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products | DeltaProdukt: Verbesserung der State-Tracking in linearen RNNs über Haushaltsprodukte | DeltaProduction:通过家用产品改进国家通过家用产品对Linear RNNNs的跟踪 2502.10297v6 |
Authors (6): Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, Riccardo Grazzi
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. Diagonal matrices, used in models such as Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To address this, recent architectures such as DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, which allows simultaneous token and channel mixing, improving associative recall and, as recently shown, state-tracking when allowing negative eigenvalues in the state-transition matrices. Building on the interpretation of DeltaNet’s recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency. We provide a detailed theoretical characterization of the state-tracking capability of DeltaProduct in finite precision, showing how it improves by increasing $n_h$. Our extensive experiments demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling, while also showing significantly improved length extrapolation capabilities.
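To make the "product of generalized Householder transformations" concrete, here is a minimal pure-Python sketch. It is not the authors' code: the helper names, the toy keys, and the choice of beta = 2 are assumptions used only to show the matrix structure, with each factor taken as I - beta * k k^T for a unit-norm key k.

```python
import math

def generalized_householder(k, beta):
    """Return I - beta * u u^T, where u is the key k scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in k))
    u = [x / norm for x in k]
    d = len(u)
    return [[(1.0 if i == j else 0.0) - beta * u[i] * u[j] for j in range(d)]
            for i in range(d)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transition_matrix(keys, betas):
    """Multiply n_h generalized Householder factors, with n_h = len(keys)."""
    d = len(keys[0])
    a = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for k, beta in zip(keys, betas):
        a = matmul(a, generalized_householder(k, beta))
    return a

# One factor with beta = 2 is an exact reflection (eigenvalue -1 along the key).
reflect_x = transition_matrix([[1.0, 0.0]], [2.0])
# Two reflections compose into a 90-degree rotation.
rotation = transition_matrix([[1.0, 0.0], [1.0, 1.0]], [2.0, 2.0])
```

A single factor is a symmetric matrix and therefore has only real eigenvalues, so it cannot represent this rotation. The product of two factors can, which gives a small picture of why taking more steps per token increases expressivity.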
Article 623
Title@2025-06-19 (4): Under the Shadow of Babel: How Language Shapes Reasoning in LLMs
Title: Under the Shadow of Babel: How Language Shapes Reasoning in LLMs | Unter dem Schatten von Babel: Wie sich Sprache in LLMs begründet | Babel的阴影之下:LLMM中语言形状如何解释 2506.16151v1 |
Authors (7): Chenxi Wang, Yixuan Zhang, Lang Gao, Zixiang Xu, Zirui Song, Yanbo Wang, Xiuying Chen
Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.
Article 624
Title@2025-06-19 (4): PRISON: Unmasking the Criminal Potential of Large Language Models
Title: PRISON: Unmasking the Criminal Potential of Large Language Models | PRISON: Entlarvung des kriminellen Potenzials großer Sprachmodelle | 释放大语言模式犯罪潜力 2506.16150v1 |
Authors (6): Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs’ criminal potential across five dimensions: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films, we evaluate both criminal potential and anti-crime ability of LLMs via role-play. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 41% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
Article 625
Title@2025-06-19 (4): Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems
Title: Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Beyond Self-Talk: Eine kommunikationszentrische Untersuchung von LLM-basierten Multiagentensystemen | 超越自言自语:以LLM为基础的多种机构系统的通信中心调查 2502.14321v2 |
Authors (9): Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, Xiaoming Zhang
Large language model-based multi-agent systems have recently gained significant attention due to their potential for complex, collaborative, and intelligent problem-solving capabilities. Existing surveys typically categorize LLM-based multi-agent systems (LLM-MAS) according to their application domains or architectures, overlooking the central role of communication in coordinating agent behaviors and interactions. To address this gap, this paper presents a comprehensive survey of LLM-MAS from a communication-centric perspective. Specifically, we propose a structured framework that integrates system-level communication (architecture, goals, and protocols) with system internal communication (strategies, paradigms, objects, and content), enabling a detailed exploration of how agents interact, negotiate, and achieve collective intelligence. Through an extensive analysis of recent literature, we identify key components in multiple dimensions and summarize their strengths and limitations. In addition, we highlight current challenges, including communication efficiency, security vulnerabilities, inadequate benchmarking, and scalability issues, and outline promising future research directions. This review aims to help researchers and practitioners gain a clear understanding of the communication mechanisms in LLM-MAS, thereby facilitating the design and deployment of robust, scalable, and secure multi-agent systems.
Article 626
Title@2025-06-19 (4): GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Title: GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | GRPO-CARE: Konsequentitäts-Bewusst-Verstärkungs-Lernen für multimodale Vernunft | GROPO-CARE: 统一软件强化学习,用于多模式理由 2506.16141v1 |
Authors (7): Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model’s reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
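The two-tiered reward can be illustrated with a rough sketch under stated assumptions. This is not GRPO-CARE's actual formula: the function name, the bonus size of 0.5, and the rule "bonus only for correct samples whose reference-model likelihood exceeds the group mean" are hypothetical choices that merely mirror the description above.

```python
def two_tier_rewards(correct, ref_likelihoods, bonus=0.5):
    """Combine a correctness reward with a group-relative consistency bonus.

    correct: list of bools, one per sampled response in the group.
    ref_likelihoods: the likelihood a reference model assigns to each
        response's answer given its reasoning (assumed to be precomputed).
    bonus: size of the consistency bonus (hypothetical value).
    """
    group_mean = sum(ref_likelihoods) / len(ref_likelihoods)
    rewards = []
    for is_correct, likelihood in zip(correct, ref_likelihoods):
        base = 1.0 if is_correct else 0.0
        # Only correct responses that beat their group peers earn the bonus.
        extra = bonus if (is_correct and likelihood > group_mean) else 0.0
        rewards.append(base + extra)
    return rewards

# Toy group of four samples with made-up likelihoods (group mean is 0.5).
rewards = two_tier_rewards([True, True, False, False], [0.9, 0.2, 0.8, 0.1])
```

Here the first sample is correct and its reasoning supports its answer better than the group average, so it receives the highest reward. The second is correct but weakly supported by its own reasoning, so it gets only the base reward.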
Article 627
Title@2025-06-19 (4): Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Title: Batayan: A Filipino NLP benchmark for evaluating Large Language Models | Batayan: Ein philippinischer NLP-Benchmark für die Bewertung großer Sprachmodelle | Batayan:菲律宾国家语言方案评估大语言模型基准 2502.14911v2 |
Authors (8): Jann Railey Montalan, Jimson Paulo Layacan, David Demitri Africa, Richell Isaiah Flores, Michael T. Lopez II, Theresa Denise Magsajo, Anjanette Cayabyab, William Chandra Tjhi
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities on widely benchmarked high-resource languages. However, linguistic nuances of under-resourced languages remain unexplored. We introduce Batayan, a holistic Filipino benchmark that systematically evaluates LLMs across three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, three of which did not previously exist for Filipino corpora, covering both Tagalog and code-switched Taglish utterances. Our rigorous, native-speaker-driven adaptation and validation processes ensure fluency and authenticity to the complex morphological and syntactic structures of Filipino, alleviating the pervasive translationese bias in existing Filipino corpora. We report empirical results on a variety of open-source and commercial LLMs, highlighting significant performance gaps that signal the under-representation of Filipino in pre-training corpora, the unique hurdles in modeling Filipino’s rich morphology and construction, and the importance of explicit Filipino language support. Moreover, we discuss the practical challenges encountered in dataset construction and propose principled solutions for building culturally and linguistically-faithful resources in under-represented languages. We also provide a public evaluation suite as a clear foundation for iterative, community-driven progress in Filipino NLP.
Article 628
Title@2025-06-19 (4): FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
Title: FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning | FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning | FICOT: 以专家财务理由作为研究链的基础 2506.16123v1 |
Authors (6): Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul, Kunat Pipatanakul
This paper presents FinCoT, a structured chain-of-thought (CoT) prompting approach that incorporates insights from domain-specific expert financial reasoning to guide the reasoning traces of large language models. We identify three main prompting styles in FinNLP: (1) standard prompting–zero-shot prompting; (2) unstructured CoT–CoT prompting without an explicit reasoning structure, such as the use of tags; and (3) structured CoT prompting–CoT prompting with explicit instructions or examples that define structured reasoning steps. Previously, FinNLP has primarily focused on prompt engineering with either standard or unstructured CoT prompting. However, structured CoT prompting has received limited attention in prior work. Furthermore, the design of reasoning structures in structured CoT prompting is often based on heuristics from non-domain experts. In this study, we investigate each prompting approach in FinNLP. We evaluate the three main prompting styles and FinCoT on CFA-style questions spanning ten financial domains. We observe that FinCoT improves performance from 63.2% to 80.5% and Qwen-2.5-7B-Instruct from 69.7% to 74.2%, while reducing generated tokens eight-fold compared to structured CoT prompting. Our findings show that domain-aligned structured prompts not only improve performance and reduce inference costs but also yield more interpretable and expert-aligned reasoning traces.
Article 629
Title@2025-06-19 (4): DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents
Title: DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents | DrunkAgent: Stealthy Memory Korruption in LLM-Powered Recommender Agents | DrunkAgent:LLM授权建议代理人的隐性记忆腐败 2503.23804v2 |
Authors (8): Shiyi Yang, Zhibo Hu, Xinshu Li, Chen Wang, Tong Yu, Xiwei Xu, Liming Zhu, Lina Yao
Large language model (LLM)-powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling, where the memory mechanism plays a pivotal role in enabling the agents to autonomously explore, learn and self-evolve from real-world interactions. However, this very mechanism, serving as a contextual repository, inherently exposes an attack surface for potential adversarial manipulations. Despite its central role, the robustness of agentic RSs in the face of such threats remains largely underexplored. Previous works suffer from semantic mismatches or rely on static embeddings or pre-defined prompts, all of which hinder their applicability to systems with dynamic memory states. This challenge is exacerbated by the black-box nature of commercial RSs. To tackle the above problems, in this paper, we present the first systematic investigation of memory-based vulnerabilities in LLM-powered recommender agents, revealing their security limitations and guiding efforts to strengthen system resilience and trustworthiness. Specifically, we propose a novel black-box attack framework named DrunkAgent. DrunkAgent crafts semantically meaningful adversarial textual triggers for target item promotions and introduces a series of strategies to maximize the trigger effect by corrupting the memory updates during the interactions. The triggers and strategies are optimized on a surrogate model, making DrunkAgent transferable and stealthy. Extensive experiments on real-world datasets across diverse agentic RSs, including collaborative filtering, retrieval augmentation and sequential recommendations, demonstrate the generalizability, transferability and stealthiness of DrunkAgent.
Article 630
Title@2025-06-19 (4): On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse
Title: On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse | Über die Grenzen der Sprachgenerierung: Trade-Offs zwischen Halluzination und Modekollaps | 语言产生限制:幻觉与模式崩溃之间的取舍 2411.09642v2 |
Authors (3): Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas
Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language’s full richness. Otherwise, outputting invalid strings constitutes “hallucination,” and failing to capture the full range leads to “mode collapse.” We ask if a language model can meet both requirements. We investigate this within a statistical language generation setting building on Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as training size increases, its output converges to all unseen strings in K. Kleinberg and Mullainathan [KM24] asked if consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with [KM24]’s result, showing consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth. As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth. Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse.
Article 631
Title@2025-06-19 (4): CIVET: Systematic Evaluation of Understanding in VLMs
Title: CIVET: Systematic Evaluation of Understanding in VLMs | CIVET: Systematische Bewertung des Verständnisses in VLMs | CIVET: 系统评估对脆弱、危险、危险和 2506.05146v2 |
Authors (6): Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs’ understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.
Article 632
Title@2025-06-19 (4): Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
Title: Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning | Rethinking External Slow-Thinking: Von Schneeballfehlern zur Wahrscheinlichkeit einer korrekten Begründung | 重新思考外部缓慢思考:从雪球错误到正确理由的概率 2501.15602v3 |
Authors (3): Zeyu Gan, Yun Liao, Yong Liu
Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model’s internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at https://github.com/ZyGan1999/Snowball-Errors-and-Probability.
Article 633
Title@2025-06-19 (4): Probing the Robustness of Large Language Models Safety to Latent Perturbations
Title: Probing the Robustness of Large Language Models Safety to Latent Perturbations | Nachweis der Robustheit großer Sprachmodelle Sicherheit zu latenten Störungen | 检验大语言模型安全性是否强,以证实大语言模型安全性是否足以应对前端扰动 2506.16078v1 |
Authors (10): Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
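The probing signal can be pictured with a short sketch. It is only an illustration of the quantity named in the abstract, not the released code: the function names are hypothetical, the per-token probabilities are made up, and averaging over tokens (rather than summing) is an assumption.

```python
import math

def negative_log_likelihood(token_probs):
    """Mean negative log-likelihood of a response.

    token_probs: the probability the model assigns to each token of the
    original response (assumed to be given; values must be in (0, 1]).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def nll_shift(clean_probs, perturbed_probs):
    """Change in NLL of the same response after a latent perturbation."""
    return (negative_log_likelihood(perturbed_probs)
            - negative_log_likelihood(clean_probs))

# Made-up probabilities of the original (safe) response's tokens,
# before and after perturbing a hidden activation.
clean = [0.9, 0.8, 0.95]
perturbed = [0.5, 0.4, 0.6]
shift = nll_shift(clean, perturbed)
```

A large positive shift means the perturbation made the original response much less likely, which is the kind of local sensitivity the probe is described as measuring.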
Article 634
Title@2025-06-19 (4): Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
Title: Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content | Täuschender Humor: Ein synthetischer Mehrsprachiger Benchmark-Datensatz zur Überbrückung von fabrizierten Claims mit humorvollem Inhalt | 欺骗性幽默:一个合成多语种基准数据集,用于将制造索赔与幽默内容连接起来 2503.16031v3 |
Authors (3): Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya
In the evolving landscape of online discourse, misinformation increasingly adopts humorous tones to evade detection and gain traction. This work introduces Deceptive Humor as a novel research direction, emphasizing how false narratives, when coated in humor, can become more difficult to detect and more likely to spread. To support research in this space, we present the Deceptive Humor Dataset (DHD), a collection of humor-infused comments derived from fabricated claims using the ChatGPT-4o model. Each entry is labeled with a Satire Level (from 1 for subtle satire to 3 for overt satire) and categorized into five humor types: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset spans English, Telugu, Hindi, Kannada, Tamil, and their code-mixed forms, making it a valuable resource for multilingual analysis. DHD offers a structured foundation for understanding how humor can serve as a vehicle for the propagation of misinformation, subtly enhancing its reach and impact. Strong baselines are established to encourage further research and model development in this emerging area.
Article 635
Title@2025-06-19 (4): Cyberbullying Detection in Hinglish Text Using MURIL and Explainable AI
Title: Cyberbullying Detection in Hinglish Text Using MURIL and Explainable AI | Cyberbullying-Erkennung im abschreckenden Text mit MURIL und erklärbarer KI | 使用 MURIL 和可解释的 AI 的 Hinglish 文本中的网络欺凌探测 2506.16066v1 |
Authors (1): Devesh Kumar
The growth of digital communication platforms has led to increased cyberbullying incidents worldwide, creating a need for automated detection systems to protect users. The rise of code-mixed Hindi-English (Hinglish) communication on digital platforms poses challenges for existing cyberbullying detection systems, which were designed primarily for monolingual text. This paper presents a framework for cyberbullying detection in Hinglish text using the Multilingual Representations for Indian Languages (MURIL) architecture to address limitations in current approaches. Evaluation across six benchmark datasets – Bohra et al., BullyExplain, BullySentemo, Kumar et al., HASOC 2021, and Mendeley Indo-HateSpeech – shows that the MURIL-based approach outperforms existing multilingual models including RoBERTa and IndicBERT, with improvements of 1.36 to 13.07 percentage points and accuracies of 86.97% on Bohra, 84.62% on BullyExplain, 86.03% on BullySentemo, 75.41% on Kumar datasets, 83.92% on HASOC 2021, and 94.63% on Mendeley dataset. The framework includes explainability features through attribution analysis and cross-linguistic pattern recognition. Ablation studies show that selective layer freezing, appropriate classification head design, and specialized preprocessing for code-mixed content improve detection performance, while failure analysis identifies challenges including context-dependent interpretation, cultural understanding, and cross-linguistic sarcasm detection, providing directions for future research in multilingual cyberbullying detection.
Article 636
Title@2025-06-19 (4): Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning
Title: Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning | Selbstkritische Kuriositätsverfeinerung: Ehrlichkeit und Hilfsbereitschaft in großen Sprachmodellen durch In-Context Learning verbessern | 自我批评、指导的精炼好奇力改进:通过内文学习加强大语言模式的诚实性和帮助性 2506.16064v1 |
Authors (2): Duc Hieu Ho, Chenglin Fan
Large language models (LLMs) have demonstrated robust capabilities across various natural language tasks. However, producing outputs that are consistently honest and helpful remains an open challenge. This paper tackles the problem in two complementary directions. It conducts a comprehensive benchmark evaluation of ten widely used large language models, including both proprietary and open-weight models from OpenAI, Meta, and Google. In parallel, it proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting. The key idea behind this strategy is enabling models to self-critique and refine their responses without additional training. The proposed method extends the curiosity-driven prompting strategy by incorporating two lightweight in-context steps: a self-critique step and a refinement step. Experimental results on the HONESET dataset, evaluated using the $\mathrm{H}^2$ (honesty and helpfulness) framework with GPT-4o as the judge, show consistent improvements across all models. The approach reduces the number of poor-quality responses, increases high-quality responses, and achieves relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting across evaluated models. These results highlight the effectiveness of structured self-refinement as a scalable and training-free strategy to improve the trustworthiness of LLM outputs.
Article 637
Title@2025-06-19 (4): BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs
Title: BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs | BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen | 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v3 |
Authors (4): Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino
A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make novel, creative arguments that try to expand the law in a new direction and are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.
Article 638
Title@2025-06-19 (4): Knee-Deep in C-RASP: A Transformer Depth Hierarchy
Title: Knee-Deep in C-RASP: A Transformer Depth Hierarchy | Knie-Tief in C-RASP: Eine Transformer-Tiefe Hierarchie | C-RASP:变异深度分层 2506.16055v1 |
Authors (3): Andy Yang, Michaël Cadilhac, David Chiang
It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained with greater depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). These results are established by studying a form of temporal logic with counting operators, which was shown equivalent to C-RASP in previous work. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
Article 639
Title@2025-06-19 (4): A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text
Title: A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text | Ein hybrides DeBERTa- und Gated-Breites Lernsystem für Cyberbullying Detection im englischen Text | 混合德贝塔和通用广学系统,用于网络欺凌探测的英文文本 2506.16052v1 |
Authors (1): Devesh Kumar
The proliferation of online communication platforms has created unprecedented opportunities for global connectivity while simultaneously enabling harmful behaviors such as cyberbullying, which affects approximately 54.4% of teenagers according to recent research. This paper presents a hybrid architecture that combines the contextual understanding capabilities of transformer-based models with the pattern recognition strengths of broad learning systems for effective cyberbullying detection. This approach integrates a modified DeBERTa model augmented with Squeeze-and-Excitation blocks and sentiment analysis capabilities with a Gated Broad Learning System (GBLS) classifier, creating a synergistic framework that outperforms existing approaches across multiple benchmark datasets. The proposed ModifiedDeBERTa + GBLS model achieved good performance on four English datasets: 79.3% accuracy on HateXplain, 95.41% accuracy on SOSNet, 91.37% accuracy on Mendeley-I, and 94.67% accuracy on Mendeley-II. Beyond performance gains, the framework incorporates comprehensive explainability mechanisms including token-level attribution analysis, LIME-based local interpretations, and confidence calibration, addressing critical transparency requirements in automated content moderation. Ablation studies confirm the meaningful contribution of each architectural component, while failure case analysis reveals specific challenges in detecting implicit bias and sarcastic content, providing valuable insights for future improvements in cyberbullying detection systems.
Article 640
Title@2025-06-19 (4): Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
Title: Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping | Leiter-Residual: Parallelismus-bewusste Architektur zur Beschleunigung großer Modellinferenz mit Kommunikationsüberlappung | 云梯-残余:加速大型模型推断与通信重叠的平行意识结构 2501.06589v5 |
Authors (10): Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism requires communication between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and a 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. We release our code for training and inference for easier replication of experiments.
Article 641
Title@2025-06-19 (4): DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling
Title: DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling | DynScaling: Effizientes Verifier-freies Inferenzscaling über dynamische und integrierte Sampling | DynSACLAG:通过动态和综合抽样,提高验证人无引文的有效比例 2506.16043v1 |
Authors (5): Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
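The budget-allocation idea in this abstract can be illustrated with a small sketch. It is not the DynScaling algorithm: it swaps in a simplified greedy rule (give the next sample to the query whose answers so far disagree the most, measured by entropy) in place of a real multi-armed bandit policy. The names `answer_entropy`, `allocate`, and the toy samplers are hypothetical.

```python
import itertools
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution (0.0 when all answers agree)."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def allocate(samplers, total_budget, init=2):
    """Greedy stand-in for a bandit policy (an assumption, not the paper's method).

    samplers: dict mapping a query id to a zero-argument callable that returns one answer.
    Every query first receives `init` samples; each remaining unit of budget goes to the
    query whose answers are currently most uncertain.
    """
    answers = {q: [draw() for _ in range(init)] for q, draw in samplers.items()}
    spent = init * len(samplers)
    while spent < total_budget:
        target = max(answers, key=lambda q: answer_entropy(answers[q]))
        answers[target].append(samplers[target]())
        spent += 1
    return answers

# Toy usage: one query always yields the same answer, the other keeps flipping.
samplers = {
    "easy": lambda: "4",
    "hard": itertools.cycle("ab").__next__,
}
result = allocate(samplers, total_budget=7)
samples_per_query = {q: len(a) for q, a in result.items()}
```

In this toy run the extra budget flows to the query with disagreeing answers, which is the behaviour the abstract attributes to uncertainty-based allocation.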
Article 642
Title@2025-06-19 (4): Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3
Title: Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3 | Verbesserung der Dokumenten-Fragebeantwortung mittels Multi-Hop Retrieval-Augmented Generation mit LLaMA 3 | 通过多层检索-提法一代加强文件层面的回答问题,LLAMA 3 2506.16037v1 |
Authors (6): Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu
This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model’s robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.
Article 643
Title@2025-06-19 (4): Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives
Title: Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives | Re-TASK: LLM-Aufgaben aus Capability, Skill und Knowledge Perspectives überarbeiten | 重新研究TASK:从能力、技能和知识角度重新研究LLM任务 2408.06904v3 |
Authors (10): Zhihu Wang, Shiwan Zhao, Yu Wang, Heyuan Huang, Sitao Xie, Yubo Zhang, Jiaxin Shi, Zhixing Wang, Hongyan Li, Junchi Yan
The Chain-of-Thought (CoT) paradigm has become a pivotal method for solving complex problems with large language models (LLMs). However, its application to domain-specific tasks remains challenging, as LLMs often fail to decompose tasks accurately or execute subtasks effectively. This paper introduces the Re-TASK framework, a novel theoretical model that revisits LLM tasks from capability, skill, and knowledge perspectives, drawing on the principles of Bloom’s Taxonomy and Knowledge Space Theory. While CoT provides a workflow-centric perspective on tasks, Re-TASK introduces a Chain-of-Learning (CoL) paradigm that highlights task dependencies on specific capability items, further broken down into their constituent knowledge and skill components. To address CoT failures, we propose a Re-TASK prompting strategy, which strengthens task-relevant capabilities through targeted knowledge injection and skill adaptation. Experiments across diverse domains demonstrate the effectiveness of Re-TASK. In particular, we achieve improvements of 45.00% on Yi-1.5-9B and 24.50% on Llama3-Chinese-8B for legal tasks. These results highlight the potential of Re-TASK to significantly enhance LLM performance and its applicability in specialized domains. We release our code and data at https://github.com/Uylee/Re-TASK.
Article 644
Title@2025-06-19 (4): EvoLM: In Search of Lost Language Model Training Dynamics
Title: EvoLM: In Search of Lost Language Model Training Dynamics | EvoLM: Auf der Suche nach verlorenen Sprachmodellen | EvoLM: 寻找失传语言培训模式 2506.16029v1 |
Authors (9): Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric Xing, Sham Kakade, Hanlin Zhang
Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs’ training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.
Article 645
Title@2025-06-19 (4): From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation
Title: From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation | Von General zu Targeted Rewards: Übertreffen von GPT-4 in Open-Ended Long-Context Generation | 从一般奖励到有目标的奖励:在不限期长长一代中取代GPT-4 2506.16024v1 |
Authors (7): Zhihan Guo, Jiele Wu, Wenqian Cui, Yifei Zhang, Minda Hu, Yufei Wang, Irwin King
Current research on long-form context in Large Language Models (LLMs) primarily focuses on the understanding of long contexts, while Open-ended Long Text Generation (Open-LTG) remains insufficiently explored. Training a long-context generation model requires curation of gold-standard reference data, which is typically nonexistent for informative Open-LTG tasks. However, previous methods only utilize general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce ProxyReward, an innovative reinforcement learning (RL) based framework, which includes a dataset and a reward signal computation method. First, the ProxyReward Dataset is generated through simple prompts that enable the model to create it automatically, obviating extensive labeled data or significant manual effort. Second, the ProxyReward Signal offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method, ProxyReward, surpasses even GPT-4-Turbo. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.
Article 646
Title@2025-06-19 (4): Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models
Title: Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models | Min-p, Max Übertreibung: Eine kritische Analyse der Min-p-Sampling in Sprachmodellen | Min-p, Max Explation: 对语言模型的 Min-p 抽样的批判性分析 2506.13681v2 |
Authors (3): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch
Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. 2024’s “Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs” introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper’s recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper’s four lines of evidence. First, the original paper’s human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. Second, comprehensively sweeping the original paper’s NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper’s LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
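For readers unfamiliar with the sampler under discussion, the following is a minimal sketch of min-p as it is commonly described: tokens whose probability falls below a fraction `p_base` of the most likely token's probability are discarded, and the rest are renormalized before sampling. The function names and the toy distribution are illustrative assumptions, not code from either paper.

```python
import random

def min_p_filter(probs, p_base=0.1):
    """Zero out tokens with probability below p_base * max(probs), then renormalize."""
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0: the most likely token survives for p_base <= 1
    return [p / total for p in kept]

def sample_min_p(probs, p_base=0.1, rng=random):
    """Draw one token index from the min-p filtered distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Toy next-token distribution over five tokens (illustrative values only).
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
filtered = min_p_filter(probs, p_base=0.1)  # threshold is 0.1 * 0.5 = 0.05
token = sample_min_p(probs, p_base=0.1, rng=random.Random(0))
```

Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat; `p_base` is the single hyperparameter that comparisons against top-k and top-p have to account for.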
Article 647
Title@2025-06-19 (4): Detecting Prefix Bias in LLM-based Reward Models
Title: Detecting Prefix Bias in LLM-based Reward Models | Präfix Bias in LLM-basierten Prämienmodellen erkennen | 在以LLM为基础的奖励模型中检测前功ix Bias 2505.13487v2 |
Authors (5): Ashwin Kumar, Yuzi He, Aram H. Markosyan, Bobbie Chern, Imanol Arrieta-Ibarra
Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias – a systematic shift in model preferences triggered by minor variations in query prefixes – in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.
Article 648
Title@2025-06-19 (4): Bayesian Epistemology with Weighted Authority: A Formal Architecture for Truth-Promoting Autonomous Scientific Reasoning
Title: Bayesian Epistemology with Weighted Authority: A Formal Architecture for Truth-Promoting Autonomous Scientific Reasoning | Bayesische Epistemologie mit gewichteter Autorität: Eine formale Architektur für wahrheitsfördernde autonome wissenschaftliche Vernunft | Bayesian Bayesian 与加权当局合作的巴耶斯派:促进真相促进自主科学理由的正式架构 2506.16015v1 |
Authors (1): Craig S. Wright
The exponential expansion of scientific literature has surpassed the epistemic processing capabilities of both human experts and current artificial intelligence systems. This paper introduces Bayesian Epistemology with Weighted Authority (BEWA), a formally structured architecture that operationalises belief as a dynamic, probabilistically coherent function over structured scientific claims. Each claim is contextualised, author-attributed, and evaluated through a system of replication scores, citation weighting, and temporal decay. Belief updates are performed via evidence-conditioned Bayesian inference, contradiction processing, and epistemic decay mechanisms. The architecture supports graph-based claim propagation, authorial credibility modelling, cryptographic anchoring, and zero-knowledge audit verification. By formalising scientific reasoning into a computationally verifiable epistemic network, BEWA advances the foundation for machine reasoning systems that promote truth utility, rational belief convergence, and audit-resilient integrity across dynamic scientific domains.
Article 649
Title@2025-06-19 (4): Beyond Prediction – Structuring Epistemic Integrity in Artificial Reasoning Systems
Title: Beyond Prediction – Structuring Epistemic Integrity in Artificial Reasoning Systems | Jenseits der Vorhersage – Strukturierung epistemischer Integrität in künstlichen Vernunftsystemen | 超越预测 – – 构建人造理由说明系统中的宇宙完整性 2506.17331v1 |
Authors (1): Craig Steven Wright
This paper develops a comprehensive framework for artificial intelligence systems that operate under strict epistemic constraints, moving beyond stochastic language prediction to support structured reasoning, propositional commitment, and contradiction detection. It formalises belief representation, metacognitive processes, and normative verification, integrating symbolic inference, knowledge graphs, and blockchain-based justification to ensure truth-preserving, auditably rational epistemic agents.
Article 650
Title@2025-06-19 (4): Large-Scale Data Selection for Instruction Tuning
Title: Large-Scale Data Selection for Instruction Tuning | Großformatige Datenauswahl für die Instruction Tuning | 用于教学图示的大型数据选择 2503.01807v2 |
Authors (5): Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested – all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
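The representation-based selection described above can be sketched roughly as follows. This is a simplified illustration rather than the released implementation: the weighting scheme, the cosine-similarity scoring against a target embedding, and the names `weighted_mean_pool`, `cosine`, and `select_top_k` are all assumptions, and the vectors are toy stand-ins for LM hidden states.

```python
import math

def weighted_mean_pool(hidden_states, weights):
    """Pool T token vectors (lists of D floats) into one vector using per-token weights."""
    total = sum(weights)
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) / total
        for d in range(dim)
    ]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_top_k(pool_embeddings, target_embedding, k):
    """Return indices of the k pool items most similar to the target embedding."""
    ranked = sorted(
        range(len(pool_embeddings)),
        key=lambda i: cosine(pool_embeddings[i], target_embedding),
        reverse=True,
    )
    return ranked[:k]

# Toy usage: later tokens get larger weights (an assumed weighting scheme).
pooled = weighted_mean_pool([[1.0, 0.0], [0.0, 1.0]], weights=[1.0, 3.0])
chosen = select_top_k([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0], k=2)
```

Selection here needs only one forward pass per candidate to obtain hidden states plus a similarity ranking, which is consistent with the abstract's point that the approach is comparatively compute-efficient.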
Article 651
Title@2025-06-19 (4): SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes
Title: SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes | SDE-SQL: Verbesserung der Text-zu-SQL-Generierung in großen Sprachmodellen durch selbstgesteuerte Exploration mit SQL-Probes | SDE-SQL:通过自发探索SQL勘探,在大语言模型中加强从文字到SQL的生成 2506.07245v2 |
Authors (3): Wenxuan Xie, Yaxun Dai, Wenhao Jiang
Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model’s ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
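To make the notion of an SQL probe concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table, its contents, and the helper `run_probe` are hypothetical; in the setting the abstract describes, the probe text would be generated by the model and the observed rows fed back into its context before it writes the final query.

```python
import sqlite3

def run_probe(conn, sql, limit=5):
    """Execute an exploratory query and return at most `limit` rows."""
    cursor = conn.execute(sql)
    return cursor.fetchmany(limit)

# Hypothetical toy database standing in for a benchmark schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("A", "Alameda"), ("B", "Alameda"), ("C", "Fresno")],
)

# A probe that checks which values a column actually contains, so the final
# query can filter on a string that exists in the data.
probe = "SELECT DISTINCT county FROM schools ORDER BY county"
observed = run_probe(conn, probe)

# The observation informs the final query (written by hand here, not generated).
final_sql = "SELECT COUNT(*) FROM schools WHERE county = ?"
answer = conn.execute(final_sql, (observed[0][0],)).fetchone()[0]
```

Capping the number of returned rows keeps each observation short enough to place back into the prompt.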
Article 652
Title@2025-06-19 (4): On Domain-Adaptive Post-Training for Multimodal Large Language Models
Title: On Domain-Adaptive Post-Training for Multimodal Large Language Models | Zum Domain-Adaptive Post-Training für multimodale große Sprachmodelle | 关于多模式大语言模式的多模式后培训 2411.19930v3 |
Authors (8): Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is highly significant in promoting their practical applications. This paper systematically investigates domain adaptation of MLLMs via post-training, focusing on data synthesis, training pipeline, and task evaluation. (1) Data Synthesis: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks based on domain-specific image-caption pairs. The resulting data surpass the data synthesized by manual rules or strong closed-source models in enhancing domain-specific performance. (2) Training Pipeline: Unlike general MLLMs that typically adopt a two-stage training paradigm, we find that a single-stage approach is more effective for domain adaptation. (3) Task Evaluation: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing, by post-training a variety of MLLMs and then evaluating MLLM performance on various domain-specific tasks. Finally, we fully open-source our models, code, and data to encourage future research in this area.
Article 653
Title@2025-06-19 (4): Core Knowledge Deficits in Multi-Modal Language Models
Title: Core Knowledge Deficits in Multi-Modal Language Models | Kernwissensdefizite in multimodalen Sprachmodellen | 多模式语言模型中的核心知识缺陷 2410.10855v4 |
Authors (11): Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng
While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities over high-level perception and reasoning, their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge–rudimentary cognitive abilities innate to humans from early childhood. To explore the core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones. Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding, but instead rely on shortcut learning as they scale.
Article 654
Title@2025-06-19 (4): Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Title: Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion | Doppelentendre: Robuste audiobasierte KI-generierte Lyrics-Erkennung über Multi-View Fusion | 双向内容: 强力音频根据 AI 生成的音频通过多视图组合探测 2506.15981v1 |
Authors (4): Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
Article 655
Title@2025-06-19 (4): A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension
Title: A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension | Ein vietnamesischer Datensatz für Textsegmentierung und Multiple Choices Leseverständnis | 用于文字分割和多重选择的越南数据集阅读理解 2506.15978v1 |
Authors (4): Toan Nguyen Hai, Ha Nguyen Viet, Truong Quan Xuan, Duc Do Minh
Vietnamese, the 20th most spoken language with over 102 million native speakers, lacks robust resources for key natural language processing tasks such as text segmentation and machine reading comprehension (MRC). To address this gap, we present VSMRC, the Vietnamese Text Segmentation and Multiple-Choice Reading Comprehension Dataset. Sourced from Vietnamese Wikipedia, our dataset includes 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs generated with human quality assurance, ensuring a reliable and diverse resource. Experiments show that mBERT consistently outperforms monolingual models on both tasks, achieving an accuracy of 88.01% on the MRC test set and an F1 score of 63.15% on the text segmentation test set. Our analysis reveals that multilingual models excel in NLP tasks for Vietnamese, suggesting potential applications to other under-resourced languages. VSMRC is available at HuggingFace.
Article 656
Title@2025-06-19 (4): FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Title: FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation | FEA-Bench: Ein Benchmark für die Bewertung der Code-Generierung auf Repository-Ebene für die Feature-Implementierung | FEA-Bench:评估存储器一级实施地物代码生成的基准 2503.06680v2 |
Authors (9): Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to possess both code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
Article 657
Title@2025-06-19 (4): Multi-use LLM Watermarking and the False Detection Problem
Title: Multi-use LLM Watermarking and the False Detection Problem | Multi-Use LLM Watermarking und das Problem der falschen Erkennung | 多用途LLM LLM 水标志和假探测问题 2506.15975v1 |
Authors (2): Zihao Fu, Chris Russell
Digital watermarking is a promising solution for mitigating some of the risks arising from the misuse of automatically generated text. These approaches either embed non-specific watermarks to allow for the detection of any text generated by a particular sampler, or embed specific keys that allow the identification of the LLM user. However, simultaneously using the same embedding for both detection and user identification leads to a false detection problem, whereby, as user capacity grows, unwatermarked text is increasingly likely to be falsely detected as watermarked. Through theoretical analysis, we identify the underlying causes of this phenomenon. Building on these insights, we propose Dual Watermarking, which jointly encodes detection and identification watermarks into generated text, significantly reducing false positives while maintaining high detection accuracy. Our experimental results validate our theoretical findings and demonstrate the effectiveness of our approach.
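One way to build intuition for the false detection problem is the standard multiple-comparisons calculation: if unwatermarked text is tested against each user key separately, the chance that at least one test fires grows with the number of keys. The sketch below assumes independent per-key tests with a fixed per-key false positive rate; both assumptions and the function name are illustrative and are not taken from the paper's analysis.

```python
def false_detection_rate(per_key_fpr: float, num_keys: int) -> float:
    """Probability that at least one of `num_keys` key tests fires on unwatermarked text.

    Assumes the tests are independent and share the same false positive rate,
    which is a simplification made for this illustration.
    """
    if not 0.0 <= per_key_fpr <= 1.0:
        raise ValueError("per_key_fpr must lie in [0, 1]")
    if num_keys < 0:
        raise ValueError("num_keys must be non-negative")
    return 1.0 - (1.0 - per_key_fpr) ** num_keys


# The overall false detection rate rises as user capacity (number of keys) grows.
for num_keys in (1, 10, 100, 1000):
    rate = false_detection_rate(0.001, num_keys)
    print(f"{num_keys:>5} keys -> false detection rate {rate:.4f}")
```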
Article 658
Title@2025-06-19 (4): LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Title: LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning | LazyEviction: Verlangsamte KV-Eviktion mit Aufmerksamkeitsmusterbeobachtung für effizientes Long Reasoning | LazyEvition: 以关注方式对有效长长理由进行观察的Lucking KV驱逐 2506.15969v1 |
Authors (5): Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo
Large Language Models (LLMs) exhibit enhanced reasoning capabilities by employing Chain-of-Thought (CoT). However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache size, particularly in tasks requiring long reasoning sequences, such as mathematics and programming. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens receive renewed attention after multiple decoding steps, which existing works fail to capture and which may lead to unpredictable eviction of such periodically critical tokens. To address this, we propose LazyEviction, a lagged KV eviction framework designed to maintain reasoning performance while reducing KV memory. LazyEviction is an Observation Window-based Lagged Eviction Mechanism that retains latent recurring tokens by performing lagged evictions across decoding steps. It contains two key components: (1) Recurrence Interval Tracking for capturing temporal variations in token importance, and (2) a Maximum Recurrence Interval-Centric Eviction Policy that prioritizes eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache size by 50% while maintaining comparable accuracy on mathematics reasoning datasets, outperforming state-of-the-art methods. Our findings highlight the importance of preserving recurring tokens, which are critical for maintaining knowledge continuity in multi-step reasoning tasks.
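The two components named in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the attention threshold, the `TokenStats` structure, and the "overdue" ranking rule (idle time minus the token's own maximum recurrence interval) are all hypothetical choices made to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TokenStats:
    last_active_step: int             # last step at which the token received notable attention
    max_recurrence_interval: int = 0  # longest gap seen so far between two such steps


def update_stats(stats: Dict[int, TokenStats], attention: Dict[int, float],
                 step: int, threshold: float = 0.05) -> None:
    """Recurrence interval tracking: record gaps between renewed attention to each token."""
    for token_id, weight in attention.items():
        # A token seen for the first time is treated as active at the current step.
        entry = stats.setdefault(token_id, TokenStats(last_active_step=step))
        if weight >= threshold and step > entry.last_active_step:
            gap = step - entry.last_active_step
            entry.max_recurrence_interval = max(entry.max_recurrence_interval, gap)
            entry.last_active_step = step


def select_evictions(stats: Dict[int, TokenStats], step: int, budget: int) -> List[int]:
    """Pick tokens to evict so that at most `budget` remain.

    Tokens whose idle time most exceeds their own maximum recurrence interval
    are evicted first (an assumed ranking rule for this sketch).
    """
    excess = len(stats) - budget
    if excess <= 0:
        return []

    def overdue(token_id: int) -> int:
        entry = stats[token_id]
        return (step - entry.last_active_step) - entry.max_recurrence_interval

    return sorted(stats, key=overdue, reverse=True)[:excess]
```

In a lagged scheme, `select_evictions` would be called once per observation window rather than at every decoding step, so that tokens with long recurrence intervals have a chance to reappear before they are dropped.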
Article 659
Title@2025-06-19 (4): Reranking-based Generation for Unbiased Perspective Summarization
Title: Reranking-based Generation for Unbiased Perspective Summarization | Reranking-basierte Generation für unvoreingenommene Perspektive Zusammenfassung | 重新排名的无偏见观点概述一代 2506.15925v1 |
Authors (3): Narutatsu Ri, Nicholas Deas, Kathleen McKeown
Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
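The reranking step described above can be sketched as best-of-N selection: generate several candidate summaries, score each with a metric, and keep the top one. The best and worst candidates can also be kept as a preference pair for later tuning. The scorer below is a toy stand-in (fraction of key terms covered) and, like the function names, is an assumption for illustration; the paper uses language model-based metrics.

```python
from typing import Callable, Sequence, Tuple


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate summary."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def preference_pair(candidates: Sequence[str], score: Callable[[str], float]) -> Tuple[str, str]:
    """Return (chosen, rejected): the best and worst candidates under the metric."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]


def toy_coverage(key_terms: Sequence[str]) -> Callable[[str], float]:
    """Toy metric: fraction of key terms that appear in the summary (hypothetical)."""
    def score(summary: str) -> float:
        text = summary.lower()
        return sum(term.lower() in text for term in key_terms) / len(key_terms)
    return score


candidates = [
    "The bill raises taxes.",
    "Supporters say the bill funds schools; critics say it raises taxes.",
]
metric = toy_coverage(["schools", "taxes"])
print(rerank(candidates, metric))
```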