• 00 07-24 (4) Checklists Are Better Than Reward Models For Aligning Language Models Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen 核对列表比奖励模型更好调整语言模型 2507.18624v1
  • 01 07-24 TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen TRPrompt: 从文本奖励中促进解答询问软件快速优化 2507.18618v1
  • 02 07-24 SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift 合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明 2507.18616v1
  • 03 07-24 BEARCUBS: A benchmark for computer-using web agents BEARCUBS: Benchmark für computergestützte Web-Agenten BEARCUBS:计算机使用网络代理器的基准 2503.07919v3
  • 04 07-24 Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs 粗略的登录抽样:加速在LLMs中进行知识蒸馏 2503.16870v2
  • 05 07-24 Scaling RL to Long Videos Skalierung von RL zu langen Videos 缩放 RL 到长视频 2507.07966v2
  • 06 07-24 What Makes You CLIC: Detection of Croatian Clickbait Headlines Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen 是什么让你成为CLIC:发现克罗地亚点击头条头条 2507.14314v2
  • 07 07-24 AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs AQuilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用 2507.18584v1
  • 08 07-24 DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索 2507.18583v1
  • 09 07-24 System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung 供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告 2507.18580v1
  • 10 07-24 P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v3
  • 11 07-24 Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs 宽放, 窄出: 为高效和有效DLLMs而可撤销的解码 2507.18578v1
  • 12 07-24 LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架 2507.16809v2
  • 13 07-24 SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz SafeWork-R1:根据AI-45$^{\circ}$ 法发展安全和情报 2507.18576v1
  • 14 07-24 Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning Agentar-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning Agentar-Fin-R1:通过领域专门知识、培训效率和高级理由加强金融情报 2507.16802v3
  • 15 07-24 PosterMate: Audience-driven Collaborative Persona Agents for Poster Design PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design PosterMate:由观众驱动的海报设计合作人员代理 2507.18572v1
  • 16 07-24 Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden 使用字节对等编码和K-MER方法的DNA语言模型混合化战略 2507.18570v1
  • 17 07-24 GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung GIIFT: 图表制导感性不含图像的无图像多式机器翻译 2507.18562v1
  • 18 07-24 Identity-related Speech Suppression in Generative AI Content Moderation Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation 在产生AI 内容调节中禁止与身份有关的言语 2409.13725v3
  • 19 07-24 LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind LagKV: KV 缓存的拉格-相对信息告诉哪个 Tokens 重要 2504.04704v2
  • 20 07-24 GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle GLiNER2:具有Schema-Driven界面的高效多任务信息提取系统 2507.18546v1
  • 21 07-24 Effective Multi-Task Learning for Biomedical Named Entity Recognition Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung 有效多任务学习促进生物医学命名实体的识别 2507.18542v1
  • 22 07-24 The Moral Gap of Large Language Models Die moralische Kluft großer Sprachmodelle 大语言模式的道德差距 2507.18523v1
  • 23 07-24 GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Zeichenähnlichkeitsnetzwerke GCC-Spam:通过GAN、对比学习和字符相似性网络检测垃圾邮件 2507.14679v2
  • 24 07-24 Exploiting individual differences to bootstrap communication Nutzung individueller Unterschiede zur Bootstrap-Kommunikation 利用个人差异进行靴套通信 2504.05211v2
  • 25 07-24 Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen 并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习 2507.18504v1
  • 26 07-24 LLM-based Embedders for Prior Case Retrieval LLM-basierte Embedders für frühere Fallwiederherstellung 用于先前个案检索的LLM 以LLM为基础的嵌入器 2507.18455v1
  • 27 07-24 Generation of Synthetic Clinical Text: A Systematic Review Generierung von synthetischem klinischem Text: Eine systematische Übersicht 合成临床文本的生成:系统综述 2507.18451v1
  • 28 07-24 Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language Wiederherstellung des Rhythmus: Interpunktionswiederherstellung mit Transformer-Modellen für Bangla, eine ressourcenarme Sprache 恢复节奏:使用Transformer模型为低资源语言孟加拉语恢复标点 2507.18448v1
  • 29 07-24 AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten AraTable:对LLMs关于阿拉伯语表格数据的推理和理解进行基准测试 2507.18442v1
  • 30 07-24 IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung IPCGRL: 程序生成阶段语言教学强化学习 2503.12358v4
  • 31 07-24 DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten DEFAME: 与多模态专家进行基于动态证据的事实核查 2412.10510v4
  • 32 07-24 How do language models learn facts? Dynamics, curricula and hallucinations Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen 语言模式如何了解事实?动态、课程和幻觉 2503.21676v2
  • 33 07-24 FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度 2507.18417v1
  • 34 07-24 ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models ExpliCa: Explizite kausale Vernunft in großen Sprachmodellen bewerten ExpliCa:在大语言模型中评估明确的因果推理 2502.15487v3
  • 35 07-24 Factual Inconsistencies in Multilingual Wikipedia Tables Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen 多语言维基百科表格中的事实不一致 2507.18406v1
  • 36 07-24 CLEAR: Error Analysis via LLM-as-a-Judge Made Easy CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht CLEAR:通过LLM-as-a-Judge轻松进行错误分析 2507.18392v1
  • 37 07-24 Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v2
  • 38 07-24 Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs 超越简介:从地平面事实到深人模拟LLMM 2502.12988v3
  • 39 07-24 Mechanistic Indicators of Understanding in Large Language Models Mechanistische Indikatoren des Verstehens in großen Sprachmodellen 大语言模型中理解力的机械指标 2507.08017v3
  • 40 07-24 Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz 宣传探测混合说明:将LLM预告与人类情报相结合 2507.18343v1
  • 41 07-24 TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习 2507.18340v1
  • 42 07-24 Uncertainty Quantification for Evaluating Machine Translation Bias Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias 评价机器翻译偏见的不确定性定量 2507.18338v1
  • 43 07-24 A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v2
  • 44 07-24 BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit BadReasoner: Einpflanzen einstellbarer Overthinking-Hintertüren in große Reasoning-Modelle für Spaß oder Profit BadReasoner: 为了娱乐或利润在大型推理模型中植入可调节的过度思考后门 2507.18305v1
  • 45 07-24 LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle LoRA-Leak:对LORA精调语言模式的成员推论攻击 2507.18302v1
  • 46 07-24 DocTER: Evaluating Document-based Knowledge Editing DocTER: Dokumentbasierte Wissensbearbeitung bewerten DocTER:评价基于文件的知识编辑 2308.09954v2
  • 47 07-24 Step-Audio 2 Technical Report Step-Audio 2 Technischer Bericht Step-Audio 2 技术报告 2507.16632v2
  • 48 07-24 VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集 2407.19795v2
  • 49 07-24 StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer StyleAdaptedLM: Verbesserung von Instruction-Following-Modellen mit effizientem Stiltransfer StyleAdaptedLM:通过高效风格迁移增强指令遵循模型 2507.18294v1
  • 50 07-24 Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil 低资源语言的准确性:僧伽罗语和泰米尔语比较分析 2507.18264v1
  • 51 07-24 Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen 目的和重点:加强语言语言模式术语翻译 2507.18263v1
  • 52 07-24 Meta Prompting for AI Systems Meta Prompting für KI-Systeme AI 系统的模拟模拟 2311.11482v8
  • 53 07-24 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐 2507.18212v1
  • 54 07-24 Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v2
  • 55 07-24 Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen 探讨指导指导对LLM对错误信息易感性的影响 2507.18203v1
  • 56 07-24 Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung 使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法 2507.18202v1
  • 57 07-24 Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation 将符合ISO30401的知识管理系统纳入一个组织的现有业务流程 2507.18197v1
  • 58 07-24 TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架 2507.18190v1
  • 59 07-24 SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models SCOPE: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle SCOPE:评估大语言模式的随机和反偏见选项安置 2507.18182v1
  • 60 07-24 Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen 坚持平均值:在文本嵌入模型中检测粘力 2507.18171v1
  • 61 07-24 Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR 最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾 2507.18161v1
  • 62 07-24 A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven 事件原因识别调查:分类、挑战、评估和前景 2411.10371v5
  • 63 07-24 Large Language Models in Argument Mining: A Survey Große Sprachmodelle im Argumentbergbau: Eine Umfrage 争议采矿大语言模型:调查 2506.16383v4
  • 64 07-24 Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle 争取更大程度的利用:提高有效混合专家语言模式法的规模 2507.17702v2
  • 65 07-24 Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice Seed LiveInterpret 2.0: End-to-End Simultanübersetzung mit Ihrer Stimme 种子实况解释2.0:用声音翻译终端到终端同声语音语音 2507.17527v2
  • 66 07-24 HIVMedQA: Benchmarking large language models for HIV medical decision support HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准 2507.18143v1
  • 67 07-24 MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning MathOPEval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning MathOPEval:数学理由中MLLMs视觉操作精美评价基准 2507.18140v1
  • 68 07-24 OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Rationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPeRA: 人类在线购物行为模拟观察、人、理由和评估LLMs的数据集 2506.05606v4
  • 69 07-24 A Survey of Deep Learning for Geometry Problem Solving Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen 解决几何问题深层学习调查 2507.11936v3
  • 70 07-24 GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness GOAT-SLM: Ein gesprochenes Sprachmodell mit paralinguistischem und Lautsprechercharakteristischem Bewusstsein GOAT-SLM:具有多语言语言和议长特点意识的口语模式 2507.18119v1
  • 71 07-24 When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen 当自治时,罗格:准备应对社会系统中多机构串通的风险 2507.14660v2
  • 72 07-24 Agentic AI framework for End-to-End Medical Data Inference Agentisches KI-Framework für Ende-zu-Ende medizinische Datenableitung 端到端医疗数据推断的智能体AI框架 2507.18115v1
  • 73 07-24 A New Pair of GloVes Ein neues Paar GloVes 新的地球之对 2507.18103v1
  • 74 07-24 Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch 长短距离远距神经神经网络和改进课程学习,以在对话中认识情感 2507.15205v2
  • 75 07-24 ELITE: Enhanced Language-Image Toxicity Evaluation for Safety ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit ELITE:加强语言-图像安全毒性评价 2502.04757v3
  • 76 07-24 EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework EducationQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen EducationQ:通过多机构对话框架评价LLMs的教学能力 2504.14928v2
  • 77 07-24 Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen 大语言模式统一调整和统一调整适用:在资源限制下的方法和基准 2507.18076v1
  • 78 07-24 Group Sequence Policy Optimization Optimierung der Gruppensequenzpolitik 组序列政策优化 2507.18071v1
  • 79 07-24 BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference BlockDialect: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz BlockDialect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v5
  • 80 07-24 TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien TELEVAL:为中文互动假想中的口语模式设计的一个动态基准 2507.18061v1
  • 81 07-24 Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias 《LLMM中因果测试性别偏见:职业偏见案例研究》 2212.10678v4
  • 82 07-24 A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle 评估由大语言模型生成的合成数据多面评价框架 2404.14445v2
  • 83 07-24 Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs 使用LLMMs以多种写作风格生成的隐私-保护合成审查 2507.18055v1
  • 84 07-24 From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen 从假设到出版物:AI-Driven研究支助系统综合调查 2503.01424v3
  • 85 07-24 RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models RECALLED: Ein unbegrenzter Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle RECALLED:对大型愿景-语言模型的无约束资源消费攻击 2507.18053v1
  • 86 07-24 Segmentation-free Goodness of Pronunciation Segmentierungsfreie Güte der Aussprache 读音良好 2507.16838v2
  • 87 07-24 Synthetic Data Generation for Phrase Break Prediction with Large Language Model Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell 制作用于大语言模范大语言时段间断预测的合成数据 2507.18044v1
  • 88 07-24 GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs GrAInS:LLMs和VLMs的推论时间指导的逐步归属 2507.18043v1
  • 89 07-24 AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark AIR-Bench:自动异源信息检索基准 2412.13102v4
  • 90 07-24 NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database NeuralDB: Skalierung der Wissensbearbeitung in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank NeuralDB: 利用Neural KV 数据库将LLM 中的知识编辑扩展到 100,000 条事实 2507.18028v1
  • 91 07-24 Technical Report of TeleChat2, TeleChat2.5 and T1 Technischer Bericht von TeleChat2, TeleChat2.5 und T1 TeleChat2、TeleChat2.5和T1技术报告 2507.18013v1
  • 92 07-24 GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen GRR-CoCa:在多模式建模中利用LLM机制 2507.18009v1
  • 93 07-23 (3) Quantifying the Uniqueness and Divisiveness of Presidential Discourse Quantifizierung der Einzigartigkeit und Teilung des Präsidentendiskurses 量化总统意见会的独一性和分散性 2401.01405v2
  • 94 07-23 Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? Breaking Barriers: Gewinnt die Verstärkung von Posttrainings die Übertragung auf ungesehene Domains? 突破障碍:加强培训后收益是否转移到未知领域? 2506.19733v2
  • 95 07-23 Natural Language Processing for Tigrinya: Current State and Future Directions Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen 提格里尼亚的自然语言处理:现状和未来方向 2507.17974v1
  • 96 07-23 LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios LIFBench: Bewertung der Anleitung nach Leistung und Stabilität von großen Sprachmodellen in Langkontextszenarien LIFBench:评价长期设想中大语言模式绩效和稳定性指示 2411.07037v3
  • 97 07-23 Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation Mehrsprachige LLMs sind keine Mehrsprachigkeitsdenker: Belege aus Hindi Analogy Evaluation 多语种LLMs不是多语种思想家:印地语类比评估的证据 2507.13238v2
  • 98 07-23 Are LLM Belief Updates Consistent with Bayes’ Theorem? Sind LLM-Belief-Updates im Einklang mit Bayes’ Theorem? LLM的信念更新符合贝叶斯定理吗? 2507.17951v1
  • 99 07-23 Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text Bewertung der Leistungsfähigkeit von KI-Textdetektoren, wenige-Schuss und Chain-of-Thought-Prompting mit DeepSeek Generated Text 评估AI Text 检测器、很少热和用深搜索生成的催促研究链的文本的性能 2507.17944v1
  • 100 07-23 LLM Alignment as Retriever Optimization: An Information Retrieval Perspective LLM Alignment als Retriever-Optimierung: Eine Informations-Retrieval-Perspektive LLM 对齐作为最佳优化:信息检索视角 2502.03699v3
  • 101 07-23 Analyzing Fairness of Computer Vision and Natural Language Processing Models Analyse der Fairness von Computer Vision und natürlichen Sprachverarbeitungsmodellen 分析计算机视觉和自然语言处理模式的公平性 2412.09900v3
  • 102 07-23 Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation Bob’s Confetti: phonetische Erinnerungsangriffe in Musik und Videogenerierung Bob’s Confetti:音乐和视频生成中的语音记忆攻击 2507.17937v1
  • 103 07-23 One Whisper to Grade Them All Ein Whisper, um sie alle zu bewerten 一次低口低口低口低语到年级 2507.17918v1
  • 104 07-23 Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data Diskriminative Feinsteuerung generativer großer Sprachmodelle ohne Belohnungsmodelle und menschliche Präferenzdaten 对没有奖励模式和人类优先数据、没有奖励模式和人类优先数据的产生大语言模型的产生型大语言模型进行有偏见的微调 2502.18679v3
  • 105 07-23 VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL VeriMinder: Eindämmung analytischer Schwachstellen in NL2SQL VeriMinder:减轻NL2SQL分析脆弱性 2507.17896v1
  • 106 07-23 Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-stark-Jailbreaking bei großen Sprachmodellen 关于大语言模型的弱到强越狱 2401.17256v5
  • 107 07-23 FLEXITOKENS: Flexible Tokenization for Evolving Language Models FLEXITOKENS: Flexible Tokenisierung für sich entwickelnde Sprachmodelle FLEXITOKENS: 不断演变的语言模式灵活化 2507.12720v2
  • 108 07-23 Dynamic and Generalizable Process Reward Modeling Dynamische und generalisierbare Prozess-Reward-Modellierung 动态和可通用流程奖励模型 2507.17849v1
  • 109 07-23 Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning Shop-R1: Belohnende LLMs, um menschliches Verhalten im Online-Shopping durch Verstärkungslernen zu simulieren 商店R1:通过强化学习在网上购物中模拟人类行为奖励LMs 2507.17842v1
  • 110 07-23 Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks Das Vortraining auf dem Testset ist nicht länger alles, was Sie brauchen: Ein debattegetriebener Ansatz zu QA-Benchmarks 有关测试成套标准的培训前培训并非你需要的更长时间:对质量评估基准采取辩论驱动的办法 2507.17747v1
  • 111 07-23 Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains Rubriken als Belohnungen: Verstärktes Lernen jenseits überprüfbarer Domänen ” 奖励 “ :超越可核实域域的强化学习 2507.17746v1
  • 112 07-23 Megrez2 Technical Report Technischer Bericht Megrez2 Megrez2 技术报告 2507.17728v1
  • 113 07-23 AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer KI-Telefonvermessung: Quantitative Datenerfassung mit einem KI-Interviewer automatisieren AI 电话测量:与AI 采访者一起自动化定量数据收集 2507.17718v1
  • 114 07-23 From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes Von Feedback zu Checklisten: Fundierte Bewertung von KI-generierten klinischen Notizen 从反馈到核对表:对AI生成的临床笔记进行基础评价 2507.17717v1
  • 115 07-23 Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding Deep Video Discovery: Agentische Suche mit Tool-Nutzung für Langzeit-Video-Verständnis 深视频发现: 用于远程视频理解的工具的 Agric 搜索 2505.18079v3
  • 116 07-23 TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa TyDi QA-WANA: Ein Benchmark für die Beantwortung von Informationsanfragen in den Sprachen Westasiens und Nordafrikas TyDi QA-WANA:西亚和北非语言信息查询问题回答基准 2507.17709v1
  • 117 07-23 A Mathematical Theory of Discursive Networks Eine mathematische Theorie diskursiver Netzwerke 讨论网络的数学理论 2507.06565v5
  • 118 07-23 LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v2
  • 119 07-23 Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step Können wir Bilder mit CoT generieren? Lassen Sie uns die Bildgenerierung Schritt für Schritt überprüfen und verstärken 我们能用 Cot 生成图像吗? 让我们一步一步地校验和加强图像生成 2501.13926v2
  • 120 07-23 Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries Wer greift an und warum? Mit LLMs negative Kampagnen in 18M Tweets in 19 Ländern identifizieren 利用LLM公司查明18M Tweets 18M Tweets的负面运动,横跨19个国家 2507.17636v1
  • 121 07-23 WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training WSM: Decay-Free Learning Rate Scheduling via Checkpoint Merging für LLM Pre-Training WSM:通过LLM培训前的检查站合并,制定无下降的学习率表 2507.17634v1
  • 122 07-23 Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion Conan:一个零热适应性语音转换的中远在线网络 2507.14534v2
  • 123 07-23 A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE) Hybrider Früh-Exit-Algorithmus für große Sprachmodelle auf Basis von Space Alignment Decoding (SPADE) 以空间调整编码为基础的大语言模型混合早期出界比值(SPADE) 2507.17618v1
  • 124 07-23 Multi-Level Explanations for Generative Language Models Mehrstufige Erklärungen für generative Sprachmodelle 产生语言模式的多层次解释 2403.14459v2
  • 125 07-23 Dual-branch Prompting for Multimodal Machine Translation Dual-Branch Prompting für multimodale maschinelle Übersetzung 多式联运机器翻译的双分支提示 2507.17588v1
  • 126 07-23 GenSelect: A Generative Approach to Best-of-N GenSelect: Ein generativer Ansatz zum Best-of-N GenSelect: Best-of-N 的生成式方法 2507.17797v1
  • 127 07-23 Synthetic Voice Data for Automatic Speech Recognition in African Languages Synthetische Sprachdaten zur automatischen Spracherkennung in afrikanischen Sprachen 非洲语言自动语音识别合成声音数据 2507.17578v1
  • 128 07-23 Fairness Evaluation of Large Language Models in Academic Library Reference Services Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten 学术图书馆参考资料服务大语言模型公平评价 2507.04224v2
  • 129 07-23 BoSS: Beyond-Semantic Speech BoSS: Jenseits semantischer Sprache BoSS:超越语义的语音 2507.17563v1
  • 130 07-23 Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline Auswirkungen von Aufklebern auf multimodale Sentiment und Intent in sozialen Medien: Eine neue Aufgabe, Datensatz und Ausgangslage 贴标签者对社会媒体多式联运和意向的影响:新任务、数据集和基线 2405.08427v2
  • 131 07-23 From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment Von Neuronen zur Semantik: Bewertung der Cross-Linguistic Alignment Fähigkeiten großer Sprachmodelle über Neuronen Alignment 从中世纪到语义学:通过中世纪对齐评估大语言模型的跨语言一致能力 2507.14900v2
  • 132 07-23 Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction Rede als multimodaler digitaler Phenotyp für Multi-Task LLM-basierte psychische Gesundheitsvorhersage 作为多任务LLM基于心理健康预测的多种模式数字哲学型演讲 2505.23822v3
  • 133 07-23 URPO: A Unified Reward & Policy Optimization Framework for Large Language Models URPO: Ein einheitliches Reward & Policy Optimization Framework für große Sprachmodelle URPO:大语言模式统一奖励和政策优化框架 2507.17515v1
  • 134 07-23 DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD DNT: ein tief normalisierter Transformer, der von Momentum SGD trainiert werden kann DNT:一种可用 Momentum SGD 训练的深度归一化 Transformer 2507.17501v1
  • 135 07-23 Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants Lost in Variation? Bewertung der NLI-Performance in baskischen und spanischen geografischen Varianten 迷失在变异中?评价巴斯克语和西班牙语地理变体的NLI性能 2506.15239v2
  • 136 07-23 Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Pre-Training 推进藏藏人大语言模式,提供 “ 扩展数据 “ 和 “ 持续培训前 “ 。 2507.09205v3
  • 137 07-23 MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs MultiNRC: Ein anspruchsvoller und nativer mehrsprachiger Reasoning-Evaluierungs-Benchmark für LLMs MultiNRC:面向LLMs的具有挑战性的原生多语种推理评估基准 2507.17476v1
  • 138 07-23 WAKENLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking WAKENLLM: Bewertung des Reasoning-Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking WAKENLLM: 通过精细基准评估LLMs的推理潜力和稳定性 2507.16199v2
  • 139 07-23 Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis Pseudo-Autoregressive Neural Codec-Sprachenmodelle für effiziente Null-Shot-Text-to-Speech-Synthese 高效零热文本对语音合成的优多-自动递减神经规范语言模型 2504.10352v2
  • 140 07-23 Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration Miipher-2: Ein universelles Sprachrestaurationsmodell für die Millionen-Stunden-Skala-Datenrestauration Miipher-2:百万小时规模数据恢复普遍语音恢复模式 2505.04457v4
  • 141 07-23 A Diagrammatic Calculus for a Functional Model of Natural Language Semantics Ein diagrammatischer Kalkulus für ein funktionelles Modell der natürlichen Sprachsemantik 自然语言语义学功能模型的图表计算 2507.00782v2
  • 142 07-23 MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架 2505.23404v4
  • 143 07-23 Each to Their Own: Exploring the Optimal Embedding in RAG Jeder für sich: Die optimale Einbettung in die RAG erkunden 探索在RAG中以最佳方式嵌入 2507.17442v1
  • 144 07-23 Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging Untersuchte subjektive Faktoren der Streitkraft: Geschichtenerzählen, Emotionen und Hedging 争议力量的主观调查因素: 故事、情感和上下行 2507.17409v1
  • 145 07-23 Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents Millionen von $\text{GeAR}$-s: Erweiterung von GraphRAG auf Millionen von Dokumenten 数百万个$\text{GeAR}$:将GraphRAG扩展至数百万份文件 2507.17399v1
  • 146 07-23 Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis Spracherkennung mit Hilfe der Minkowski-Norm: Identifikation durch Charakter Bigrams und Frequenzanalyse 以Minkowski Norm 手段进行语言探测:通过字符比格和频率分析进行识别 2507.16284v2
  • 147 07-23 Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models Visualisierung von Policy-Reward-Interplay, um Nullth-Order-Preference-Optimierung von großen Sprachmodellen zu informieren 可视化政策回报互动功能,为大语言模型提供零分优先优化信息 2503.03460v2
  • 148 07-23 TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition TransLPRNet: Lite Vision-Language Network für die Single/Dual-Line-Erkennung der chinesischen Lizenzschilde TransLPRNet:中国单线/双线许可证牌照识别利于视觉-语言网络 2507.17335v1
  • 149 07-23 Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies Auf dem Weg zur Erkennbarkeit von Überzeugungen in sozialen Medien: Von der Modellentwicklung zu Erkenntnissen über Überzeugungsstrategien 探索社会媒体的观察:从示范发展到观察社会媒体的观察 2503.13844v2
  • 150 07-23 Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation Lernen, rationale Beweise durch Verstärkungslernen für die retrieval-angereicherte Generation zu extrahieren 学习如何通过为回收-提款一代人加强学习来提取合理证据 2507.15586v2
  • 151 07-23 Cautious Next Token Prediction Vorsichtige nächste Zeichen Vorhersage 谨慎的次下 Tok 预测 2507.03038v2
  • 152 07-23 Is text normalization relevant for classifying medieval charters? Ist die Textnormierung für die Klassifizierung mittelalterlicher Chartas relevant? 文本正常化是否与中世纪宪章的分类相关? 2408.16446v2
  • 153 07-23 Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge Triple X: Ein LLM-basiertes mehrsprachiges Spracherkennungssystem für die INTERSPEECH2025 MLC-SLM Challenge Triple X:面向INTERSPEECH2025 MLC-SLM挑战赛的基于LLM的多语言语音识别系统 2507.17288v1
  • 154 07-23 Tiny language models Kleine Sprachmodelle 微小语言模式 2507.14871v2
  • 155 07-23 Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start Multimodale Reasoning durch verstärktes Lernen mit kaltem Start fördern 通过 “ 冷起 “ 的强化学习推进多模式理由 2505.22334v2
  • 156 07-23 An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning Ein effizientes und präzises Training Data Construction Framework für prozessbeaufsichtigtes Prämienmodell in mathematischer Reasoning 由进程监督的数学理由评分模型的高效率和精确的培训数据构建框架 2503.02382v2
  • 157 07-23 Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs Tab-MIA: Ein Benchmark-Datensatz für Mitgliedschafts-Inferenzangriffe auf Tabellendaten in LLMs Tab-MIA:关于LLMM表列数据的成员推断攻击基准数据集 2507.17259v1
  • 158 07-23 Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models Modality-Aware Neuron Pruning für das Lernen in multimodalen großen Sprachmodellen 多式联运大语言模型中不学习模式-Aware中度中枢 2502.15910v3
  • 159 07-23 Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent Test-Time-Matching: Entkoppelung von Persönlichkeit, Speicher und sprachlichem Stil im LLM-basierten Rollenspiel-Sprachenagenten 测试时间 – – 匹配:以LLM为基础的角色扮演语言代理的分解个性、记忆和语言风格 2507.16799v2
  • 160 07-23 CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings CLARIFID: Verbesserung der Radiologie-Berichtsgenerierung durch Verstärkung klinisch exakter Impressionen und Verstärkung detaillierter Befunde CLARIFID:通过加强临床准确压抑和执行详细调查结果,改进放射学报告的编制工作 2507.17234v1
  • 161 07-23 GTA: Grouped-head latenT Attention GTA: Grouped-head latenT Attention GTA:分组头潜在注意力 2506.17286v2
  • 162 07-23 A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task Ein hochreiner Rezeptdatensatz mit Annotation von Zutatenzuständen für die Zustands-Probing-Aufgabe 用于状态探测任务的带食材状态标注的高纯净食谱数据集 2507.17232v1
  • 163 07-23 The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models Die pluralistische Morallücke: Urteil und Wertunterschiede zwischen Menschen und großen Sprachmodellen verstehen 多元道德差距:了解人类与大语言模式之间的判断和价值差异 2507.17216v1
  • 164 07-23 LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants LEGO Co-Builder: Exploring Fine-Grained Vision-Language Modeling für multimodale LEGO Assembly Assistants LEGO 共同建筑者:为多式LEGO大会助理探索精美的愿景-语言建模 2507.05515v2
  • 165 07-23 AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation AlignDistil: Token-Level-Sprachmodell Alignment als Adaptive Policy Destillation Aligndistil: 作为适应性政策蒸馏的调整级语言模式模型对齐 2503.02832v3
  • 166 07-23 FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance FinGAIA: Ein End-to-End-Benchmark für die Bewertung von KI-Agenten in der Finanzierung FinGAIA: 对AI公司金融代理机构进行评价的端至端基准 2507.17186v1
  • 167 07-23 SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs SKA-Bench: Ein feinkörniger Benchmark zur Bewertung des strukturierten Wissensverständnisses von LLMs SKA-Bench:评估LLM结构化知识理解的细粒度基准 2507.17178v1
  • 168 07-23 Adaptive Graph Pruning for Multi-Agent Communication Adaptives Graph Pruning für Multi-Agent Kommunikation 多机构通信调节图 2506.02951v3
  • 169 07-23 SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script SHARE: Shared Memory-Aware Open-Domain Langzeitdialogdatensatz aus Movie Script SHARE:基于电影剧本构建的共享记忆感知开放域长期对话数据集 2410.20682v3
  • 170 07-23 CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards CogDual: Verbesserung der Dual Cognition von LLMs durch Stärkung des Lernens mit impliziten regelbasierten Belohnungen 认知:通过强化学习,加强LLMs的双重认知,以不隐含规则的奖励加强学习 2507.17147v1
  • 171 07-23 Resona: Improving Context Copying in Linear Recurrence Models with Retrieval Resona: Verbesserung der Kontextkopie in linearen Wiederholungsmodellen mit Retrieval Resona: 改进有检索的线性重复模型中环境复制 2503.22913v3
  • 172 07-22 (2) Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings Evolutionäre Feature-weise Thresholding für Binäre Darstellung von NLP-Embeddings NLP 嵌入器二进制代表制的进化特点 2507.17025v1
  • 173 07-22 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles OpenVLThinker: Komplexe Vision-Sprachen-Reasoning über iterative SFT-RL-Zyklen OpenVLTHinker:通过循环 SFT-RL循环的复杂愿景-语言理由 2503.17352v2
  • 174 07-22 Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? Können externe Validierungstools die Annotationsqualität für LLM-as-a-Judge verbessern? 外部验证工具能否提高LLM-as-a-Judge的批注质量? 2507.17015v1
  • 175 07-22 Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors Multi-Label-Klassifikation mit generativen KI-Modellen im Gesundheitswesen: Eine Fallstudie über Suizidalität und Risikofaktoren 多标签分类,具有产生AI 保健模式的模式:关于自杀性和风险因素的个案研究 2507.17009v1
  • 176 07-22 ORANSight-2.0: Foundational LLMs for O-RAN ORANSight-2.0: LLM-Grundlagen für O-RAN ORANSight-2.0:面向O-RAN的基础LLM 2503.05200v2
  • 177 07-22 Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks Obscured, aber nicht gelöscht: Bewertung von Nationalitäts-Bias in LLMs über namensbasierte Bias-Benchmarks 以名称为依据的Bias基准在LLMs中评估国籍偏见 2507.16989v1
  • 178 07-22 Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain Nutzung synthetischer Daten zur Beantwortung von Fragen mit mehrsprachigen LLMs im landwirtschaftlichen Bereich 利用合成数据在农业领域利用多种语言LLM 解答问题 2507.16974v1
  • 179 07-22 Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning Text-zu-SPARQL geht über das Englische hinaus: Mehrsprachige Fragen beantworten über Wissen Graphen durch von Menschen inspirierte Vernunft 文字到SPARQL 超越英语:通过人类激发的理由解答多语种问题 2507.16971v1
  • 180 07-22 Functionals in the Clouds: An abstract architecture of serverless Cloud-Native Apps Funktionen in den Clouds: Eine abstrakte Architektur serverloser Cloud-Native Apps 云中的功能:无云软件的抽象结构 2105.10362v6
  • 181 07-22 Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs Nutzung von RLHF für robuste Unannehmbarkeitserkennung und vertrauenswürdige Reaktionsgenerierung in LLMs 利用RLHF在LLM中利用RLHF促进强有力的无法回答的承认和可信赖的应对生成 2507.16951v1
  • 182 07-22 3LM: Bridging Arabic, STEM, and Code through Benchmarking 3LM: Arabisch, MINT und Code durch Benchmarking überbrücken 3LM:通过基准确定连接阿拉伯语、STEM和代码 2507.15850v2
  • 183 07-22 AI-based Clinical Decision Support for Primary Care: A Real-World Study KI-basierte klinische Entscheidungsunterstützung für die Primärversorgung: Eine Real-World-Studie 基于AI的初级保健临床决定支持:现实世界研究 2507.16947v1
  • 184 07-22 SiLQ: Simple Large Language Model Quantization-Aware Training SiLQ: Einfaches großsprachiges Modell Quantization-Aware Training SiLQ: 简单大语言模型量化软件培训 2507.16933v1
  • 185 07-22 Modeling Public Perceptions of Science in Media Modellierung öffentlicher Wahrnehmungen von Wissenschaft in Medien 模拟公众对媒体科学的看法 2506.16622v2
  • 186 07-22 A Unifying Scheme for Extractive Content Selection Tasks Ein einheitliches Schema für die Auswahl von extraktiven Inhalten 开采内容选择任务统一办法 2507.16922v1
  • 187 07-22 MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning MegaScience: Die Grenzen von Post-Training-Datensätzen für wissenschaftliche Vernunft sprengen 超科学:推进培训后数据集的前沿,促进科学理性 2507.16812v1
  • 188 07-22 Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty Über Binäre Belohnungen hinaus: LMs zur Vernunft über ihre Ungewissheit ausbilden 超越二元奖励:训练语言模型对其不确定性进行推理 2507.16806v1
  • 189 07-22 Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning Steuerung der Out-of-Distribution-Verallgemeinerung mit Konzeptablation Fine-Tuning 带有 “ 缩算概念 “ 定额概念的 “ 批发外普遍化 “ 指导指导 2507.16795v1
  • 190 07-22 Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning Beyond Context Limits: Unterbewusste Threads für die Long-Horizon Reasoning 超越上下文限制: 长霍氏理由的潜意识线条 2507.16784v1
  • 191 07-22 SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods SenWiCh: Sense-Annotation von Low-Resource-Sprachen für WiC mit Hybrid-Methoden SenWiCh:使用混合方法为WiC任务标注低资源语言的词义 2505.23714v2
  • 192 07-22 GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding GUI-G$^2$:用于GUI定位的高斯奖励建模 2507.15846v2
  • 193 07-22 Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals Unpacking Ambiguity: Die Wechselwirkung von Polysem-Diskursmarkern und Nicht-DM-Signalen 拆包装模糊性:多相相片标记器和非DM信号的相互作用 2507.16748v1
  • 194 07-22 Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Zebra-CoT: Ein Datensatz für interleaved Vision Language Reasoning Zebra-CoT:关于不同视力语言理由的数据集 2507.16746v1
  • 195 07-22 RAVine: Reality-Aligned Evaluation for Agentic Search RAVine: Realitätsorientierte Bewertung für die Agentische Suche RAVine: 化学搜索的现实统一评价 2507.16725v1
  • 196 07-22 Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory Erfahrung ist der beste Lehrer: Erdung von VLMs für Robotik durch selbsterzeugtes Gedächtnis 经验是最好的教师:通过自创记忆,为机器人创造VLMs 2507.16713v1
  • 197 07-22 Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance Advancing Risk and Quality Assurance: Ein RAG Chatbot für verbesserte regulatorische Compliance 提高风险和质量保证:改进监管合规的RAG Chatbot 2507.16711v1
  • 198 07-22 Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM Interpretierbare Themenextraktion und Wort-Embedding Lernen mit zeilenstochastischem DEDICOM 利用行可查的DEDICOM进行可解释专题抽取和单词嵌入学习 2507.16695v1
  • 199 07-22 Universal Model Routing for Efficient LLM Inference Universelle Modellführung für effiziente LLM-Inferenz 高效LLM 推导法通用通用模型规则 2502.08773v2
  • 200 07-22 PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization PICACO: Pluralistische Im-Kontext-Wert-Ausrichtung von LLMs über Gesamtkorrelationsoptimierung PICACO: 通过总关联性优化使LLMs的多元内流价值一致 2507.16679v1
  • 201 07-22 InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification Internagent: Wenn Agent zum Wissenschaftler wird – Gebäude-Closed-Loop-System von der Hypothese bis zur Verifikation 实习生:当探员成为科学家时 – – 建立从假说到核查的闭线系统 2505.16938v3
  • 202 07-22 Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs Selbstwiderspruch als Selbstverbesserung: Die Lücke zwischen Generierung und Verständnis in MLLMs verringern 自相矛盾即自我提升:缩小多模态大语言模型中的生成与理解差距 2507.16663v1
  • 203 07-22 P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs P-CoT: Eine pädagogisch motivierte partizipative Kette von Denkanstößen für phonologische Vernunft in LLMs P-CoT:面向LLM音系推理的具有教学动机的参与式思维链提示 2507.16656v1
  • 204 07-22 Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models Auf dem Weg zu einer automatisierten Überprüfung der regulatorischen Compliance bei der Finanzprüfung mit großen Sprachmodellen 采用大语言模式进行财务审计自动监管合规核查 2507.16642v1
  • 205 07-22 A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1 Eine Methode für die Architektur eines medizinischen vertikalen Großsprachmodells auf Basis von Deepseek R1 基于Deepseek R1的医学垂直大语言模型的架构方法 2505.00025v2
  • 206 07-22 A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis Multi-Granularität Konzept Sparse Aktivierung und Hierarchisches Wissen Graph Fusion Framework für Seltene Krankheiten Diagnose 罕见疾病诊断多发性概念分散活动和等级知识图集融合框架 2507.08529v2
  • 207 07-22 Mangosteen: An Open Thai Corpus for Language Model Pretraining Mangosteen: Ein offener thailändischer Corpus für Sprachmodellvorschulungen Mangosteen: 开放的泰语语言模型泰国公司 2507.14664v2
  • 208 07-22 Hear Your Code Fail, Voice-Assisted Debugging for Python Hören Sie Ihren Code fehlschlagen, Voice-Assisted Debugging für Python 听到您的代码失效, 语音协助调试 Python 的调试 2507.15007v2
  • 209 07-22 Self-Correcting Code Generation Using Small Language Models Selbstkorrekte Code-Generierung mit kleinen Sprachmodellen 使用小型语言模式自行校正代码生成 2505.23060v2
  • 210 07-22 Scaling Linear Attention with Sparse State Expansion Scaling Lineare Aufmerksamkeit mit Sparse State Expansion Sparassar 州扩展时的 缩放线性注意 2507.16577v1
  • 211 07-22 Supernova: Achieving More with Less in Transformer Architectures Supernova: Mit weniger Transformer-Architekturen mehr erreichen 超新星:在变形结构结构中以更少的变形结构实现更大的成就 2507.15773v2
  • 212 07-22 Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models Pixel zu Prinzipien: Intuitive Physik in multimodalen Sprachmodellen verstehen 原则的像素:在多模式语言模型中探明直觉物理理解 2507.16572v1
  • 213 07-22 Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language Gender Bias in großen Sprachmodellen erforschen: Ein tiefer Einblick in die deutsche Sprache 在大语言模式中探索性别偏见:深入跳入德语 2507.16557v1
  • 214 07-22 Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen LLM能生成可靠的测试用例生成器吗?关于竞赛级编程问题的研究 2506.06821v3
  • 215 07-22 Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen 种子-X:利用7B参数建立强有力的多语种翻译LLM 2507.13618v2
  • 216 07-22 Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse 《国际边界风险管理框架实际操作:风险分析技术报告》 2507.16534v1
  • 217 07-22 Learning Text Styles: A Study on Transfer, Attribution, and Verification Lerntextstile: Eine Studie über Transfer, Attribution und Verifizierung 学习教科书样式:关于转让、归属和核查的研究 2507.16530v1
  • 218 07-22 C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung C2-Evo:共同演进的多模式数据和自我改进理由模型 2507.16518v1
  • 219 07-22 Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness Einführung der Qualitätsschätzung in die maschinelle Übersetzung Nachbearbeitung des Workflows: Eine empirische Studie über seine Nützlichkeit 对机器翻译质量进行质量估算,编辑后工作流程:关于其使用经验研究 2507.16515v1
  • 220 07-22 The Ever-Evolving Science Exam Die allgegenwärtige Wissenschaftsprüfung 不断演变的科学考试 2507.16514v1
  • 221 07-22 Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation Sparrow: Dateneffizientes Video-LLM mit Text-zu-Bild-Erweiterung 麻雀:数据有效视频LLM,带有文本到图像放大功能 2411.19951v5
  • 222 07-22 Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics Bewertung der Intermediate Reasoning von Code-Assisted Large Language Models für Mathematik 评价代号协助的数学大语言模型的中间推理 2504.17665v2
  • 223 07-22 Combining Language and Topic Models for Hierarchical Text Classification Kombination von Sprach- und Themenmodellen für die Hierarchische Textklassifikation 将等级文字分类的语言和专题模式相结合 2507.16490v1
  • 224 07-22 ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs ICR-Probe: Verborgene Zustandsdynamiken für zuverlässige Halluzinationserkennung in LLMs verfolgen ICR Probe:跟踪隐藏状态动态,以便用LLMs进行可靠的幻觉探测 2507.16488v1
  • 225 07-22 Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation Typed-RAG: Type-Aware Zersetzung von nicht-Faktoiden Fragen für retrieval-Augmented Generation 型式RAG: 用于回收-提款一代的非实物问题类型软件分解 2503.15879v3
  • 226 07-22 ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension ReMeREC: Beziehungsbewusste und Multi-Entity-Bezug auf Expression-Verständnis ReMEREC: 关系意识和多实体参考表达式理解 2507.16877v1
  • 227 07-22 Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Gemini 2.5: Die Grenzen verschieben mit fortschrittlichem Reasoning, Multimodalität, langem Kontext und agentischen Fähigkeiten der nächsten Generation Gemini 2.5:以先进推理、多模态、长上下文和下一代智能体能力推进前沿 2507.06261v4
  • 228 07-22 MMS Player: an open source software for parametric data-driven animation of Sign Language avatars MMS Player: eine Open-Source-Software für parametrische datengesteuerte Animation von Sign Language Avataren MMS MMS 播放器: 一个用于模拟数据驱动的手语阿凡达动画的开放源码软件 2507.16463v1
  • 229 07-22 Towards Enforcing Company Policy Adherence in Agentic Workflows Auf dem Weg zur Stärkung der unternehmenspolitischen Einhaltung von Agent-Workflows 致力于加强公司政策,坚持对制剂性工作流程的政策 2507.16459v1
  • 230 07-22 Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch Dutch CrowS-Pairs: Anpassung eines Challenge Datasets zur Messung sozialer Biasen in Sprachmodellen für Niederländisch 荷兰语人群对称:调整一套挑战数据集,以衡量荷兰语语言模式中的社会两边状况 2507.16442v1
  • 231 07-22 HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing HausaNLP: Aktueller Status, Herausforderungen und Zukunftsrichtung für Hausa Natural Language Processing 豪萨民族语言:豪萨自然语言处理的现状、挑战和未来方向 2505.14311v3
  • 232 07-22 Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models Hierarchische Sicherheits-Neuausrichtung: Leichte Wiederherstellung der Sicherheit in beschnittenen großen Vision-Sprachen-Modellen 等级安全调整:谨慎大型视觉语言模型中轻度安全恢复 2505.16104v2
  • 233 07-22 Atomic Calibration of LLMs in Long-Form Generations Atomkalibrierung von LLMs in langen Generationen 长代人长龄人LLMs的原子校准 2410.13246v2
  • 234 07-22 Synthetic Data Generation Using Large Language Models: Advances in Text and Code Synthetische Datengenerierung mit großen Sprachmodellen: Fortschritte in Text und Code 使用大语言模式生成合成数据:文本和代码的进步 2503.14023v2
  • 235 07-22 Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study Beyond English: Bewertung der automatisierten Messung von Moralfundamenten im nicht-englischen Diskurs mit einer chinesischen Fallstudie 英文之后:评价非英语论文中道德基础的自动计量,与中国案例研究 2502.02451v3
  • 236 07-22 PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning PromptAL: Sample-Aware Dynamische Soft-Prompts für wenig heißes aktives Lernen 提示: 用于少点热积极学习的样本- 软件动态软提示 2507.16424v1
  • 237 07-22 GG-BBQ: German Gender Bias Benchmark for Question Answering GG-BBQ: Deutscher Gender-Bias-Benchmark für Fragenbeantwortung GG-BBQ:面向问答的德语性别偏见基准 2507.16410v1
  • 238 07-22 Routine: A Structural Planning Framework for LLM Agent System in Enterprise Routine: Ein Strukturplanungsrahmen für LLM Agent System in Unternehmen 常规:企业LLM代理系统结构规划框架 2507.14447v2
  • 239 07-22 Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model Multimodale Vorhersage von Sparse Intraoperativen Hypotonieereignissen durch Sprachmodell 以语言模式为动力的草散的不合作和不连续活动多式预报 2505.22116v3
  • 240 07-22 Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts Autonome Datenauswahl mit Zero-shot Generative Klassifikatoren für mathematische Texte 具有数学文本零光生成分类器的自动数据选择 2402.07625v7
  • 241 07-22 Physical models realizing the transformer architecture of large language models Physikalische Modelle, die die Transformatorenarchitektur großer Sprachmodelle realisieren 实现大型语言模型变压器结构的物理模型 2507.13354v2
  • 242 07-22 Data Processing for the OpenGPT-X Model Family Datenverarbeitung für die OpenGPT-X Modellfamilie OpenGPT-X模式家庭数据处理 2410.08800v3
  • 243 07-22 DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph DCG-SQL: Verbesserung des In-Context-Lernens für Text-zu-SQL mit Deep Contextual Schema Link Graph DCG-SQL:加强内文学习,以便用深背景图示链接图进行文字到SQL的内文学习 2505.19956v2
  • 244 07-22 LLMs syntactically adapt their language use to their conversational partner LLMs passen ihre Sprachnutzung syntaktisch an ihren Gesprächspartner an LLM在句法上调整其语言使用以适应对话伙伴 2503.07457v2
  • 245 07-22 X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display X-Intelligence 3.0: Schulung und Bewertung von Reasoning-LLM für Halbleiter-Displays X-Intelligence 3.0:面向半导体显示的推理LLM的训练与评估 2507.14430v2
  • 246 07-22 Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny Re:Form – Reduzierung menschlicher Priore bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny Re:Form – 在基于RL的LLM可扩展形式化软件验证中减少人类先验:关于Dafny的初步研究 2507.16331v1
  • 247 07-22 SpeLLM: Character-Level Multi-Head Decoding SpeLLM: Charakter-Level-Multi-Head-Dekodierung SpeLLM: 职务级别多负责人解码 2507.16323v1
  • 248 07-22 WhatsApp Tiplines and Multilingual Claims in the 2021 Indian Assembly Elections WhatsApp Tiplines und mehrsprachige Behauptungen bei den Wahlen zur indischen Versammlung 2021 2021年印度议会选举中的WhatsApp举报热线与多语种声明 2507.16298v1
  • 249 07-22 Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction Jenseits isolierter Punkte: Benchmarking strukturierter Tabellenkonstruktion als Vertiefung der Wissensextraktion 超越孤立的点:将结构化表格构建作为深度知识抽取进行基准测试 2507.16271v1
  • 250 07-22 iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss iShumei-Chinchunmei bei SemEval-2025 Task 4: Ein ausgewogenes Multi-Task-Framework für Vergessen und Retention mit effektivem Lernverlust SemEval-2025任务4:利用有效的不学习损失,平衡地忘记和保留多任务框架 2507.16263v1
  • 251 07-22 Efficient RL for optimizing conversation level outcomes with an LLM-based tutor Effizienter RL zur Optimierung der Gesprächsergebnisse mit einem LLM-basierten Tutor 与一个以LLM为主的辅导员进行高效RL,以优化对话级别成果 2507.16252v1
  • 252 07-22 FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents FinResearchBench: Ein auf Logic Tree basierender Agent-as-a-Richter-Evaluierungsrahmen für Finanzforschungsagenten 金融研究时间:基于逻辑树的金融研究代理评估框架 2507.16248v1
  • 253 07-22 MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment MPO: Ein effizientes Post-Processing-Framework zum Mischen unterschiedlicher Präferenzen MPO: 混合多种优惠协调的高效处理后框架 2502.18699v3
  • 254 07-22 Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing Das Heilige modellieren: Überlegungen bei der Verwendung von religiösen Texten in der natürlichen Sprachverarbeitung 示范神圣:在自然语言处理中使用宗教文字时的考虑 2404.14740v3
  • 255 07-22 Hierarchical Budget Policy Optimization for Adaptive Reasoning Hierarchische Budgetpolitik Optimierung für adaptives Reasoning 适应性合理理由的等级预算政策优化 2507.15844v2
  • 256 07-22 Towards Compute-Optimal Many-Shot In-Context Learning Auf dem Weg zu einem rechnerisch-optimalen, viel scharfen In-Context-Lernen 迈向计算最优化的多个热点内文体学习 2507.16217v1
  • 257 07-22 Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models Promptomatix: Ein automatisches Optimierungs-Framework für große Sprachmodelle 即时表达式:大语言模型自动快速优化框架 2507.14241v2
  • 258 07-22 Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models Prompt4Trust: Ein Verstärkungs-Learning Prompt Augmentation Framework für klinisch ausgerichtete Vertrauenskalibrierung in multimodalen großen Sprachmodellen 提示4信任:在多式大语言模式中加强学习学习,促进临床一致信心校正的快速增强框架 2507.09279v3
  • 259 07-22 Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task Haben große Sprachmodelle eine Planungstheorie des Geistes? Beweise von MindGames: eine mehrstufige Überzeugungsaufgabe 大语言模型是否具有规划思维理论?来自MindGames的证据:多功能透析任务 2507.16196v1
  • 260 07-22 SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior SciFi-Benchmark: Leveraging Science Fiction zur Verbesserung des Roboterverhaltens SciFi-基准:利用科学信条改进机器人行为 2503.10706v2
  • 261 07-22 SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment SAGE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment SAGE:通过事实增强和熵感知对齐进行异常检测的视觉语言模型 2507.07939v2
  • 262 07-22 Characterizing Online Activities Contributing to Suicide Mortality among Youth Charakterisieren von Online-Aktivitäten, die zur Selbstmordsterblichkeit unter Jugendlichen beitragen 确定造成青年自杀死亡率的在线活动 2507.16185v1
  • 263 07-22 BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset BIDWESH: Ein auf Bangla basierender Hass-Spracherkennungs-Datensatz BIDWESH:孟加拉地区基于孟加拉的仇恨言论检测数据集 2507.16183v1
  • 264 07-22 R-Bot: An LLM-based Query Rewrite System R-Bot: Ein LLM-basiertes Abfrage-Rewrite-System R-Bot:一个基于LLM的查询重写系统 2412.01661v2
  • 265 07-22 Reasoning Does Not Necessarily Improve Role-Playing Ability Vernunft verbessert nicht unbedingt die Fähigkeit zum Rollenspiel 理由并不必然改善发挥作用的能力 2502.16940v2
  • 266 07-22 SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting SpiroLLM: Feinabstimmung vortrainierter LLMs zum Verständnis von Spirogramm-Zeitreihen mit klinischer Validierung im COPD-Reporting SpiroLLM:微调预训练LLM以理解肺活量图时间序列,并在COPD报告中进行临床验证 2507.16145v1
  • 267 07-22 L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models L4Q: Parameter Effiziente Quantisierungsware Feinsteuerung bei großen Sprachmodellen L4Q:大语言模型参数有效量化-软件精美推荐 2402.04902v6
  • 268 07-22 Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v3
  • 269 07-22 Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition Generative Zeichenbeschreibung Prompts mit multi-positivem Kontrastivem Lernen für die Erkennung von Zeichensprachen 多积极的手语识别多反比学习生成手语识别信号描述提示 2505.02304v2
  • 270 07-21 (1) Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education Menschliche Empathie als Encoder: KI-Assisted Depression Assessment in Special Education 人类共情作为编码器:特殊教育中AI辅助的抑郁症评估 2505.23631v2
  • 271 07-21 Pixels, Patterns, but No Poetry: To See The World like Humans Pixel, Muster, aber keine Poesie: Die Welt wie Menschen zu sehen 像素、图案、但没有诗歌:像人类一样看世界 2507.16863v1
  • 272 07-21 Efficient Compositional Multi-tasking for On-device Large Language Models Effizientes kompositorisches Multi-Tasking für On-Device große Sprachmodelle 内部设计大型语言模型的高效组成多任务 2507.16083v1
  • 273 07-21 Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder Erforschen, wie Generative MLLMs mehr als CLIP mit dem gleichen Vision Encoder wahrnehmen 使用相同的愿景编码器探索如何产生比 CLIP 更远的多见性大型LLMs 2411.05195v3
  • 274 07-21 The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models Die Aufforderung macht die Person(a): Eine systematische Bewertung der soziodemographischen Persona, die für große Sprachmodelle aufruft 《迅速使人成为人》(a):系统评价社会人口人口人a 《激发大语言模式》 2507.16076v1
  • 275 07-21 Deep Researcher with Test-Time Diffusion Deep Researcher mit Test-Time Diffusion 具有试验时间扩散的深层研究员 2507.16075v1
  • 276 07-21 Erasing Conceptual Knowledge from Language Models Auslöschen von konzeptionellen Kenntnissen aus Sprachmodellen 将概念知识从语言模式中除去 2410.02760v3
  • 277 07-21 AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering AutoMeet: eine Proof-of-Concept-Studie von GenAI zur Automatisierung von Meetings in der Automobiltechnik AutoMeet:对genAI进行概念证明研究,以使汽车工程会议自动化 2507.16054v1
  • 278 07-21 Continuously Updating Digital Twins using Large Language Models Kontinuierliche Aktualisierung von digitalen Zwillingen mit großen Sprachmodellen 利用大语言模式不断更新数字双双 2506.12091v2
  • 279 07-21 mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages mRAKL:多种语文检索增强的低资源语言知识图构建 2507.16011v1
  • 280 07-21 Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy Risiken von KI-Wissenschaftlern: Priorisierender Schutz vor Autonomie AI 科学家的风险:将保障自治作为优先事项 2402.04247v5
  • 281 07-21 Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback Helfen Sie mir, eine Geschichte zu schreiben: Bewertung der Fähigkeit von LLMs, Schreiben Feedback zu generieren 帮助我写一个故事:评估LLMS的生成写作反馈的能力 2507.16007v1
  • 282 07-21 Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme Agent KB: 利用跨域经验解决代理问题 2507.06229v4
  • 283 07-21 Learning without training: The implicit dynamics of in-context learning Lernen ohne Ausbildung: Die implizite Dynamik des In-Context-Lernens 缺乏培训的学习:内通性学习的隐含动态 2507.16003v1
  • 284 07-21 Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation Hindi NER im niedrigen Kontext verbessern: Eine vergleichende Studie von Transformer-basierten Modellen mit vs. ohne Retrieval Augmentation 在低背景情况下加强印地语净净值:对以变换器为基础的模型的比较研究,与不回收增量的对比 2507.16002v1
  • 285 07-21 Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen Omni-Router: 分享语音识别专家的松散混集决定 2507.05724v2
  • 286 07-21 Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track Lehren aus dem TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) Track TREC 生物医学摘要(PLABA)平语言适应(PLABA)轨道的经验教训 2507.14096v2
  • 287 07-21 The Impact of Language Mixing on Bilingual LLM Reasoning Die Auswirkungen des Sprachmixens auf die zweisprachige LLM-Reasoning 语言混合对双语LLM理由解释的影响 2507.15849v1
  • 288 07-21 A Survey of Context Engineering for Large Language Models Eine Übersicht über Kontext-Engineering für große Sprachmodelle 大语言模型背景工程调查 2507.13334v2
  • 289 07-21 Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work Operationalisierung von KI für das Gute: Fokussierung auf Einsatz und Integration von KI-Modellen in humanitäre Arbeit AI向善的落地:聚焦AI模型在人道主义工作中的部署与整合 2507.15823v1
  • 290 07-21 Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning Kleine LLMs lernen keine verallgemeinerbare Theorie des Geistes durch Verstärkungslernen 小型LLM无法通过强化学习习得可泛化的心智理论 2507.15788v1
  • 291 07-21 DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v3
  • 292 07-21 Reservoir Computing as a Language Model Reservoir Computing als Sprachmodell 作为语言模式的 “ 储量计算 “ 模式 2507.15779v1
  • 293 07-21 Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR Stabilisierung von Wissen, Förderung von Vernunft: Dual-Token-Beschränkungen für RLVR 稳定知识,促进推理:RLVR的双词元约束 2507.15778v1
  • 294 07-21 Dissociating model architectures from inference computations Trennen von Modellarchitekturen von Inferenzberechnungen 将模型结构与推断计算分离 2507.15776v1
  • 295 07-21 KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education? KnowShiftQA: Wie robust sind RAG-Systeme, wenn Textbook Knowledge Shifts in K-12 Education? K-12教育中教科书知识转移时RAG系统如何强大? 2412.08985v4
  • 296 07-21 Interaction as Intelligence: Deep Research With Human-AI Partnership Interaktion als Intelligenz: Tiefe Forschung mit Mensch-KI-Partnerschaft 作为情报的互动:与人类 – – AI伙伴关系的深入研究 2507.15759v1
  • 297 07-21 LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization LAPO: Internalisierung der Effizienz durch Längen-Anpassungspolitik-Optimierung LAPO:通过延长期限政策优化实现内部合理性效率 2507.15758v1
  • 298 07-21 DialogueForge: LLM Simulation of Human-Chatbot Dialogue DialogueForge: LLM Simulation des Mensch-Chatbot-Dialogs DialogueForge:人类与聊天机器人对话的LLM模拟 2507.15752v1
  • 299 07-21 Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models Steuerung in neue Einbettungsräume: Analyse der Cross-Lingual Alignment Induziert durch Modellinterventionen in mehrsprachigen Sprachmodellen 指导进入新嵌入空间:分析多语文模式示范干预措施所引出的不同语言之间的横向一致 2502.15639v2
  • 300 07-21 Where Do People Tell Stories Online? Story Detection Across Online Communities Wo erzählen Menschen Geschichten online? Story Detection Across Online Communities 人们在哪里在网上讲述故事?跨在线社区的故事检测 2311.09675v4
  • 301 07-21 Towards physician-centered oversight of conversational diagnostic AI Auf dem Weg zur ärztlichen Aufsicht über gesprächsdiagnostische KI 致力于以医生为中心对谈话诊断进行监督 AI 2507.15743v1
  • 302 07-21 A Fisher’s exact test justification of the TF-IDF term-weighting scheme Eine Begründung des TF-IDF-Termgewichtungsschemas durch Fishers exakten Test 基于Fisher精确检验的TF-IDF词项加权方案论证 2507.15742v1
  • 303 07-21 Understanding Large Language Models’ Ability on Interdisciplinary Research Verständnis der Fähigkeit von großen Sprachmodellen zur interdisziplinären Forschung 了解关于跨学科研究的大型语言模型能力 2507.15736v1
  • 304 07-21 BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning Benchmarking LLMs für Ophthalmologie (BELO) für ophthalmologisches Wissen und Vernunft 眼科学LLM基准评测(BELO):眼科知识与推理 2507.15717v1
  • 305 07-21 From Queries to Criteria: Understanding How Astronomers Evaluate LLMs Von Fragen zu Kriterien: Wie Astronomen LLMs bewerten 从询问到标准:了解天文学家如何评价LLMs 2507.15715v1
  • 306 07-21 Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning Chinchunmei bei SemEval-2025 Aufgabe 11: Erhöht die Fähigkeit des großen Sprachmodells zur Wahrnehmung von Emotionen durch kontrastives Lernen Chinchunmei在SemEval-2025任务11:利用差异学习促进大语言模式情感感知能力 2507.15714v1
  • 307 07-21 Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents Groß gewinnen mit kleinen Modellen: Wissensdestillation vs. Selbsttraining zur Reduktion der Halluzination in Produkt-QA-Agenten 以小型模型赢得大奖:知识蒸馏与减少产品质量保证剂中幻觉的自我培训 2502.19545v2
  • 308 07-21 Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked? Wird die Leistungsfähigkeit eines großen Sprachmodells bei mit Gründen versehenen Aufgaben durch verschiedene Wege beeinflusst Fragen werden gestellt? 问到不同方式的问题是否影响到大语言解释任务示范业绩? 2507.15707v1
  • 309 07-21 Compositional Understanding in Signaling Games Kompositionales Verständnis bei Signalspielen 信号运动会的组成理解 2507.15706v1
  • 310 07-21 CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models CoLD: Counterfactually-Führungslängen-Debiasing für Prozess-Reward-Modelle CoLD: 反事实引导进程奖励模型的长度偏差 2507.15698v1
  • 311 07-21 Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language Verbesserung der natürlichen Sprachinferenzleistung mit Wissensdiagramm für COVID-19 Automatisiertes Fact-Checking in indonesischer Sprache 以印度尼西亚语自动进行事实调查的COVID-19 自动调查印度尼西亚语知识图,提高自然语言引文性能 2409.00061v2
  • 312 07-21 Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems Ausführbare Funktionsabstractions: Ausleiten von Generativen Programmen für fortgeschrittene Math-Probleme 可执行的功能性抽象:为高级数学问题推导产生方案 2504.09763v2
  • 313 07-21 P3: Prompts Promote Prompting P3: Prompts fördern Prompting P3: 推动推动推动 2507.15675v1
  • 314 07-21 Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains Aufmerksamkeit bei Markov: Ein Rahmen für die grundsätzliche Analyse von Transformatoren über Markov Ketten 注意Markov:通过Markov 链条对变形器进行原则分析的框架 2402.04161v2
  • 315 07-21 Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark Tokenization Standards for Linguistic Integrity: Türkisch als Benchmark 语言完整性的接受标准:土耳其作为基准 2502.07057v2
  • 316 07-21 Leveraging Context for Multimodal Fallacy Classification in Political Debates Nutzung des Kontexts für multimodale Fehlerklassifizierung in politischen Debatten 在政治辩论中利用多模式误差分类背景 2507.15641v1
  • 317 07-21 Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training Data Mixing Agent: Erlernen von Re-Gewicht Domains für kontinuierliches Pre-Training 数据混合代理: 学习为连续培训前学习重新加权域域 2507.15640v1
  • 318 07-21 Preventing Rogue Agents Improves Multi-Agent Collaboration Verhindern von Rogue-Agenten verbessert Multi-Agenten-Kollaboration 防止失控智能体可改进多智能体协作 2502.05986v2
  • 319 07-21 Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis Effiziente Angriffsuntersuchung mittels Human-in-the-Loop Sicherheitsanalyse ermöglichen 通过 “ 现场人 “ 系统安全分析,促进高效袭击调查 2211.05403v3
  • 320 07-21 CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization CCSBench: Bewertung der kompositorischen Kontrollierbarkeit in LLMs für wissenschaftliche Dokumentzusammenfassung CCSBENCH:评估科学文件摘要中LLMs中的组成可控性 2410.12601v2
  • 321 07-21 Conflicting narratives and polarization on social media Widersprüchliche Narrative und Polarisierung in den sozialen Medien 社交媒体的矛盾叙述和两极分化 2507.15600v1
  • 322 07-21 Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente 恢复文化遗产:全面恢复历史文件的新办法 2507.05108v2
  • 323 07-21 Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging Smart Eyes für Silent Threats: VLMs und In-Context Learning für THz Imaging 静默威胁的 “ 聪明的眼睛 “ :VLMs和THz成像的内书学习 2507.15576v1
  • 324 07-21 clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations clem:todd: Ein Rahmen für die systematische Benchmarking von LLM-basierten, auf Aufgaben ausgerichteten Dialogsystem-Realisierungen 模块:基于LLM的以任务为导向的对话系统实现情况系统基准化框架 2505.05445v2
  • 325 07-21 Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification Textübertragung bewerten: Ein Neun-Sprachen-Benchmark für Textentgiftung 评价文本样式转让:文本解毒九语言基准 2507.15557v1
  • 326 07-21 Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems Mehr tun mit weniger: Eine Umfrage über Routing-Strategien zur Ressourcenoptimierung in großsprachlichen modellbasierten Systemen 少花钱多办事:关于大语言示范系统资源优化区域战略的调查 2502.00409v3
  • 327 07-21 KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan KazMMLU: Bewertung von Sprachmodellen zu kasachischen, russischen und regionalen Kenntnissen Kasachstans KazMMLU:评估哈萨克斯坦哈萨克语、俄语和区域知识的语言模式 2502.12829v2
  • 328 07-21 Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks Beeinflussen Emotionen wirklich die Überzeugung von Argumenten? Ein dynamischer Ansatz mit LLM-basierten Manipulationsprüfungen 情感真的会真的影响竞价说服力吗? 使用基于 LLM 的操纵测试的动态方法 2503.00024v2
  • 329 07-21 Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models Schritt-Level-Verifier-geführte Hybrid-Test-Time-Skalierung für große Sprachmodelle 大语言模型的逐步一级核证人-制导大语言模型混合试验-时间缩放 2507.15512v1
  • 330 07-21 Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback Off-Policy korrigierte Prämienmodellierung für verstärktes Lernen aus menschlichem Feedback 利用人类反馈加强学习的非政策纠正奖励模型 2507.15507v1
  • 331 07-21 ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution ASPERA: Eine simulierte Umgebung, um Planung für komplexe Aktionen zu bewerten ASPERA:评估复杂行动执行规划的模拟环境 2507.15501v1
  • 332 07-21 OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning OMoE: Diversifizierende Mischung aus Low-Rank-Anpassung durch Orthogonal Finetuning OMoE:通过矫形微调使低Rank适应混合体多样化 2501.10062v2
  • 333 07-21 KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink 技术报告 2507.08297v3
  • 334 07-21 DARE: Diverse Visual Question Answering with Robustness Evaluation DARE: Diverse visuelle Fragebeantwortung mit Robustheitsbewertung DARE: 以强力评价回答多种视觉问题 2409.18023v2
  • 335 07-21 STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning STUN: Strukturierte und dann unstrukturierte Pruning für skalierbare MoE Pruning STUN: 面向可扩展MoE剪枝的先结构化后非结构化剪枝 2409.06211v2
  • 336 07-21 End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data End-to-End-Gemeinsame Pünktliche und Normalisierte ASR mit einer begrenzten Menge an Pünktlichen Trainingsdaten 配有数量有限的点对培训数据的点对端联合标点和正常化的ASR 2311.17741v3
  • 337 07-21 Entity-aware Cross-lingual Claim Detection for Automated Fact-checking Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking 用于自动实况调查的有实体意识的跨语言交叉索赔调查 2503.15220v4
  • 338 07-21 KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell KaLM-Embedding-V2:卓越的训练技术和数据造就通用嵌入模型 2506.20923v2
  • 339 07-21 AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming AlgoSimBench: Algorithmisch ähnliche Probleme für wettbewerbsfähige Programmierung identifizieren AlgoSimBench:为竞赛编程识别算法上相似的问题 2507.15378v1
  • 340 07-21 MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs MKE-Coder: Multi-Axial-Wissen mit Evidenzverifizierung bei ICD-Coding für chinesische EMRs MKE-编码器:中文EMR的ICD编码中多轴知识与证据核查的多轴知识 2502.14916v3
  • 341 07-21 STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models STITCH: Gleichzeitiges Denken und Sprechen mit Chunked Reasoning für gesprochene Sprachmodelle STITCH: 面向口语语言模型的分块推理式同步思考与说话 2507.15375v1
  • 342 07-21 Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation Meta4XNLI: Ein Crosslingual Parallel Corpus für die Erkennung und Interpretation von Metaphoren Meta4XNLI: 用于识别和解释代名词的跨语言平行体 2404.07053v3
  • 343 07-21 Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding Metaphorische und große Sprachmodelle: Wenn Oberflächenmerkmale mehr ausmachen als tiefes Verständnis 名词和大语言模型:当地表地貌特征比深了解更重要时 2507.15357v1
  • 344 07-21 ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events ChronoSense: Erforschen des zeitlichen Verständnisses in großen Sprachmodellen mit Zeitintervallen von Ereignissen ChronoSense:探索具有时际事件间隔的大型语言模型中的时间理解 2501.03040v2
  • 345 07-21 Probing Information Distribution in Transformer Architectures through Entropy Analysis Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse 通过 Entropy 分析在变形结构中进行测试信息发布 2507.15347v1
  • 346 07-21 LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators LionGuard 2: Leichte, dateneffiziente und lokalisierte Mehrsprachige Inhaltsmoderatoren bauen 狮子座标2:轻量、数据效率和本地化多语种内容主持人 2507.15339v1
  • 347 07-21 Reasoning Models are Test Exploiters: Rethinking Multiple-Choice Reasoning Models sind Testexploiter: Multi-Choice neu denken 说明理由的模型是实验性剥削者:重新思考多选择 2507.15337v1
  • 348 07-21 Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Mixture-of-Recursions: Dynamische Rekursive Tiefen für adaptive Token-Level-Computation lernen 混合流流流:学习适应调控级计算法的动态回流深度 2507.10524v2
  • 349 07-21 On the Inevitability of Left-Leaning Political Bias in Aligned Language Models Zur Unvermeidlichkeit linksleanender politischer Bias in gerichteten Sprachmodellen 关于采用统一语言模式的左倾政治偏见的不可避免的问题 2507.15328v1
  • 350 07-21 Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models Katzen verunsichern LLM: Abfrage von Agnostiker-Adversarial-Triggern für vernunftbewusste Modelle 猫会迷惑推理LLM: 面向推理模型的查询无关对抗触发器 2503.01781v2
  • 351 07-21 ACEBench: Who Wins the Match Point in Tool Usage? ACEBench: Wer gewinnt den Match Point in der Werkzeugnutzung? ACEBench:谁在工具使用中赢得了匹配点? 2501.12851v6
  • 352 07-21 FastMCTS: A Simple Sampling Strategy for Data Synthesis FastMCTS: Eine einfache Probenahmestrategie für die Datensynthese FastMCTS:数据综合简单抽样战略 2502.11476v2
  • 353 07-21 Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection Beyond Easy Wins: Ein Text Hardness-Aware Benchmark für LLM-generierte Texterkennung 超越简单赢:LLM生成的文本检测的文本硬度软件基准 2507.15286v1
  • 354 07-21 A Novel Self-Evolution Framework for Large Language Models Ein neuartiges Selbst-Evolution-Rahmenwerk für große Sprachmodelle 大语言模式新自演框架 2507.15281v1
  • 355 07-21 ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling ChiMed 2.0: Fortschrittlicher chinesischer medizinischer Datensatz bei der Erleichterung des großen Sprachmodellierens 切米德2.0:推进中国医疗数据集,促进大语言建模 2507.15275v1
  • 356 07-21 A2TTS: TTS for Low Resource Indian Languages A2TTS: TTS für ressourcenarme indische Sprachen A2TTS: 低资源印度语言TTS 2507.15272v1
  • 357 07-21 GREAT: Guiding Query Generation with a Trie for Recommending Related Search about Video at Kuaishou GREAT: Steuerung der Query-Generierung mit einem Trie zur Empfehlung verwandter Suchanfragen zu Videos bei Kuaishou GREAT:在快手利用字典树引导查询生成以推荐视频相关搜索 2507.15267v1
  • 358 07-21 Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models Visually Guided Decoding: Gradient-Free Hard Prompt Inversion mit Sprachmodellen 视觉导导解码: 带语言模型的逐步无限制硬快速翻版 2505.08622v2
  • 359 07-21 Commonsense Reasoning in Arab Culture Commonsense Vernunft in der arabischen Kultur 阿拉伯文化中的常识理由 2502.12788v2
  • 360 07-21 SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest SOI Matters: Analyse von Multi-Setting-Trainingsdynamiken in vorgebildeten Sprachmodellen über Teilmengen von Interesse SOI事项:分析通过利益子集分析培训前语言模式中多设置培训动态 2507.15236v1
  • 361 07-21 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning Search-R1: LLMs zu Grund und Hebel-Suchmaschinen mit Verstärkungs-Lernen 搜索R1:培训 “ 理性与利用搜索引擎与强化学习 “ 培训LLMS 2503.09516v4
  • 362 07-21 Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models PTSD in klinischen Interviews erkennen: Eine vergleichende Analyse von NLP-Methoden und großen Sprachmodellen 临床访谈中检测创伤后创伤后精神紧张症:国家语言规划方法和大语言模式的比较分析 2504.01216v2
  • 363 07-21 Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems Ausnutzen von kontextabhängigen Dauerfunktionen für Sprachanonymisierungs-Angriffsysteme 利用语音匿名攻击系统视具体情况而定的期间特征 2507.15214v1
  • 364 07-21 Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment Collaborative Destillationsstrategien für parametereffiziente Sprachmodell-Einsatz 辅助计量有效语言模式部署的协作性静修战略 2507.15198v1
  • 365 07-21 Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles Hierarchische Prompting Taxonomie: Ein universeller Evaluationsrahmen für große Sprachmodelle, ausgerichtet auf menschliche Kognitive Prinzipien 符合人类认知原则的大语言模式普遍评价框架 2406.12644v5
  • 366 07-21 Empowering LLMs with Logical Reasoning: A Comprehensive Survey Stärkung von LLMs mit logischer Begründung: Eine umfassende Umfrage 赋予LLMs以逻辑理由:全面调查 2502.15652v4
  • 367 07-20 (7) What Level of Automation is “Good Enough”? A Benchmark of Large Language Models for Meta-Analysis Data Extraction Welche Stufe der Automatisierung ist “Gut genug”? Ein Benchmark für große Sprachmodelle für die Meta-Analyse-Datenextraktion 自动化的等级是“好到好”? 元分析数据提取大语言模式的基准 2507.15152v1
  • 368 07-20 A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge’ez Script Ein Fall gegen implizite Standards: Homophone Normalisierung in maschineller Übersetzung für Sprachen, die das Ge’ez Script verwenden 反对隐含标准案:使用盖兹文稿的语文机器翻译中同声传译正常化 2507.15142v1
  • 369 07-20 A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation Ein semantisch-basierter Optimierungsansatz zur Reparatur von LLMs: Fallstudie zur Codegenerierung 修复LLM的基于语义的优化方法:关于代码生成的案例研究 2503.12899v3
  • 370 07-20 From Disagreement to Understanding: The Case for Ambiguity Detection in NLI Von der Uneinigkeit zum Verständnis: Der Fall für Ambiguitätserkennung in NLI 从分歧到理解:国家调查局的模糊性探测案例 2507.15114v1
  • 371 07-20 Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference? Füllen der Lücke: Ist Commonsense Knowledge Generation nützlich für die natürliche Sprachinferenz? 填补空白:创造常识知识对自然语言推论有用吗? 2507.15100v1
  • 372 07-20 Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models Nur ein wenig nach links: Ein theoriebasiertes Maß politischer Bias in großen Sprachmodellen 仅向左一小点:大语言模式中政治偏见的理论依据度量 2503.16148v2
  • 373 07-20 A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations Eine Strafe geht einen langen Weg: Lexikale Vielfalt in synthetischen Texten unter prompt beeinflussten Längenvariationen messen 惩罚有很长的路要走:在迅速影响长长变的情况下,在合成文字中衡量法律多样性 2507.15092v1
  • 374 07-20 Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling Bewertung von Codierungsschemata für transformerbasierte Gene-Sequenz-Modellierung 以变异器为基础的基因序列建模编码方案评价 2507.15087v1
  • 375 07-20 The Dual-Route Model of Induction Das Dual-Routen-Modell der Induktion 双重制引模式 2504.03022v2
  • 376 07-20 OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs OpeNLGauge: Ein erklärbares Maß für die NLG-Evaluierung mit offenen LLMs OpeNLGauge: NLG 评估可解释的计量器,使用开放重力LMs 2503.11858v2
  • 377 07-20 WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization WebShaper: Agentische Datensynthese über Informationssuche Formalisierung WebShaper: 通过信息搜索正规化实现数据同步化 2507.15061v1
  • 378 07-20 RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback RefCritic: Training von langen Ketten-of-Thought-Kritik-Modellen mit Raffination Feedback 批评:培训具有精炼反馈的 “ 长期研究链 “ 批评模型 2507.15024v1
  • 379 07-20 How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation Wie weit sind LLMs davon entfernt, unsere digitalen Zwillinge zu sein? Ein Benchmark für die Persona-Based Behavior Chain Simulation 如何远离“我们的数字双双”的LLMs? 以人为基础的行为链模拟基准 2502.14642v2
  • 380 07-20 Towards Harmonized Uncertainty Estimation for Large Language Models Hin zu einer harmonisierten Ungewissheitsschätzung für große Sprachmodelle 争取为大语言模式统一不确定性估算 2505.19073v2
  • 381 07-20 Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian Dr.Copilot: Ein Multi-Agent Prompt Optimierter Assistent zur Verbesserung der Patienten-Doktor-Kommunikation auf Rumänisch 副驾驶:罗马尼亚改善病人-医生沟通多代理快速优化助理 2507.11299v2
  • 382 07-20 Why Does New Knowledge Create Messy Ripple Effects in LLMs? Warum erzeugt Neues Wissen in LLMs messy Ripple Effekte? 为什么新知识会在LLMS中产生混乱的波纹效应? 2407.12828v3
  • 383 07-20 Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Unterstützung der SENCOTEN Sprachdokumentation Bemühungen mit automatischer Spracherkennung 支持SENCOTEN语文文件工作,并自动语音识别 2507.10827v2
  • 384 07-20 MUR: Momentum Uncertainty guided Reasoning for Large Language Models MUR: Momentum Ungewissheit geführte Begründung für große Sprachmodelle MUR:大语言模型的动态不确定性引导理由 2507.14958v1
  • 385 07-20 SYNTHIA: Synthetic Yet Naturally Tailored Human-Inspired PersonAs SYNTHIA: Synthetisch und doch natürlich maßgeschneiderte, von Menschen inspirierte Person SYNTHIA:合成但自然而然定制的受人类启发的人 2507.14922v1
  • 386 07-20 GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks GTSinger: Globales Multi-Technique Singen Corpus mit realistischen Noten für alle Singaufgaben GTSinger:一个拥有现实音乐分数的全唱任务全球多技术多技术歌唱公司 2409.13832v8
  • 387 07-20 On Entity Identification in Language Models Zur Identitätskennung in Sprachmodellen 关于在语文模式中实体识别 2506.02701v4
  • 388 07-20 PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation PromptSuite: Ein Task-Agnostic Framework für die Multi-Prompt-Generation 快速实用:多生一代任务不可确定框架 2507.14913v1
  • 389 07-20 AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction AutoGen Driven Multi Agent Framework für iterative Kriminalität Datenanalyse und Vorhersage 循环犯罪数据分析和预测自动驱动器多剂框架 2506.11475v2
  • 390 07-20 Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs Sparse Autoencoder-geführte Supervised Finetuning zu Mitigate Unerwartete Code-Switching in LLMs 用于LLMM 中非预期代码切换的微亮自定义编码器导导监督调整 2507.14894v1
  • 391 07-20 MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction MEKiT: Multi-Source Heterogene Wissensinjektionsmethode über Instruction Tuning für Emotion-Cause-Paar-Extraktion MEKIT:通过情感-原因对等采掘教学图示,多源源、异种知识注射法 2507.14887v1
  • 392 07-20 A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models Eine Übersicht über die Entwicklung sprachmodellbasierter Dialogsysteme: Daten, Aufgaben und Modelle 语文示范对话系统演变概览:数据、任务和模式 2311.16789v2
  • 393 07-20 Controlling Language Confusion in Multilingual LLMs Sprachkonfusion in mehrsprachigen LLMs kontrollieren 多语种LMM中控制语言混杂 2505.19116v2
  • 394 07-20 Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages Transformer und Ensemble-Methoden: Eine Lösung für Hass-Spracherkennung in arabischen Sprachen 变换器和组合方法:用阿拉伯语探测仇恨言论的解决方案 2303.09823v2
  • 395 07-20 Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding Über isolierte Fähigkeiten hinaus: Überbrückung von Long CoT-Reasoning und Long Context Understanding 超越孤立能力:连接长 CoT理由和长期理解 2507.14849v1
  • 396 07-20 The Invisible Leash: Why RLVR May Not Escape Its Origin Die unsichtbare Leine: Warum RLVR seinem Ursprung nicht entkommen kann 隐形Leash:为什么RLVR不能逃离其起源 2507.14843v1
  • 397 07-20 Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents Doc2Chart: Intent-Driven Zero-Shot Chart Generation aus Dokumenten Doc2图示: 从文档中生成零热图 2507.14819v1
  • 398 07-20 FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing FastLongSpeech: Erweiterung großer Sprachmodelle für eine effiziente Langspeech-Verarbeitung FastLongSpeech:加强大型语音-语言模型,以高效长语音处理 2507.14815v1
  • 399 07-20 Lizard: An Efficient Linearization Framework for Large Language Models Lizard: Ein effizienter Linearisierungsrahmen für große Sprachmodelle Lizard:大型语言模型的高效线性框架 2507.09025v2
  • 400 07-20 Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems Schwache Überwachungstechniken für verbesserte ASR-Modelle in CRM-Systemen auf Industrieebene 在工业级客户关系管理系统中加强ASR模型的薄弱监督技术 2507.16843v1
  • 401 07-20 A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios Eine Umfrage über großsprachige modellbasierte Sozialagenten in Spiel-Theoretischen Szenarien 关于游戏理论情景中以大语言模式为基础的社会因素的调查 2412.03920v2
  • 402 07-19 (6) GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization GRACE: Generative Empfehlung über Journey-Aware Sparse Achtung bei der Ketten-of-Thought-Tokenisierung GRACE: 通过Journey-Aware Sparass 注意力在 “ 探索链 “ 中产生的建议 2507.14758v1
  • 403 07-19 Domain-Adaptive Small Language Models for Structured Tax Code Prediction Domain-Adaptive kleine Sprachmodelle für strukturierte Steuervorhersage 用于结构化税码预测的领域自适应小语言模型 2507.10880v2
  • 404 07-19 On the robustness of modeling grounded word learning through a child’s egocentric input Auf die Robustheit der Modellierung geerdetes Wort Lernen durch den egozentrischen Input eines Kindes 通过儿童以自我为中心的投入进行基于基础的模拟文字学习的强健性 2507.14749v1
  • 405 07-19 Disparities in Peer Review Tone and the Role of Reviewer Anonymity Unterschiede in Peer Review Tone und die Rolle der Reviewer Anonymität 同行审查方式和审查者匿名作用的差异 2507.14741v1
  • 406 07-19 Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots Eine Stimme finden: Das Potenzial der afroamerikanischen Dialekt- und Sprachgenerierung für Chatbots erforschen 寻找声音:探索非裔美国人为查波特人创造语音和语音的潜力 2501.03441v2
  • 407 07-19 Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems Sortformer: Ein neuartiger Ansatz für Permutations-Resolved Speaker Supervision in Speech-to-Text Systemen Sortformer:语音转文本系统中排列消解式说话人监督的新方法 2409.06656v3
  • 408 07-19 Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation Dynamische Kontext-Tunings für retrieval-angereicherte Generation: Multi-Turn-Planung und Werkzeuganpassung verbessern 回收-提款一代动态环境图示:加强多周期规划和工具适应 2506.11092v2
  • 409 07-19 APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay APIGen-MT: Agentische Pipeline für die Multi-Turn-Datengenerierung über simuliertes Agent-Human-Interplay APIGen-MT: 通过模拟代理人间相互作用生成多发数据时的代理管道 2504.03601v4
  • 410 07-19 Towards the Next Frontier in Speech Representation Learning Using Disentanglement Auf dem Weg zur nächsten Front in der Sprachrepräsentanz Lernen mit Entflechtung 走向使用分离手段进行演讲代表学习的下一个前沿 2407.02543v2
  • 411 07-19 Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation Umdenken bei der Erkennung von Selbstmordgedanken: Ein vertrauensvolles Annotations-Framework und Cross-Lingual Model Evaluation 重新思考自杀意念检测:可信赖的标注框架和跨语言模型评估 2507.14693v1
  • 412 07-19 Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations Mind the Gap: Eine Überprüfung der arabischen Post-Training-Datensätze und deren Einschränkungen 《思想差距:对阿拉伯培训后数据集及其局限性的审查》 2507.14688v1
  • 413 07-19 MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization MiroMind-M1: Eine Open-Source-Erhöhung in mathematischer Reasoning über kontextorientierte Multi-Stage-Politikoptimierung MiroMind-M1:通过上下文软件多层次政策优化在数学理由方面的开放源码进步 2507.14683v1
  • 414 07-19 Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care Große Sprachmodelle als medizinische Codes-Selektoren: ein Maßstab unter Verwendung der Internationalen Klassifikation der Primärversorgung 大语言模式作为医疗法典选择者:使用国际初级保健分类的基准 2507.14681v1
  • 415 07-19 Docopilot: Improving Multimodal Models for Document-Level Understanding Docopilot: Verbesserung multimodaler Modelle für die Verständigung auf Dokumentebene Docopilot:改进文件级理解的多模式模式 2507.14675v1
  • 416 07-19 Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs Cleanse: Ungewissheitsabschätzungsansatz mit Clustering-basierter semantischer Konsistenz in LLMs 清洁性:在LLMM中采用基于集群的语义一致性 2507.14649v1
  • 417 07-19 Linear Relational Decoding of Morphology in Language Models Lineare relationale Dekodierung der Morphologie in Sprachmodellen 语言模型中细胞体理学的线际关系代谢 2507.14640v1
  • 418 07-19 CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages CSSL: Kontrastives Selbst-überwachtes Lernen für Abhängigkeitsparsing auf relativ freien Word-Ordnung und morphologisch reichen Low Resource Sprachen CSSL: 相对自由的有秩序和有体力丰富、低资源语言的自学自学自导学习 2410.06944v2
  • 419 07-19 Growing a Twig to Accelerate Large Vision-Language Models Einen Zweig wachsen, um große Visions-Sprachen-Modelle zu beschleunigen 为加速大型视觉语言模型而成长的Twig 2503.14075v2
  • 420 07-19 Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining Optimierung der Legal Document Retrieval in Vietnamesen mit semi-harten negativen Bergbau 优化越南法律文件检索,使用半硬负负采矿 2507.14619v1
  • 421 07-19 Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenian Primary Care: A Methodology Paper 肯尼亚初级保健背景示范测试的取回强化临床基准:方法文件 2507.14615v1
  • 422 07-19 Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification Backtranslation und Paraphrasierung in der LLM-Ära? Vergleich von Daten Augmentationsmethoden für die Emotionsklassifizierung LLM 时代的后翻和翻译? 比较情绪分类的数据增强方法 2507.14590v1
  • 423 07-19 What do Large Language Models know about materials? Was wissen Large Language Models über Materialien? 大语言模型对材料了解多少? 2507.14586v1
  • 424 07-19 Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption Erklärbares kollaboratives Problem beim Lösen der Diagnose mit BERT unter Verwendung von SHAP und dessen Implikationen für die Lehreradoption 使用SHAP及其对教师收养的影响,与BERT进行可解释的协作问题解决分析 2507.14584v1
  • 425 07-19 Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models Erforschung der Human-AI-Komplementarität in der CPS-Diagnose mit unimodalen und multimodalen BERT-Modellen 利用单式和多式BERT模型探索在CPS诊断中人与AI的互补性 2507.14579v1
  • 426 07-19 XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification XL-DURel: Feinsteuerungs-Sentenztransformatoren für die Ordnungs-Wort-in-Kontext-Klassifikation XL-DURel:Odinal Word-in-Ctext分类的微调句式变换器 2507.14578v1
  • 427 07-19 AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs? AlgoTune: Können Sprachmodelle allgemeine numerische Programme beschleunigen? AlgoTune: 语言模型能加速通用计算程序吗? 2507.15887v1
  • 428 07-19 BriLLM: Brain-inspired Large Language Model BriLLM: Gehirninspiriertes Large Language Model BrILLM: 脑启发型大语言模式 2503.11299v5
  • 429 07-19 KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse KVLink: Beschleunigen von großen Sprachmodellen über effiziente KV Cache Reuse KVLink: 通过高效 KV 缓存再利用加速大语言模型 2502.16002v3
  • 430 07-19 MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation MEMERAG: Mehrsprachiger End-to-End-Meta-Evaluierungs-Benchmark für Retrieval Augmented Generation MEMERAG: 检索增强生成的多语言端到端元评估基准 2502.17163v4
  • 431 07-19 Efficient Whole Slide Pathology VQA via Token Compression Effiziente ganze Folie Pathologie VQA über Token Compression 通过 Token 压缩高效的全幻灯片病理学 VQA 2507.14497v1
  • 432 07-19 TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios TIME: Mehrstufiger Benchmark für zeitliches Schlussfolgern von LLMs in realen Szenarien TIME:现实世界情景中LLMs时间推理的多层次基准 2505.12891v2
  • 433 07-19 Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification Label-Semantik Aware Generativer Ansatz für Domain-Agnostic Multilabel-Klassifikation 域-不可知性多标签分类的认知生成方法 2506.06806v2
  • 434 07-19 SWI: Speaking with Intent in Large Language Models SWI: Sprechen mit Intent in großen Sprachmodellen SWI:用大语言模型表达意向 2503.21544v2
  • 435 07-19 Draft-based Approximate Inference for LLMs Entwurfsbasierte approximative Inferenz für LLMs LLM 的基于草稿的近似推理 2506.08373v2
  • 436 07-19 AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization AlphaDPO: Adaptive Prämienspanne für direkte Präferenzoptimierung AlphaDPO: 直接优化优惠的适应性回报边缘 2410.10148v4
  • 437 07-19 Texture or Semantics? Vision-Language Models Get Lost in Font Recognition Textur oder Semantik? Vision-Sprachen-Modelle verlieren sich in der Schrifterkennung 纹理还是语义? 视觉语言模型在字体识别中迷失 2503.23768v2
  • 438 07-19 Vulnerability of LLMs to Vertically Aligned Text Manipulations Anfälligkeit von LLMs für vertikal ausgerichtete Textmanipulationen LLM 易受垂直对齐文本操纵的影响 2410.20016v3
  • 439 07-19 DRS: Deep Question Reformulation With Structured Output DRS: Tiefenfrage-Reformulation mit strukturierter Ausgabe DRS: 用结构化产出进行深度问题重新分析 2411.17993v5
  • 440 07-19 VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension VlogQA: Aufgaben-, Datensatz- und Ausgangsmodelle für vietnamesisch gesprochene maschinelle Leseverständnisse VlogQA:越南语音机器阅读理解的任务、数据集和基线模型 2402.02655v3
  • 441 07-19 It’s Not That Simple. An Analysis of Simple Test-Time Scaling Es ist nicht so einfach. Eine Analyse der einfachen Test-Zeit-Skalierung 不是那么简单 简单的测试时间缩放分析 2507.14419v1
  • 442 07-19 Inverse Scaling in Test-Time Compute Inverse Skalierung in der Testzeit berechnen 测试时间计算中的反反缩放 2507.14417v1
  • 443 07-18 (5) Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning Orchestrator-Agent Trust: Ein modulares agentisches KI-System zur visuellen Klassifikation mit vertrauensbewusster Orchestrierung und RAG-basiertem Reasoning Orchestrator-Agent Trust:一个具有信任感知编排和基于RAG推理的模块化智能体AI视觉分类系统 2507.10571v2
  • 444 07-18 Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions Bewertung der Zuverlässigkeit großer Sprachmodelle für deduktives Qualitatives Coding: Eine vergleichende Studie von ChatGPT-Interventionen 评估大语言模型用于演绎式定性编码的可靠性:ChatGPT干预措施的比较研究 2507.14384v1
  • 445 07-18 Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms Kombinatorische Optimierung für alle: Verwendung von LLMs zur Unterstützung von Nicht-Experten bei der Verbesserung von Optimierungsalgorithmen 组合优化全民:利用LLMs帮助非专家改进最佳化算法 2503.10968v2
  • 446 07-18 Error-Aware Curriculum Learning for Biomedical Relation Classification Error-Aware Curriculum Learning for Biomedical Relation Classification 生物医学关系分类的错误意识课程学习 2507.14374v1
  • 447 07-18 Text-to-SQL for Enterprise Data Analytics Text-zu-SQL für Enterprise Data Analytics 企业数据分析的文本到SQL 2507.14372v1
  • 448 07-18 Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs Layerwise Recall und die Geometrie des verwobenen Wissens in LLMs 平整图层回溯和LLM 中互交知识的几何 2502.10871v2
  • 449 07-18 Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans Analysieren Sie die Neuronen, nicht die Einbettungen: Verstehen, wann und wo LLM-Darstellungen mit Menschen ausgerichtet sind 分析神经,而不是内嵌:了解LLM代表何时何地与人类对齐 2502.15090v2
  • 450 07-18 Can LLMs Infer Personality from Real World Conversations? Kann LLMs Persönlichkeit von Real World Conversations ableiten? ” 现实世界对话 “ 的推论人性能能否得到LLMs? 2507.14355v1
  • 451 07-18 Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers Solo-Anschluss: Eine Parameter-Effiziente Feintuning-Technik für Transformatoren Solo 连接: 用于变形器的参数节能微调技术 2507.14353v1
  • 452 07-18 Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark Document Haystack: Ein langer Kontext Multimodales Bild/Dokument Verständnis Vision LLM Benchmark Haystack文件:长期、多模式图像/文件理解愿景LLM基准 2507.15882v1
  • 453 07-18 Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle 速度计划: 遮蔽传播语言模型的饱和日程安排 2506.19037v2
  • 454 07-18 Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning Symbolische Mixture-of-Experts: Adaptives Skill-basiertes Routing für heterogene Vernunft 专家的混合符号:基于适应性技能的异异源理据调离 2503.05641v3
  • 455 07-18 How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs Wie LLMs zeitliche Bedeutung in Narratives verstehen: Eine Fallstudie zur kognitiven Bewertung von LLMs LLM 在叙述中如何理解时间含义:对LLMs进行认知评价的案例研究 2507.14307v1
  • 456 07-18 Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study Ausrichtung großer Sprachmodelle auf ressourcenarme Sprachen durch LLM-basierte Selektive Übersetzung: Eine systematische Studie 通过基于LLM的选择性翻译,使大语言模式与低资源语言相一致:系统研究 2507.14304v1
  • 457 07-18 In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding In-Depth und In-Breadth: Vortraining multimodaler Sprachmodelle für ein umfassendes Chart-Verständnis In-Depth和In-Breadth:为全面理解图表而定制的多模态语言模型预训练 2507.14298v1
  • 458 07-18 WebGuard: Building a Generalizable Guardrail for Web Agents WebGuard: Aufbau einer generalisierbaren Leitplanke für Web-Agenten WebGuard:为网络代理建立一个可泛化的护栏 2507.14293v1
  • 459 07-18 A General Framework for Inference-time Scaling and Steering of Diffusion Models Ein allgemeiner Rahmen für Schlussfolgerungs-Zeit-Skalierung und Steuerung von Diffusionsmodellen 传播模型的推推时间缩放和引导总框架 2501.06848v5
  • 460 07-18 Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning Harmonie in Divergenz: Auf dem Weg zu einer schnellen, präzisen und speichereffizienten Null-Order-LLM Feinabstimmung 和谐共存:快速、准确和记忆效率高的零级LLM微调 2502.03304v2
  • 461 07-18 NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining NoHumansRequired: Autonome High-Quality Bildbearbeitung Triplet Mining 无人要求:自主高品质图像编辑三线采矿 2507.14119v1
  • 462 07-18 MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs MultiBLiMP 1.0: Ein massiv mehrsprachiger Benchmark für sprachliche Minimalpaare MultiBLiMP 1.0:语言最小对的大规模多语言基准 2504.02768v2
  • 463 07-18 Learning to Reason at the Frontier of Learnability Vernunft lernen an der Grenze der Lernfähigkeit 学习在可学习的前沿学习理性 2502.12272v5
  • 464 07-18 Sparse Rewards Can Self-Train Dialogue Agents Spärliche Belohnungen können Dialogagenten selbst trainieren 稀疏奖励可以自我训练对话代理 2409.04617v3
  • 465 07-18 DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits DENSE: Longitudinal Progress Note Generation mit zeitlicher Modellierung von heterogenen klinischen Anmerkungen über Krankenhausbesuche hinweg DENSE: 基于跨就诊异构临床记录时间建模的纵向病程记录生成 2507.14079v1
  • 466 07-18 Critiques of World Models Kritik an Weltmodellen 对世界模型的批评 2507.05169v2
  • 467 07-18 On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding On-Policy-Optimierung mit äquivalenter Gruppenpräferenz für das Multiprogrammieren des Sprachverständnisses 与多方案语言理解的集团等效优先 2505.12723v2
  • 468 07-18 Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog Kollaboratives Rational Speech Act: Pragmatische Begründung für Multi-Turn-Dialog 《合作合理言论法:多发对话的实用理由》 2507.14063v1
  • 469 07-18 EdgeVLA: Efficient Vision-Language-Action Models EdgeVLA: Effiziente Vision-Sprache-Aktionsmodelle EdgeVLA: 高效率的愿景-语言-行动模式 2507.14049v1
  • 470 07-18 Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs Cross-Lingual Auto-Evaluation für die Bewertung mehrsprachiger LLMs 评估多种语文LLMs的跨语言自动评价 2410.13394v2
  • 471 07-18 Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks Bewertung der Wirksamkeit von kosteneffizienten großen Sprachmodellen in biomedizinischen Benchmark-Aufgaben 评价基准生物医学任务中成本效率高的大型语言模型的效力 2507.14045v1
  • 472 07-18 Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen 通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查 2503.09567v5
  • 473 07-18 CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis CPC-CMS: Kognitives Paarweises Vergleichs-Klassifikation Modellauswahl-Framework für Dokument-Level-Sentimentanalyse CPC-CMS:文件级别感知分析文件级别感应分析的认知对称比较比较分类示范选择框架 2507.14022v1
  • 474 07-18 Efficient Temporal Tokenization for Mobility Prediction with Large Language Models Effiziente zeitliche Tokenisierung für Mobilitätsvorhersage mit großen Sprachmodellen 具有大语言模式的流动预测高效时时适调 2507.14017v1
  • 475 07-18 On the class of coding optimality of human languages and the origins of Zipf’s law Über die Klasse der Kodierungsoptimalität menschlicher Sprachen und die Ursprünge des Zipfschen Gesetzes 论人类语言的编码最优性类别与齐普夫定律的起源 2505.20015v4
  • 476 07-18 Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Offene automatische Spracherkennungsmodelle für klassische und moderne Standard-Arabisch 经典和现代阿拉伯文标准开放自动语音识别模式 2507.13977v1
  • 477 07-18 From Roots to Rewards: Dynamic Tree Reasoning with RL Von Wurzeln zu Belohnungen: Dynamische Baumveranlagung mit RL 从根到奖赏: 使用 RL 解释动态树 2507.13142v2
  • 478 07-18 Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking Ev2R: Evidence Retrieval im automatisierten Fact-Checking bewerten Ev2R:评价自动实况调查中的证据检索 2411.05375v2
  • 479 07-18 Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need Bottom-up Domain-spezifische Superintelligenz: Eine zuverlässige Wissensgrafik ist das, was wir brauchen 自下而上 内地特有超级情报机构:一个可靠的知识图是我们需要的 2507.13966v1
  • 480 07-18 Exploiting Primacy Effect To Improve Large Language Models Nutzung des Primateffekts zur Verbesserung großer Sprachmodelle 利用优势效应改进大语言模式 2507.13949v1
  • 481 07-18 Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support Marcel: Ein leichter und offener Gesprächsagent für Studentenunterstützung an der Universität 马塞尔:一个轻量级和开放源码的大学学生支助对话代理人 2507.13937v1
  • 482 07-18 Preprint: Did I Just Browse A Website Written by LLMs? Preprint: Habe ich gerade eine Website durchsucht, die von LLMs geschrieben wurde? 预印:我刚刚浏览了一个由LLMS编写的网站吗? 2507.13933v1
  • 483 07-18 The Levers of Political Persuasion with Conversational AI Die Hebel der politischen Überzeugung mit konversationeller KI 对话式AI政治说服的杠杆 2507.13919v1
  • 484 07-18 Political Leaning and Politicalness Classification of Texts Politisches Leaning und Politisches Einordnen von Texten 文本的政治精度和政治政治性分类 2507.13913v1
  • 485 07-18 Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review 从商业文件中提取的基于深学习的关键信息:系统文献审查 2408.06345v2
  • 486 07-18 Using LLMs to identify features of personal and professional skills in an open-response situational judgment test Verwendung von LLMs zur Identifizierung von Merkmalen persönlicher und beruflicher Fähigkeiten in einem offenen situativen Beurteilungstest 利用LLMM 确定公开反应情况判断测试中个人和专业技能的特点 2507.13881v1
  • 487 07-18 Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies Optimierung von ASR für katalanische-spanische Code-Switching: Eine vergleichende Analyse von Methodologien 优化加泰罗尼亚-西班牙编码转换的ASR:方法比较分析 2507.13875v1
  • 488 07-18 Label Unification for Cross-Dataset Generalization in Cybersecurity NER Label-Vereinheitlichung für Cross-Dataset-Verallgemeinerung in Cybersecurity NER 网络安全NER中跨数据集泛化的标签统一 2507.13870v1
  • 489 07-18 HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation HoH: Ein dynamischer Benchmark zur Bewertung der Auswirkungen veralteter Informationen auf die retrieval-augmentierte Generation HoH:评估过时信息对回源一代人的影响的动态基准 2503.04800v3
  • 490 07-18 SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection SPARQL Query Generation mit LLMs: Messung der Auswirkungen von Trainingsdatenerfassung und Wissensinjektion SPARQL 使用LLMs 进行查询:衡量培训数据记忆和知识输入的影响 2507.13859v1
  • 491 07-18 InTraVisTo: Inside Transformer Visualisation Tool InTraVisTo: Innen-Transformer-Visualisierungswerkzeug InTraVisTo: Transformer内部可视化工具 2507.13858v1
  • 492 07-18 Modeling Fair Play in Detective Stories with Language Models Modeling Fair Play in Detektivgeschichten mit Sprachmodellen 模拟具有语言模式的侦探故事中的公平游戏 2507.13841v1
  • 493 07-18 The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words Die Ausdrucksformen von Depression und Angst in der chinesischen psychologischen Beratung: Verwendung des Pronomens der ersten Person Singular und negativer emotionaler Wörter 中国心理咨询中抑郁和焦虑的表现形式:第一人称单数代词和消极情感词的使用 2507.13839v1
  • 494 07-18 LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v3
  • 495 07-18 Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models Frage-Antwort-Extraktion aus wissenschaftlichen Artikeln mit Wissensgraphen und großen Sprachmodellen 利用知识图和大语言模型从科学文章中提取问题答案 2507.13827v1
  • 496 07-18 RAG-based Architectures for Drug Side Effect Retrieval in LLMs RAG-basierte Architekturen für Arzneimittel-Side-Effekt-Retrieval in LLMs 以RAG为基础的长效LM中药物副效应回收建筑 2507.13822v1
  • 497 07-18 Exploring Graph Representations of Logical Forms for Language Modeling Erforschen von Graphendarstellungen von Logischen Formen für die Sprachmodellierung 探讨语言建模逻辑格式图示图示 2505.14523v2
  • 498 07-18 Consistency of Responses and Continuations Generated by Large Language Models on Social Media Kohärenz von Reaktionen und Fortsetzungen, die von großen Sprachmodellen in den sozialen Medien erzeugt werden 由社会媒体大语言模式生成的应对措施和延续的一致性 2501.08102v3
  • 499 07-18 Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian Code Lesbarkeit im Zeitalter großer Sprachmodelle: Eine industrielle Fallstudie von Atlassian 《大语言模式时代的可读性:阿特拉斯斯语工业案例研究》 2501.11264v3
  • 500 07-18 An Enhanced Model-based Approach for Short Text Clustering Ein verbesserter modellbasierter Ansatz für Kurztext-Clustering 强化的短文本集群化基于模式的强化办法 2507.13793v1
  • 501 07-18 Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions Vision-Sprachen-Modelle zu fragen lehren: Ambiguität in visuellen Fragen lösen 教学 “ 视觉-语言模型:解决视觉问题中的模糊问题 “ 的问询 2507.13773v1
  • 502 07-18 From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation Von KMMLU-Redux zu KMMLU-Pro: Eine professionelle koreanische Benchmark-Suite für die LLM-Bewertung 从KMMLU-Redux到KMMLU-Pro:韩国用于LLM评价的专业基准套件 2507.08924v2
  • 503 07-18 Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models Unschuld im Kreuzfeuer: Rollen von Skip Connections in Jailbreaking Visual Language Models 《交火中的无罪:在破狱视觉语言模型中跳过连接的作用》 2507.13761v1
  • 504 07-18 PRIDE – Parameter-Efficient Reduction of Identity Discrimination for Equality in LLMs PRIDE – Parameter-Effiziente Reduzierung der Identitätsdiskriminierung für die Gleichstellung in LLMs PRIDE – – 有效减少在LLM中平等身份歧视的参数 2507.13743v1
  • 505 07-18 From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios Von Worten bis zu Kollisionen: LLM-geführte Bewertung und adversarische Generierung von sicherheitskritischen Fahrszenarien 从文字到碰撞:LLM-指导评价和反向生成安全紧急驾驶设想方案 2502.02145v4
  • 506 07-18 DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs DailyLLM: Context-Aware-Aktivitätsprotokollierung mit Multi-Modal-Sensoren und LLMs DailyLLM: 使用多模式传感器和LLM 生成背景软件活动日志 2507.13737v1
  • 507 07-18 The Judge Variable: Challenging Judge-Agnostic Legal Judgment Prediction Die Richtervariable: Herausfordernde Richter-agnostische rechtliche Urteilsvorhersage 法官变量:挑战法官-不可接受法律判决预测 2507.13732v1
  • 508 07-18 DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Verstärkungslernen für Subgoal-Zerlegung DeepSeek-Prover-V2:通过用于子目标分解的强化学习推进形式化数学推理 2504.21801v2
  • 509 07-18 Automatically assessing oral narratives of Afrikaans and isiXhosa children Automatische Beurteilung mündlicher Erzählungen von Afrikaans und isiXhosa Kindern 自动评估南非荷兰语和土著Xhoosa儿童口述叙述 2507.13205v2
  • 510 07-18 To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization Um zu kodieren oder nicht zu kodieren? Adaptive Toolintegration für Math Language Models über Erwartungs-Maximierung 代码或非代码?通过期望-最大化将数学语言模型整合的适应性工具集成 2502.00691v4
  • 511 07-18 LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning LLM-getriebene medizinische Report Generierung über kommunikationseffizientes Heterogenes Federated Learning LLM 驱动的通过通信效率高的异质联邦学习编写医学报告 2506.17562v2
  • 512 07-18 ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems ASTRID – Eine automatisierte und skalierbare TRIaD für die Bewertung von RAG-basierten klinischen Frageantwortsystemen ASTRID – – 用于评价以RAG为基础的临床问题解答系统的自动和可升级的TRIAD 2501.08208v2
  • 513 07-18 Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations Konsequente Erklärer oder unzuverlässige Erzähler? LLM-generierte Gruppenempfehlungen verstehen 一致的解释者还是不可靠的叙述者? 理解LLM生成的群组推荐 2507.13705v1
  • 514 07-18 Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models Modellierung der Open-World-Kognition als On-Demand-Synthese probabilistischer Modelle 将开放世界的认知建模作为概率模型的 “ 现场合成 “ 模型 2507.12547v2
  • 515 07-18 LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues LoopServe: Ein adaptives Dual-Phase-LLM-Inferenz-Beschleunigungssystem für Multi-Turn-Dialoge 环环服务:多轨对话的适应性双阶段双阶段LLM推推加速系统 2507.13681v1
  • 516 07-18 KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs KiC: Schlüsselwort-inspirierte Cascade für kosteneffiziente Textgenerierung mit LLMs KIC: 与LLMs一起制作成本效率高的文本的关键字启发级联 2507.13666v1
  • 517 07-18 CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer CU-ICU: Anpassen unüberwachter Instruktions-Finetuned Language Models für ICU-Datensätze über Text-zu-Text Transfer Transformer CU-ICU: 通过文本到文字传输变换器定制ICU数据集的不受监督的指令-不全调语言模型 2507.13655v1
  • 518 07-18 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Die Illusion des Denkens: Die Stärken und Grenzen von Vernunftmodellen über das Lens of Problem Complexity verstehen 思考的幻觉:通过问题复杂焦点了解理性模型的长处和局限性 2506.06941v2
  • 519 07-18 EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation EvolveNav: Selbstverbessernde körpereigene Begründung für LLM-basierte Vision-Language-Navigation EvolveNav:基于LLM的愿景-语言导航自我改善自足理由 2506.01551v2
  • 520 07-18 Temporal reasoning for timeline summarisation in social media Temporale Argumentation für Zeitlinienzusammenfassung in sozialen Medien 社交媒体时间时间总结推理 2501.00152v3
  • 521 07-18 ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text ViMMRC 2.0 – Verbesserung des maschinellen Leseverständnisses bei vietnamesischen Literaturtexten ViMMRC 2.0 – 加强对越南文学文本的机器阅读理解 2303.18162v3
  • 522 07-18 Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden 人类和大语言模式产生的文本的语言和嵌入式图解 2507.13614v1
  • 523 07-18 Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v3
  • 524 07-18 CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks CoTasks: Chain-of-Thought-basierte Video-Anleitung Tuning-Aufgaben 考量表: 以研究链为基础的视频教学图示任务 2507.13609v1
  • 525 07-18 STACK: Adversarial Attacks on LLM Safeguard Pipelines STACK: Adversarielle Angriffe auf LLM Safeguard Pipelines STACK: 对LLM保障管道的对抗性攻击 2506.24068v2
  • 526 07-18 Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Ist das Ihre letzte Antwort? Test-Time Scaling verbessert selektive Fragen beantworten 这就是你最后的答案吗? 测试时间缩放能改善选择性回答问题 2502.13962v2
  • 527 07-18 When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models Wenn Menschen Fluten sind: Analyse entmenschlichender Metaphern im Einwanderungsdiskurs mit großen Sprachmodellen 当人成为洪水:用大语言模型分析移民话语中的非人化隐喻 2502.13246v2
  • 528 07-18 TexGS-VolVis: Expressive Scene Editing for Volume Visualization via Textured Gaussian Splatting TexGS-VolVis: Expressive Szenebearbeitung für die Volumenvisualisierung über texturierte Gaussian Splatting TexGS-VolVis: 通过Textured Gaussian Splatting 进行卷量可视化的显性场景编辑 2507.13586v1
  • 529 07-17 (4) An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots Ein Ansatz zur automatischen Generierung von Beschriftungsfunktionen für Software Engineering Chatbots 软件工程聊天器自动生成标签功能的方法 2410.07094v2
  • 530 07-17 A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Ein Data-Centric Framework zur Bewältigung phonetischer und prosodischer Herausforderungen in russischen Sprachgenerativen Modellen 解决俄罗斯语音生成模型中电话和预发挑战的数据中心框架 2507.13563v1
  • 531 07-17 Culture is Not Trivia: Sociocultural Theory for Cultural NLP Kultur ist nicht Trivia: Soziokulturelle Theorie für kulturelle NLP 文化不是特里维亚文化:社会文化文化理论 2502.12057v2
  • 532 07-17 Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder Lesen zwischen den Zeilen: Kombination von Pausendynamik und semantischer Kohärenz zur automatisierten Bewertung von Gedankenstörungen 在两行之间阅读:将暂停动态和语义一致性结合起来,以自动评估思想紊乱 2507.13551v1
  • 533 07-17 GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models GOFAI trifft Generative KI: Entwicklung von Expertensystemen mittels großer Sprachmodelle GOFAI会议:通过大语言模式发展专家系统 2507.13550v1
  • 534 07-17 A Computational Approach to Modeling Conversational Systems: Analyzing Large-Scale Quasi-Patterned Dialogue Flows Ein Computational Approach zur Modellierung von Gesprächssystemen: Analysieren großräumiger Quasi-gemusterter Dialogströme 模拟交汇系统模型化的计算方法:分析大型准源对话流量 2507.13544v1
  • 535 07-17 From Code to Compliance: Assessing ChatGPT’s Utility in Designing an Accessible Webpage – A Case Study Von Code zur Compliance: Bewertung des Nutzens von ChatGPT bei der Gestaltung einer barrierefreien Webseite – Eine Fallstudie 从代码到合规:评估查盖伯特在设计无障碍网页方面的效用 – – 案例研究 2501.03572v2
  • 536 07-17 Encoding syntactic objects and Merge operations in function spaces Kodierung syntaktischer Objekte und Zusammenführen von Operationen in Funktionsräumen 在功能空格中编码同族天体和合并操作 2507.13501v1
  • 537 07-17 The role of large language models in UI/UX design: A systematic literature review Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht 大语言模型在UI/UX设计中的作用:系统文献审查 2507.04469v2
  • 538 07-17 ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data ParaPO: Sprachmodelle so ausrichten, dass verbatime Reproduktion von Vortrainingsdaten reduziert wird ParaPO:调整语文模式,减少培训前数据的逐字记录 2504.14452v2
  • 539 07-17 Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? LLM Value Probing Strategies neu betrachtet: Sind sie robust und ausdrucksstark? 重新研究LLM 价值检验战略:它们是否有力和具有表现力? 2507.13490v1
  • 540 07-17 On Pre-training of Multimodal Language Models Customized for Chart Understanding Zur Vorausbildung multimodaler Sprachmodelle, die für das Chart-Verständnis angepasst sind 为了解图表而定制的多模式语言模型的预培训 2407.14506v3
  • 541 07-17 RExBench: Can coding agents autonomously implement AI research extensions? RExBench: Können Codierer KI-Forschungserweiterungen autonom implementieren? RExBench:编码代理商能否自主实施AI研究扩展? 2506.22598v2
  • 542 07-17 Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers Paper Summary Attack: Jailbreaking von LLMs durch LLM Safety Papers 论文摘要攻击:通过LLM安全论文越狱LLM 2507.13474v1
  • 543 07-17 psifx – Psychological and Social Interactions Feature Extraction Package psifx – Psychologische und soziale Interaktionen Feature Extraction Package psifx – 心理和社会互动特征提取包 2407.10266v4
  • 544 07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning VisionThink: Intelligentes und effizientes Vision-Sprachmodell durch Verstärkungslernen 远景设想:通过强化学习建立聪明、高效的愿景语言模式 2507.13348v1
  • 545 07-17 DeFine: Decision-Making with Analogical Reasoning over Factor Profiles DeFine: Entscheidungsfindung mit analogischer Begründung über Faktorprofile DeFine: 与因子剖析档的模拟理由有关的决策 2410.01772v2
  • 546 07-17 Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes Vergleich von Äpfeln mit Orangen: Ein Datensatz & Analyse des LLM Humorverständnisses von traditionellen Puns zu thematischen Witzen 将苹果与橙类比较:从传统Puns到专题笑话的LLM Humour理解数据集和分析 2507.13335v1
  • 547 07-17 The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Die Imitation Spiel: Turing Machine Imitator ist Länge Generalizable Reasoner 模拟游戏:图画机器模拟器是长可概括的理由 2507.13332v1
  • 548 07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It Vision-and-Language Training hilft, taxonomisches Wissen zu implementieren, ändert es aber nicht grundlegend 愿景和语言培训帮助利用分类学知识,但不能从根本上改变这种知识。 2507.13328v1
  • 549 07-17 Social and Political Framing in Search Engine Results Soziale und politische Framing in Suchmaschinen-Ergebnissen 寻找引擎结果中的社会和政治形式 2507.13325v1
  • 550 07-17 HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals HapticCap: Ein multimodaler Datensatz und die Aufgabe, die Benutzererfahrung von Schwingungshaptischen Signalen zu verstehen HapticCap:多模式数据集和了解用户振动信号信号体验的任务 2507.13318v1
  • 551 07-17 HuggingGraph: Understanding the Supply Chain of LLM Ecosystem HuggingGraph: Die Lieferkette von LLM Ecosystem verstehen HuggingGraph:了解LLM生态系统的供应链 2507.14240v1
  • 552 07-17 Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information Ermittlung von Aufgabengruppen für Multi-Task-Lernen mit pointwise V-Usable Information 利用有分点的V-可靠信息确定多任务学习组 2410.12774v2
  • 553 07-17 The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations Die Generative Energy Arena (GEA): Einbeziehung des Energiebewusstseins in das Large Language Model (LLM) Human Assessments 产生能源竞技场:将能源意识纳入大语言模型(LLM)人类评估 2507.13302v1
  • 554 07-17 AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research AbGen: Bewertung großer Sprachmodelle in Ablationsstudiendesign und Evaluation für wissenschaftliche Forschung AbGen:评估用于科学研究的实验研究设计和评价中的大语言模型 2507.13300v1
  • 555 07-17 Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis Multi-Agent Synergy-getriebene iterative visuelle Narrative Synthese 多机构协同-驱动动态迭代视觉叙述合成 2507.13285v1
  • 556 07-17 ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v2
  • 557 07-17 Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management Überblick über das TalentCLEF 2025: Kompetenz- und Berufstitel-Intelligenz für Human Capital Management 《2025年人才人才-CLEF概览:人力资本管理技能和职称情报》 2507.13275v1
  • 558 07-17 Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering Sichere Multifaceted-RAG für Unternehmen: Hybrides Knowledge Retrieval mit Security-Filterung 企业安全多面安全RAG:带安全过滤器的混合知识检索 2504.13425v2
  • 559 07-17 QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation QuestA: Erweitern der Reasoning-Kapazität in LLMs durch Frageerweiterung QuestA:通过问题增强扩展LLMs的推理能力 2507.13266v1
  • 560 07-17 Automating Steering for Safe Multimodal Large Language Models Automatisierungslenkung für sichere multimodale große Sprachmodelle 安全多式联运大语言模式自动化指导 2507.13255v1
  • 561 07-17 ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs ConTextual: Verbesserung der klinischen Textzusammenfassung in LLMs mit kontextschonender Token-Filterung und Wissensgraphen ConTextual:通过保留上下文的词元过滤和知识图谱改进LLM的临床文本摘要 2504.16394v3
  • 562 07-17 Enhancing Cross-task Transfer of Large Language Models via Activation Steering Verbesserung der Cross-Task-Übertragung großer Sprachmodelle durch Aktivierungslenkung 通过启动指导加强大语言模式的跨任务转让 2507.13236v1
  • 563 07-17 CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings CoDet-M4: Erkennung maschinengenerierter Codes in Multi-Lingual-, Multi-Generator- und Multi-Domain-Einstellungen CoDet-M4:多语言、多驱动器和多域设置中的检测机生成代码 2503.13733v2
  • 564 07-17 A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans Ein Vergleichsansatz zur Beurteilung sprachlicher Kreativität von großen Sprachmodellen und Menschen 评估大语言模式和人类语言创造性的比较方法 2507.12039v2
  • 565 07-17 GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems GEMMAS: Graph-basierte Evaluations-Metriken für Multi-Agent-Systeme GEMMAS:基于图表的多剂系统评价计量表 2507.13190v1
  • 566 07-17 Feature-based analysis of oral narratives from Afrikaans and isiXhosa children Feature-basierte Analyse oraler Erzählungen von Afrikaans und isiXhosa-Kindern 对南非荷兰语和土著Xhoosa儿童口述叙述的基于特征的分析 2507.13164v1
  • 567 07-17 CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation CCL-XCoT: Eine effiziente Cross-Lingual Knowledge Transfer Methode zur Minderung der Halluzinationserzeugung CCL-XCoT: 一种用于缓解幻觉生成的高效跨语言知识迁移方法 2507.14239v1
  • 568 07-17 Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities Inverses Verstärkungslernen trifft auf Post-Training großer Sprachmodelle: Grundlagen, Fortschritte und Chancen 逆强化学习遇上大语言模型后训练:基础、进展与机遇 2507.13158v1
  • 569 07-17 SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2
  • 570 07-17 Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v1
  • 571 07-17 A Computational Framework to Identify Self-Aspects in Text Ein Computational Framework zur Identifizierung von Selbstaspekten im Text 文本中识别自我特征的计算框架 2507.13115v1
  • 572 07-17 Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression Task-Circuit Quantization: Nutzung von Wissen Lokalisierung und Dolmetschbarkeit für Komprimierung 任务-环境环境定量:利用知识本地化和压缩解释 2504.07389v2
  • 573 07-17 Language Models Change Facts Based on the Way You Talk Sprachmodelle ändern Fakten anhand der Art und Weise, wie Sie sprechen 以你说话的方式为基础的语言模式改变事实 2507.14238v1
  • 574 07-17 SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts SemCSE: Semantische kontrastive Satzeinbettungen mit LLM-generierten Zusammenfassungen für wissenschaftliche Abstracts SEMCSE: 使用LLM创制的科学摘要摘要 2507.13105v1
  • 575 07-17 Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle 大型视觉语言模型统一三维级幻觉评价 2410.23114v4
  • 576 07-17 SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v2
  • 577 07-17 MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2
  • 578 07-17 Formalizing Attack Scenario Description: A Proposed Model Formalisierung des Angriffsszenarios Beschreibung: Ein vorgeschlagenes Modell 正式化攻击设想情况说明:拟议模式 2507.13076v1
  • 579 07-17 Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities Rethinking the Embodyd Gap in Vision-and-Language Navigation: Eine ganzheitliche Studie physischer und visueller Disparitäten 重新思考视觉和语言导航中的 “ 内博差距 “ :关于物理和视觉差异的综合研究 2507.13019v1
  • 580 07-17 Teach Old SAEs New Domain Tricks with Boosting Lehren Sie alte SAEs neue Domain Tricks mit Förderung 教授旧的 SAEs 新域圈套 2507.12990v1
  • 581 07-17 Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v2
  • 582 07-17 MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps MRT bei IberLEF-2025 PRESTA Aufgabe: Maximierung der Erholung von Tischen mit mehreren Schritten IberLEF-2025 PRESTA任务:最大限度地从有多个步骤的表格中回收 2507.12981v1
  • 583 07-17 UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets UniSLU: Unified Spoken Language Understanding aus heterogenen Cross-Task-Datensätzen UUSLU:从不同式跨任务数据集获得统一口语语言理解 2507.12951v1
  • 584 07-17 Probabilistic Soundness Guarantees in LLM Reasoning Chains Probabilistische Solidität garantiert in LLM-Aufklärungsketten LLM 理赔链条的概率稳妥性保障 2507.12948v1
  • 585 07-17 OASIS: Order-Augmented Strategy for Improved Code Search OASIS: Order-Augmented Strategy for Improved Code Search OASIS:改进守则搜索的有秩序加强战略 2503.08161v4
  • 586 07-17 Making Language Model a Hierarchical Classifier and Generator Sprachmodell zu einem hierarchischen Klassifikator und Generator machen 使语言模式成为等级分类和生成器 2507.12930v1
  • 587 07-17 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents MEM1: Lernen, Speicher zu synergisieren und für effiziente Long-Horizon-Agenten zu verankern MEM1:学习如何使记忆和理由相互协调,以有效长森剂 2506.15841v2
  • 588 07-17 Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v3
  • 589 07-17 IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization IOPO: Verstärkung von LLMs mit komplexer Anleitung über Input-Output Preference Optimization IOPO:通过投入-产出优化,以复杂教学赋予LLMs权力 2411.06208v3
  • 590 07-17 Why Braking? Scenario Extraction and Reasoning Utilizing LLM Warum bremsen? Szenario Extraktion und Vernunft Verwendung LLM 为什么要踩脚? 设想提取和合理使用LLM 2507.15874v1
  • 591 07-17 On the Limitations of Large Language Models (LLMs): False Attribution Über die Grenzen großer Sprachmodelle (LLMs): Falsche Attribution 对大语言模式限制的限制: 2404.04631v2
  • 592 07-17 Aligning Knowledge Graphs and Language Models for Factual Accuracy Ausrichtung von Wissensgraphen und Sprachmodellen für die tatsächliche Genauigkeit 将知识图和语言模型与事实准确性对齐 2507.13411v1
  • 593 07-17 A Logically Consistent Chain-of-Thought Approach for Stance Detection Ein logisch konsistenter, schlüsselfertiger Ansatz zur Stance-Erkennung 一种逻辑上一致的研究链方法,以探测Stance 2312.16054v2
  • 594 07-17 MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness MAC-Tuning: Mehrkompositionelles LLM-Problem mit verbesserter Kenntnis der Grenzen des Wissens MAC-指导:LLM 以增进知识边界意识为由的多组问题 2504.21773v2
  • 595 07-17 SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems SEALGuard: Mehrsprachige Gespräche in südostasiatischen Sprachen für LLM-Softwaresysteme sichern SEALGuard:为LLM软件系统维护东南亚语言多语言对话 2507.08898v3
  • 596 07-17 Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent? Sind Wissen und Referenz in mehrsprachigen Sprachmodellen bereichsübergreifend konsistent? 多语文模式中的知识和参考资料是否相互一致? 2507.12838v1
  • 597 07-17 Causal Language Control in Multilingual Transformers via Sparse Feature Steering Causal Language Control in Mehrsprachigen Transformatoren über Sparse Feature Steering 多语种变换器的因果语言控制 2507.13410v1
  • 598 07-17 Emotional Support with LLM-based Empathetic Dialogue Generation Emotionale Unterstützung mit LLM-basiertem Empathetic Dialogue Generation 利用基于LLM的 “ 同情对话 “ 生成的LLM “ 情感支持 2507.12820v1
  • 599 07-17 Large Language Models’ Internal Perception of Symbolic Music Die innere Wahrnehmung symbolischer Musik durch große Sprachmodelle 大语言模型内部对符号音乐的感知 2507.12808v1
  • 600 07-17 MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models MCPEval: Automatische MCP-basierte Deep Evaluation für AI Agent Modelle MCPEval:AI 代理模型的自动MCP深度评估 2507.12806v1
  • 601 07-17 PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database PMKLC: Parallele Multi-Knowledge Learning-basierte Lossless-Kompression für großformatige Genomics-Datenbank PMKLC: 大型基因组数据库的平行多知识学习-无损失压缩 2507.12805v1
  • 602 07-17 ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v2
  • 603 07-17 Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media Beyond Architectures: Bewertung der Rolle kontextueller Einbettungen bei der Erkennung bipolarer Störungen in sozialen Medien 超越建筑:评价背景嵌入在发现社会媒体两极分极分崩离析现象中的作用 2507.14231v1
  • 604 07-17 Learning Robust Negation Text Representations Robuste Negations-Textdarstellungen lernen 学习强力否定文本代表 2507.12782v1
  • 605 07-17 A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models Eine umfassende Umfrage zur elektronischen Gesundheitsdatenmodellierung: Von Deep Learning Ansätzen bis hin zu großen Sprachmodellen 《电子健康记录模型综合调查:从深学习方法到大语言模式》 2507.12774v1
  • 606 07-17 Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback Kritik-GRPO: LLM-Vernunft mit natürlicher Sprache und numerischem Feedback verbessern Critique-GROPO: 提高以自然语言和数字反馈为依据的LLM 2506.03106v4
  • 607 07-17 Synergy: End-to-end Concept Model Synergie: Ende-zu-Ende-Konzeptmodell 协同增效:端到端概念模型 2507.12769v1
  • 608 07-17 VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents VIDEE: Visuelle und Interaktive Zersetzung, Ausführung und Auswertung von Text Analytics mit intelligenten Agenten VIDE: 视觉和交互分解、执行和评价与智能剂的文字分析分析 2506.21582v2
  • 609 07-17 Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Logit Arithmetische Elizite lange mit Gründen verbundene Fähigkeiten ohne Training 未经培训的逻辑 2507.12759v1
  • 610 07-17 Strategy Adaptation in Large Language Model Werewolf Agents Strategieanpassung im großen Sprachmodell Werwolf-Agenten 大语言示范狼人代理物的适应战略 2507.12732v1
  • 611 07-17 TransEvalnia: Reasoning-based Evaluation and Ranking of Translations TransEvalnia: Reasoning-based Evaluation und Ranking von Übersetzungen 过年:基于理由的评价和笔译的排名 2507.12724v1
  • 612 07-17 Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs Synthesizing Privacy-Preserving Text Data via Finetuning ohne Finetuning Billion-Scale LLMs 通过不作十亿规模的微调微调的微调合成保护隐私文本数据 2503.12347v2
  • 613 07-17 GUI Test Migration via Abstraction and Concretization GUI-Test-Migration über Abstraktion und Konkretisierung GUI 通过抽象和简明化测试移民 2409.05028v2
  • 614 07-17 Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening Fairness ist nicht genug: Auditing-Kompetenz und Intersektions-Bias in KI-powered Resume Screening 公平不够充分:审计能力和大赦国际授权的恢复筛选中的跨部门比阿斯 2507.11548v2
  • 615 07-17 ActionStudio: A Lightweight Framework for Data and Training of Large Action Models ActionStudio: Ein leichter Rahmen für Daten und Training großer Aktionsmodelle 行动研究:关于大型行动模式的数据和培训的轻量框架 2503.22673v3
  • 616 07-17 Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung 引导大语言模型中传译锥体:经验评价 2506.17088v2
  • 617 07-17 AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation AudioJudge: Verstehen, was in der großen Audiomodell basierten Sprachbewertung funktioniert 音频法官:了解大型音频示范演讲评价有什么用 2507.12705v1
  • 618 07-17 Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis Ausnutzung adaptiver Kontextmasken für aspektbasierte Sentiment-Analysen 利用适应性环境掩码进行外观感应力分析 2402.13722v2
  • 619 07-17 AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis AdaptiSent: Context-Aware Adaptive Aufmerksamkeit für multimodale Aspect-Based-Sentiment-Analysen 适应性:基于多种模式的光谱感应分析的上下文知识适应性关注 2507.12695v1

Article 0

Title@2025-07-24 (4): Checklists Are Better Than Reward Models For Aligning Language Models

Title: Checklists Are Better Than Reward Models For Aligning Language Models Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen 核对列表比奖励模型更好调整语言模型 2507.18624v1

Authors (7): Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu

Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.

语言模式必须加以调整,以便理解并遵循用户的指示。强化学习被广泛用于促进这一点 – – 通常使用固定标准,如“帮助”和“协调”。在我们的工作中,我们提议使用灵活的、针对具体教学的标准,作为扩大强化学习对随后的教学的影响的手段。我们提议“从核对表反馈中加强学习”(RLCF)。我们从指示中提取核对清单,评估每个项目的反应如何令人满意 – – 使用大赦国际法官和专门核查程序 – – 然后将这些分数结合起来,计算RL.。我们将这些分数与适用于五个广泛研究基准的强力教学模式(Quen2.5-7B-Instruct)的其他调整方法进行比较。我们建议使用“强化学习”标准,作为提高每个基准业绩的唯一方法,包括提高跟踪Bench的4点满意度,提高InFoBench的6点,提高Arena-Hard的赢率。这些结果将核对清单反馈作为关键工具,用于改进语言模型对表达多种需要的询问的支持。
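
The reward computation described in the abstract above (score each checklist item, then combine the scores) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ChecklistItem`, `checklist_reward`, the per-item weights, and the equal mixing of judge and verifier scores are all hypothetical choices, since the abstract does not say how scores are weighted or combined.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ChecklistItem:
    """One requirement extracted from an instruction (hypothetical structure)."""
    text: str
    weight: float = 1.0  # assumed importance weight, not taken from the paper
    verifier: Optional[Callable[[str], bool]] = None  # optional programmatic check


def checklist_reward(
    response: str,
    items: Sequence[ChecklistItem],
    judge: Callable[[str, str], float],
) -> float:
    """Return the weighted mean of per-item scores, each in [0, 1]."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("checklist needs a positive total weight")
    score = 0.0
    for item in items:
        # The judge is any callable returning a score in [0, 1] for one item.
        item_score = judge(response, item.text)
        if item.verifier is not None:
            # Assumption: average the judge score with the hard verifier outcome.
            item_score = 0.5 * (item_score + float(item.verifier(response)))
        score += item.weight * item_score
    return score / total_weight
```

In an RL loop, `judge` would wrap a call to an AI judge and `verifier` a small program (for example, a length or format check); the scalar returned here would then serve as the reward for the sampled response.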


Article 1

Title@2025-07-24 (4): TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

Title: TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen TRPropt: 从文本奖励中促进解答询问软件快速优化 2507.18618v1

Authors (5): Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West

Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.

快速优化可以提高大型语言模型(LLMS)的推理能力,而不需要更新目标模型的参数。根据基于超常的“逐步思考”方法,该字段在两个主要方向上演进:一组方法使用文字反馈,以无培训的方式从普通用途LMS中获取更好的提示,同时进行一系列研究,依靠数字奖励来培训特别快速模型,专门为目标模型提供最佳提示。在本文中,我们引入了文本快速框架(TRPrompt),该框架通过直接将文字反馈纳入快速模型的培训而统一了这些方法。我们的框架不需要先前的数据集收集,并且正在随着对生成的提示的反馈的迭代改进。当LLMM有能力将什么是“好的”概念内部化时,文本奖励提供的高分辨率信号使我们能够训练一个迅速的模型,生成具有挑战性的数学数据集 GSMHard 和 MATH 所出现的问题的最新查询提示。


Article 2

Title@2025-07-24 (4): SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

Title: SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift 合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明 2507.18616v1

Authors (6): Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim

Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.

零点图像显示( ZIC ) 越来越多地使用文本到图像模型( T2I ) 生成的合成数据集, 以减轻成本高昂的人工批注需求。 然而, 这些 T2I 模型往往生成显示语义不匹配的图像, 其相应的输入标题( 例如, 缺失的天体, 不正确的属性) 导致合成图像显示对配对噪音, 从而阻碍模式培训。 现有的数据集调整技术主要设计用于消除网络模拟数据( T2I ) 模型生成的噪音文字。 然而, 这些方法不适合合成数据的独特挑战, 其中标题通常结构完善, 但图像的表达可能不准确。 为了缩小这一差距,我们引入了 SynC , 这是一个专门为 ZIC 改进合成图像显示数据集配置的新型框架。 SynC 重点不是常规的过滤或再生,而是为合成图像库中已经存在的最具有语义一致性的图像重新配置字幕( 我们的方法是先重新定位多个相关候选人图像的配置战略 ) , 将一个状态到一个状态制图战略 , 初步检索多个相关的图像分析候选人 , 将每个图像浏览周期的系统进行不断校正化的校正化的校正校正 。
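
The two steps in the abstract above (retrieve several candidate images per caption, then keep the candidate that best retrieves the original caption back) can be sketched with plain NumPy. This is an illustrative sketch under assumptions: captions and images are taken to be embedded in one shared space, the cycle-consistency score is a softmax over captions, and the names `reassign_captions`, `k`, and `temperature` are hypothetical. The paper's actual retrievers and scorer may differ.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def reassign_captions(caption_emb, image_emb, k=3, temperature=0.1):
    """Return, for every caption, the index of the image assigned to it."""
    c = l2_normalize(np.asarray(caption_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    sim = c @ v.T  # shape: (num_captions, num_images)

    # Step 1 (one-to-many): top-k candidate images for each caption.
    k = min(k, sim.shape[1])
    candidates = np.argsort(-sim, axis=1)[:, :k]

    # Step 2 (image-to-text retrieval): for each image, a softmax over captions.
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    i2t = np.exp(logits)
    i2t = i2t / i2t.sum(axis=0, keepdims=True)

    # Score each candidate by how strongly it retrieves the original caption.
    rows = np.arange(len(c))[:, None]
    cand_scores = i2t[rows, candidates]  # shape: (num_captions, k)
    best = cand_scores.argmax(axis=1)
    return candidates[np.arange(len(c)), best]
```

A candidate that is similar to many captions gets a low score for each of them, so the sketch prefers images that point back to one caption in particular.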


Article 3

Title@2025-07-24 (4): BEARCUBS: A benchmark for computer-using web agents

Title: BEARCUBS: A benchmark for computer-using web agents BEARCUBS: Benchmark für computergestützte Web-Agenten BEARCUBS:计算机使用网络代理器的基准 2503.07919v3

Authors (6): Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “small but mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to e.g., Operator’s 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

现代网络代理拥有计算机使用能力,使其能够通过向虚拟键盘和鼠标发送指令与网页互动。 虽然这些代理具有巨大的潜力协助人类用户完成复杂任务, 但评估其在现实世界环境中的能力是一项重大挑战。 为此,我们引入了BEARCUBS, 这是一个“小型但强大的”基准, 共111个信息搜索问题, 旨在评估网络代理商搜索、浏览和识别网络上的事实信息的能力。 与先前的网络代理商基准不同, 解决 BEARCUBS 需要 (1) 访问现场网络内容而不是合成或模拟网页, 从而捕捉到真实世界网络互动的不可预测性; 以及 (2) 开展广泛的多式联运互动( 如视频理解、 3D导航) , 这是一项无法通过基于文本的工作周期绕开的。 BEARCUBS 的每一个问题都有相应的短、 明确的答复和有人类价值的浏览路径的浏览轨迹, 能够透明地评估代理商的绩效和战略。 一项人类研究证实 BECURBS 的改进问题是可避免的, , 但是, 在未来运行过程中, 需要大量使用网络代理商 。


Article 4

Title@2025-07-24 (4): Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Title: Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs 粗略的登录抽样:加速在LLMs中进行知识蒸馏 2503.16870v2

Authors (8): Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method “Random Sampling Knowledge Distillation”, which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.

知识蒸馏可以是一种在大语言模型中蒸馏知识的具有成本效益的技术,如果教师产出记录可以预先计算和缓存的话。然而,成功地将它应用于培训前的学习,基本上尚未探索。在这项工作中,我们证明,对诸如caching Top-K概率等知识蒸馏的幼稚方法,我们虽然直观,但向学生提供教师概率分布的偏差估计,从而导致不尽善的性能和校准。我们建议一种基于重要性的采样方法“兰多姆采样知识蒸馏”,提供不偏颇的估计,保持期望的梯度,并需要储存大量稀疏的登录。我们的方法使得对边际间接( < 10%) 的学生模型比跨机率培训更快地培训,同时保持竞争性的性能与完全蒸馏相比,从300M到3B不等的模型规模。
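
The bias argument in the abstract above can be illustrated numerically. The sketch below compares a cached Top-K target with a target built by sampling token ids from the teacher distribution. It assumes the sampling proposal equals the teacher distribution itself, so every draw carries the same importance weight; the paper's exact sampling scheme may differ, and the function names `topk_target` and `sampled_target` are hypothetical.

```python
import numpy as np


def topk_target(p: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest teacher probabilities and renormalize (biased)."""
    idx = np.argsort(-p)[:k]
    q = np.zeros_like(p)
    q[idx] = p[idx] / p[idx].sum()
    return q


def sampled_target(p: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n token ids from p with replacement and store counts / n.

    At most n entries are non-zero, and the expectation of the result is p.
    """
    ids = rng.choice(len(p), size=n, p=p)
    tokens, counts = np.unique(ids, return_counts=True)
    q = np.zeros_like(p)
    q[tokens] = counts / n
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = np.array([0.5, 0.2, 0.1, 0.08, 0.05, 0.04, 0.02, 0.01])
    # Averaging many sparse sampled targets approaches the teacher distribution.
    avg = np.mean([sampled_target(teacher, 4, rng) for _ in range(5000)], axis=0)
    print("sampled (mean):", np.round(avg, 3))
    # Top-K inflates the head and assigns zero mass to the tail.
    print("top-2 target:  ", np.round(topk_target(teacher, 2), 3))
```

Each sampled target stores at most `n` non-zero entries, yet its expectation equals the teacher distribution, whereas the Top-K target always overstates the head tokens and zeroes out the tail.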


Article 5

Title@2025-07-24 (4): Scaling RL to Long Videos

Title: Scaling RL to Long Videos Skalierung von RL zu langen Videos 缩放 RL 到长视频 2507.07966v2

Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.0% and 70.7% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-R1 across multiple benchmarks. Moreover, LongVILA-R1 shows steady performance improvements as the number of input video frames increases. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

我们引入了一个完整的配置框架,将视觉语言模型的推理推理升级到长视频,利用强化学习;我们应对长视频推理的独特挑战,为此整合了三个关键组成部分:(1) 大型数据集LongVideo-Reason,由104K长视频QA配对组成,配有体育、游戏和 vlogs等不同领域的高质量推理说明;(2) 双阶段培训管道,将视频模型的推理范围扩大至有想象力的微调(CoT-SFT)和强化学习(RL);(3) 长视频RL,名为多式强化序列平行模式(MRSP)的培训基础设施,其中包括序列平行型和基于VLLLM的长视频引擎,用于高效推出和预填。 在我们的实验中,LVA-R-RSVA视频模型在视频基准上表现得力强,在R-R-R1视频模型的升级和升级方面,在视频-R-R-R1上持续超前的升级。


Article 6

Title@2025-07-24 (4): What Makes You CLIC: Detection of Croatian Clickbait Headlines

Title: What Makes You CLIC: Detection of Croatian Clickbait Headlines Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen 是什么让你成为CLIC:发现克罗地亚点击头条头条 2507.14314v2

Authors (4): Marija Anđelić, Dominik Šipek, Laura Majer, Jan Šnajder

Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.

在线新闻渠道主要以广告收入模式运作,迫使记者制作往往丑恶、有趣和挑衅性的新闻头条 – – 通常称为点击bait。自动检测点击bait头条对于维护信息质量和读者对数字媒体的信任至关重要,需要背景理解和世界知识。对于这项任务,特别是资源不足的语言,仍然不清楚微调方法或文本内学习(ICL)是否产生更好的结果。在本文中,我们汇编了CLIC,这是一套新的数据集,用于在20年期间点击检测克罗地亚新闻头条头条,包括主流和边缘媒体。我们用克罗地亚语和英语对BERTI&c模型进行微调,并将其性能与基于LLMM ILLLC方法进行比较。最后,我们分析点击头条的语言特性。我们发现,分析的头条中有近一半含有点击,微调模型产生比一般LLMs更好的结果。


Article 7

Title@2025-07-24 (4): AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

Title: AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs Anilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用 2507.18584v1

Authors (7): Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang

Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.

尽管大型语言模型(LLMS)在一般领域的表现令人印象深刻,但它们往往在专门领域表现不佳。现有方法通常依靠数据综合方法,并通过使用未贴标签的数据来捕捉特定领域的特点,从而产生有希望的结果。然而,这些方法要么是计算成本高,要么是绩效限制,同时也表明不同任务之间的概括性不足。为了应对这些挑战,我们提议AQuilt,这是一个框架,用于从相应的未贴标签数据(包括答案、问题、未贴标签数据、检查、逻辑和任务类型)中为任何专门领域构建教学调整数据的框架。通过纳入逻辑和检查,我们鼓励推理过程和自我检查来提高模型性能。此外,可定制的任务指示为任何任务提供了高质量的数据生成。因此,我们建造了一个703k的数据集,用于培训强大的数据合成模型。实验表明,AQuilt与DeepSeek-V3具有可比性,同时只使用17%的生产成本。进一步分析表明,我们生成的数据与下游任务的相关性更高。源代码、模型和脚本可在 https://gith/Kruusub.Q.


Article 8

Title@2025-07-24 (4): DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

Title: DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索 2507.18583v1

Authors (4): Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu

Electronic Health Records (EHRs) are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues. Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces DR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of DR.EHR, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperform all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models’ superiority across various match and query types, particularly in challenging semantic matches like implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, and supplementary experiments on EHR QA datasets demonstrate the models’ generalizability on natural language questions, including complex ones with multiple entities. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.

在临床实践中,电子健康记录(EHRs)是关键,然而,它们的检索仍然是一项主要由于语义差距问题造成的挑战。最近在密集检索方面的进展提供了很有希望的解决方案,但现有的通用和生物医学领域模式都因医疗知识不足或培训公司不匹配而落后。本文介绍了一套专为EHR检索而设计的密集检索模型\ textt{DR.EHR}。我们建议利用MIMIMI-IV排放摘要进行两阶段培训,以满足对广泛医学知识和大规模培训数据的需求。第一阶段涉及医学实体从生物医学知识图中提取和注入知识,而第二阶段则使用大型语言模型生成多种培训数据。我们分别培训了两种具有110M和7B参数的变种模式。我们对CliniQ基准进行了评估,我们的模式大大超越了所有现有的密集检索者,达到了最新技术成果。详细分析证实了我们模型在各种匹配和查询类型中的优势,特别是在具有挑战性的精度的精度性生物医学知识图表中,而第二阶段则使用大型语言模型来生成多样化的培训数据数据数据数据数据数据数据数据数据。Abrelate relialal 展示了Eregilationalalalational 和Elagilation 。


Article 9

Title@2025-07-24 (4): System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Title: System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung 供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告 2507.18580v1

Authors (4): Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li

This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.

本文介绍我们针对CCL25-Eval任务10的系统,该任务为细粒度中文仇恨言论识别(FGCHSR)。我们提出了一个新的SRAG-MAV框架,将任务重构(TR)、自检索增强生成(SRAG)和多轮累积投票(MAV)协同整合在一起。我们的方法将四元组抽取任务重构为三元组抽取,利用从训练集中的动态检索来构建上下文提示,并采用带投票的多轮推理来提高输出的稳定性和性能。我们的系统基于Qwen2.5-7B模型,在STATE ToxiCN数据集上取得了26.66的硬分数、48.35的软分数和37.505的平均分数,显著优于GPT-4o(平均分数15.63)和微调后的Qwen2.5-7B(平均分数35.365)等基线。代码可在 https://github.com/king-wang123/CCL25-SRAG-MAV 获取。
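
The multi-round voting idea in the SRAG-MAV abstract above can be illustrated with a minimal sketch. The abstract does not give the voting rule, so the stopping criterion here (return the first output that accumulates `threshold` votes, otherwise the plurality winner) and the names `accumulative_vote` and `sample` are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Hashable, Optional


def accumulative_vote(
    sample: Callable[[int], Hashable],
    max_rounds: int = 10,
    threshold: int = 3,
) -> Optional[Hashable]:
    """Run up to max_rounds inference rounds and accumulate votes per output.

    `sample(round_index)` is a hypothetical stand-in for one model inference
    call that returns a hashable prediction (for example a tuple of triplets).
    """
    votes: Counter = Counter()
    for round_index in range(max_rounds):
        output = sample(round_index)
        votes[output] += 1
        # Stop early once one output has accumulated enough votes.
        if votes[output] >= threshold:
            return output
    # No output reached the threshold: fall back to the most frequent one.
    return votes.most_common(1)[0][0] if votes else None


# Toy usage with a deterministic stand-in for the model.
_outputs = ["A", "B", "A", "C", "A", "B"]
result = accumulative_vote(lambda r: _outputs[r % len(_outputs)])
```

Early stopping keeps the number of inference calls low when the model is already consistent, while unstable inputs receive more rounds.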


Article 10

Title@2025-07-24 (4): P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

Title: P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v3

Authors (5): Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He

Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.

个人化的大型语言模型(LLMs)在许多应用中引起了极大的注意,例如情感支持和角色扮演,然而,现有的作品主要侧重于以清晰的性格简介为模型,同时忽视真正影响行为和决策的基本个性特征,阻碍着更人类形态和心理基础的人工智能系统的发展。在本文中,我们探索了五大个性特征的模型,这是在心理学中最广泛使用的特质理论,并提出P-React,这是专家(MoE)个人化的LM。 特别是,我们整合了个性特质损失(PSL),以更好地捕捉个人个性表达方式,提供更加细微和有心理基础的个性模拟。为了便利这一领域的研究,我们绘制了OCAN-Chat,一个高质量的、人性化的数据集,旨在训练LMS在各种专题中表达个性特征。广泛的实验表明P-React在保持一致性和真实性方面的有效性。


Article 11

Title@2025-07-24 (4): Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Title: Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs 宽放, 窄出: 为高效和有效DLLMs而可撤销的解码 2507.18578v1

Authors (8): Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao

Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.

扩散大型语言模型(DLLM)已成为自回归模型的一种有力替代方案,其设计目标是快速并行生成。然而,现有DLLM受困于严重的质量与速度权衡,更快的并行解码会导致显著的性能下降。我们将其归因于DLLM中标准解码的不可逆性:随着早期错误上下文的累积,解码很容易偏向错误的方向。为解决这一问题,我们提出了Wide-In, Narrow-Out(WINO),这是一种无需训练的解码算法,可在DLLM中实现可撤销解码。WINO采用并行的起草与验证机制,激进地起草多个词元,同时利用模型的双向上下文验证可疑词元并将其重新掩码以便改进。在LLaDA和MMaDA等开源DLLM上的验证表明,WINO能决定性地改善质量与速度的权衡。例如,在GSM8K数学基准上,它将推理加速6倍,同时将准确率提高2.58%;在Flickr30K图像描述任务上,它实现了10倍加速并取得更高的性能。我们还进行了更全面的实验,以展示WINO的优越性并提供对WINO的深入理解。


Article 12

Title@2025-07-24 (4): LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Title: LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架 2507.16809v2

Authors (10): Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Zhen-Yu Lin, Pin-Cheng Chen, Shu-Kai Hsieh

We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

我们提议LingBench++(LingBench+),这是一个语言知情的基准和推理框架(LLMs),旨在评价国际语言奥林匹克运动(IOL)所启发的复杂语言任务的大型语言模型(LLMs ) 。 与以前完全侧重于最终答案准确性的基准不同,LingBench+(LingBench+)提供了结构化推理痕迹、渐进式评价协议以及90多种低资源和跨文化语言的丰富类型元数据。我们进一步开发了多试剂架构,将语法知识检索、工具强化推理和深思熟虑的假设测试结合起来。 通过系统比较基线和我们拟议的代理模型,我们证明这些模型具备外部知识来源和迭代推理超越了准确性和可解释性的单方方法。 LingBench+(LingBench+)为推进LMs中语言基础、文化上知情和认知上合理的推理提供了全面基础。


Article 13

Title@2025-07-24 (4): SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

Title: SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz 安全工作-R1:根据AI-45$^{\circ}$法发展安全和情报 2507.18576v1

Authors (118): Shanghai AI Lab, :, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou

We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety 'aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

我们推出SafeWork-R1,这是一个前沿的多模态推理模型,展示了能力与安全的协同演进。它由我们提出的SafeLadder框架开发,该框架包含大规模、渐进式、面向安全的强化学习后训练,并由一组多原则验证器提供支持。与RLHF等仅学习人类偏好的以往对齐方法不同,SafeLadder使SafeWork-R1能够发展出内在的安全推理和自我反思能力,从而产生安全方面的“顿悟”时刻。值得注意的是,SafeWork-R1在安全相关基准上比其基础模型Qwen2.5-VL-72B平均提升46.54%,且不损害通用能力,并且与GPT-4.1和Claude Opus 4等领先的专有模型相比,实现了最先进的安全性能。为进一步增强其可靠性,我们实现了两种不同的推理时干预方法和一种审议式搜索机制,以执行步骤级验证。最后,我们进一步开发了SafeWork-R1-InternVL3-78B、SafeWork-R1-DeepSeek-70B和SafeWork-R1-Qwen2.5VL-7B。所有这些模型都表明,安全与能力可以协同共进,凸显了我们的框架在构建稳健、可靠、可信的通用人工智能方面的泛化能力。


Article 14

Title@2025-07-24 (4): Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning Agentar-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning Agentar-Fin-R1:通过领域专门知识、培训效率和高级推理加强金融情报 2507.16802v3

Authors (13): Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang

Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.

大型语言模型(LLM)在金融应用中展现出相当大的潜力;然而,现有模型在面对需要复杂推理能力、严格可信标准以及高效适应领域特定需求的场景时,往往表现出局限性。我们推出Agentar-Fin-R1系列金融大语言模型(8B和32B参数),它们基于Qwen3基础模型专门设计,以增强金融应用中的推理能力、可靠性和领域专业化。我们的优化方法将高质量、系统化的金融任务标签体系与全面的多层可信保障框架相结合。该框架包括高质量的可信知识工程、多智能体可信数据合成以及严格的数据验证治理。通过标签引导的自动化难度感知优化、两阶段训练流程和动态归因系统,我们在训练效率上取得了大幅提升。我们的模型在Fineva、FinEval和FinanceIQ等主流金融基准以及MATH-500和GPQA-diamond等通用推理数据集上接受了全面评估。为全面评估真实场景中的部署能力,我们创新性地提出了Finova评估基准,该基准侧重于智能体层面的金融推理和合规验证。实验结果表明,Agentar-Fin-R1不仅在金融任务上达到了最先进的性能,还展现出卓越的通用推理能力,验证了其作为高风险金融应用可信解决方案的有效性。Finova基准可在 https://github.com/antgroup/Finova 获取。


Article 15

Title@2025-07-24 (4): PosterMate: Audience-driven Collaborative Persona Agents for Poster Design

Title: PosterMate: Audience-driven Collaborative Persona Agents for Poster Design PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design PosterMate:由观众驱动的海报设计合作人员代理 2507.18572v1

Authors (4): Donghoon Shin, Daniel Lee, Gary Hsieh, Gromit Yeuk-Yin Chan

Poster designing can benefit from synchronous feedback from target audiences. However, gathering audiences with diverse perspectives and reconciling them on design edits can be challenging. Recent generative AI models present opportunities to simulate human-like interactions, but it is unclear how they may be used for feedback processes in design. We introduce PosterMate, a poster design assistant that facilitates collaboration by creating audience-driven persona agents constructed from marketing documents. PosterMate gathers feedback from each persona agent regarding poster components, and stimulates discussion with the help of a moderator to reach a conclusion. These agreed-upon edits can then be directly integrated into the poster design. Through our user study (N=12), we identified the potential of PosterMate to capture overlooked viewpoints, while serving as an effective prototyping tool. Additionally, our controlled online evaluation (N=100) revealed that the feedback from an individual persona agent is appropriate given its persona identity, and the discussion effectively synthesizes the different persona agents’ perspectives.

海报设计可受益于目标受众同步反馈。然而,收集不同观点的受众,并在设计编辑中调和这些受众,可能具有挑战性。最近典型的AI模型提供了模拟人型互动的机会,但尚不清楚如何将其用于设计中的反馈进程。我们引入了海报设计助理PosterMate,该海报设计助理通过创建由营销文件制作的由受众驱动的人为代理物促进协作。海报Mate收集了每个个人代理物对海报组成部分的反馈,并在主持人的帮助下促进讨论以得出结论。这些商定的编辑物可以直接融入海报设计。通过我们的用户研究(N=12),我们确定了PosterMate捕捉被忽视的观点的潜力,同时作为一种有效的原型工具。此外,我们控制的在线评估(N=100)显示,个人代理物的反馈适合其个人身份,讨论有效地综合了不同个人代理物的观点。


Article 16

Title@2025-07-24 (4): Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Title: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden 使用字节对等编码和K-MER方法的DNA语言模型混合化战略 2507.18570v1

Authors (2): Ganesh Sapkota, Md Hasibur Rahman

This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.

本文介绍了一种新的混合象征性化战略,通过将6-毫记式与Byte Pair Encoding (BPE-600)合并,提高DNA语言模型(DLM)的性能,从而增强DNA语言模型(DLM)的性能。传统的K-毫记式在捕捉当地DNA序列结构方面是有效的,但往往面临各种挑战,包括象征性分布不均和对全球序列背景的有限理解。为了解决这些局限性,我们提议将独特的6毫记式与通过600 BPE周期产生的最佳选择的BPE符号合并在一起。这种混合方法确保了平衡和符合背景的词汇,使模型能够同时捕捉DNA序列中的短长模式。在这种混合词汇方面受过培训的基础性DLM(基础性DLM)被评估,作为微调任务,使用下千兆米预测来评估,显示业绩得到显著改善。模型实现了3毫升的10.78%的预测,4毫升的10.1%,5米的预测值为4.12%,优劣的5米的模型,如NT、DNA-BERT2和GRERVDER2和GERVERDERD。这些结果突出的模型模型应用中的高级模型和背景信息基础的模型和下游分析将突出。
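
The hybrid vocabulary described in the abstract above (all 6-mers merged with tokens from 600 BPE cycles) can be sketched as follows. This is a toy illustration under stated assumptions: the BPE routine is a plain character-level merge loop, the abstract's "optimally selected" BPE tokens are not specified so every learned merge is kept, and the function names `all_kmers`, `bpe_tokens`, and `hybrid_vocab` are hypothetical.

```python
from collections import Counter
from itertools import product


def all_kmers(k=6, alphabet="ACGT"):
    """Enumerate every k-mer over the nucleotide alphabet (4**k tokens)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def bpe_tokens(sequences, num_merges):
    """Toy BPE: merge the most frequent adjacent token pair num_merges times.

    Returns the merged tokens in the order they were learned.
    """
    corpus = [list(seq) for seq in sequences]
    learned = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break  # nothing left to merge
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        learned.append(merged)
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return learned


def hybrid_vocab(sequences, k=6, num_merges=600):
    """Union of all k-mers and the learned BPE tokens, duplicates removed."""
    return list(dict.fromkeys(all_kmers(k) + bpe_tokens(sequences, num_merges)))
```

With `k=6` the k-mer part alone contributes 4096 tokens, and BPE merges longer than six bases add the longer-range patterns the abstract refers to.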


Article 17

Title@2025-07-24 (4): GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Title: GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung GIIFT: 图表制导感性不含图像的无图像多式机器翻译 2507.18562v1

Authors (2): Jiafeng Xiong, Yuting Zhao

Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

多式机器翻译(MMT)展示了机器翻译中视觉信息的巨大帮助,然而,现有MMT方法在利用模式差距方面面临着挑战,其方法是实施僵硬的视觉语言调整,同时仅限于在经过训练的多式联运领域进行推断;在这项工作中,我们制作了新的多式联运场景图,以保存和整合特定模式的信息,并引入了GIIFT,这是一个两阶段的图形引导无感光图像的GIIFT框架,这是一个两阶段的图形引导无感知图像MMMT框架,它使用跨式图形关注网络适配器,在一个统一的集成空间学习多式知识,并将其推广到更广泛的无图像翻译领域。英语对法语和英语对德语任务多30K数据集的实验结果表明,我们的GIIFT超越了现有方法,实现了最新技术,即使没有在推断过程中留下图像。WMT基准的结果表明,在无图像翻译基线方面有了重大改进,表明GIIFT的力度,以产生无图像推断。


Article 18

Title: Identity-related Speech Suppression in Generative AI Content Moderation Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation 在产生AI 内容调节中禁止与身份有关的言语 2409.13725v3

Authors (5): Grace Proebsting, Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaé Metaxa, Sorelle A. Friedler

Automated content moderation has long been used to help identify and filter undesired user-generated content online. But such systems have a history of incorrectly flagging content by and about marginalized identities for removal. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. While a lot of focus has been given to making sure such systems do not produce undesired outcomes, considerably less attention has been paid to making sure appropriate text can be generated. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech. We find that reasons for incorrect flagging behavior vary by identity based on stereotypes and text associations, with, e.g., disability-related content more likely to be flagged for self-harm or health-related reasons while non-Christian content is more likely to be flagged as violent or hateful. As generative AI systems are increasingly used for creative work, we urge further attention to how this may impact the creation of identity-related content.

长期以来,自动内容节制一直被用来帮助识别和过滤不理想的用户生成的在线内容。但是,这样的系统有着错误标记内容的历史,并且有被排斥的基因特性被清除的历史。 生成的AI系统现在使用这些过滤器,使不受欢迎的内容内容不会被用户创建或显示。 虽然许多重点都放在确保这些系统不会产生不理想的结果上, 但对于确保产生适当的文本的关注却少得多。 从教室到好莱坞,由于基因化的AI越来越多地用于创造性或表达式文本的生成,这些技术将允许讲述这些故事,并且它们将压制这些技术? 在本文件中,我们定义和引入了语言抑制措施,侧重于与不同身份特征群体有关的言论,通过内容节制的调适度APIs,我们发现用户生成的数据集既短格式,在内容调适中,又在更具有致色化性的数据,我们在这项工作中引入了两个数据集,我们为测量9个身份组的言调抑制言论内容建立了基准。 一种传统和四种以基因为主的自动化的自动内容节制的自动节制, 。我们发现,在与身份特征相关的原因可能更多被错误地用于与身份相关的行为。


Article 19

Title@2025-07-24 (4): LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind LagKV: KV 缓存的滞后相对信息揭示哪些 Token 重要 2504.04704v2

Authors (4): Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li

The growing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts have relied on attention weights to evict non-critical cache tokens. But those methods carry a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparison among the KV entries themselves. It is a completely attention-free method that integrates easily into mainstream inference platforms and delivers performance comparable to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.

在大型语言模型的长上下文推理过程中,键值(KV)缓存不断增大,这是在部署成本与任务准确性之间取得平衡的主要障碍。为了在此类场景中减小KV缓存,以往的大多数工作利用注意力权重来驱逐非关键的缓存词元。但这些方法存在权衡:它们通常需要对推理基础设施进行重大修改,并带来显著的计算开销。基于大型语言模型是自回归模型这一事实,我们提出了LagKV,这是一种仅依赖KV自身之间直接比较的KV压缩策略。它是一种完全不依赖注意力的方法,易于集成到主流推理平台,并且性能与其他复杂的KV压缩方法相当。RULER基准上的结果表明,我们的方法在不同压缩比下均优于SnapKV和StreamingLLM。特别是在64位密钥检索任务中,在相同压缩比下,我们的方法比基于注意力权重的方法$H_2O$高出50%以上。我们的代码可在 https://github.com/AI-Lab-China-Merchants-Bank/LagKV 获取。


Article 20

Title@2025-07-24 (4): GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

Title: GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle GLINER2:具有Schema-Driven界面的高效多任务信息提取系统 2507.18546v1

Authors (5): Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis

Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.

信息提取(IE)对于许多国家语言平台应用来说至关重要,然而,现有的解决方案往往需要不同任务的专门模式或依赖计算昂贵的大型语言模型。我们介绍了GLINER2,这是一个加强原始GLINER结构的统一框架,以支持在单一高效模式中识别名称实体、文本分类和等级结构化数据提取。GLINER2建起了经过预先训练的变压器编码器结构,GLINER2保持了CPU效率和紧凑规模,同时通过直观的基于化学的界面引入多任务构成。我们的实验表明,与基于LLM的替代品相比,在部署无障碍性方面有很大改进。我们发布了GLINER2,作为开放源的可安装图书馆,在https://github.com/fastino-ai/GLINER2上提供经过训练的模型和文件。


Article 21

Title@2025-07-24 (4): Effective Multi-Task Learning for Biomedical Named Entity Recognition

Title: Effective Multi-Task Learning for Biomedical Named Entity Recognition Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung 有效多任务学习促进生物医学命名实体的识别 2507.18542v1

Authors (4): João Ruano, Gonçalo M. Correia, Leonor Barreiros, Afonso Mendes

Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.

生物医学命名实体确认由于生物医学术语的复杂性和跨数据集注解的不一致性而面临重大挑战,本文件介绍了SRU-NER(基于Slot的经常性单位NER),这是一种新颖的办法,旨在处理嵌入的命名实体,同时通过有效的多任务学习战略将多个数据集纳入其中。SRU-NER通过动态调整损失计算来缩小批注差距,以避免惩罚对特定数据集中不存在的实体类型的预测。通过广泛的实验,包括跨公司评价和人对模型预测的评估,SRU-NER在生物医学和一般领域NER任务中实现了竞争性业绩,同时改进了跨主题的一般化。
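
The loss adjustment mentioned in the SRU-NER abstract above (not penalizing predictions of entity types that a dataset does not annotate) can be sketched as a masked binary cross-entropy. This is an illustrative assumption about the general idea, not the paper's actual loss; the function name `masked_bce` and the per-type probability dictionaries are hypothetical.

```python
import math


def masked_bce(probs, labels, annotated_types):
    """Mean binary cross-entropy over the entity types a dataset annotates.

    probs: dict mapping entity type -> predicted probability for one span.
    labels: dict mapping entity type -> 1 if the span has that type, else 0.
    annotated_types: set of entity types labeled in the example's source dataset.
    Types outside annotated_types are skipped, so predicting them costs nothing.
    """
    eps = 1e-12
    total, count = 0.0, 0
    for entity_type, p in probs.items():
        if entity_type not in annotated_types:
            continue  # the dataset cannot say whether this prediction is wrong
        y = labels.get(entity_type, 0)
        p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0


# A span predicted as both Gene and Disease, from a corpus that only labels genes.
probs = {"Gene": 0.9, "Disease": 0.8}
labels = {"Gene": 1}
loss_masked = masked_bce(probs, labels, {"Gene"})
loss_unmasked = masked_bce(probs, labels, {"Gene", "Disease"})
```

In the masked case the confident Disease prediction is ignored, whereas treating the missing label as a negative would push the model away from a possibly correct prediction.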


Article 22

Title@2025-07-24 (4): The Moral Gap of Large Language Models

Title: The Moral Gap of Large Language Models Die moralische Kluft großer Sprachmodelle 大语言模式的道德差距 2507.18523v1

Authors (2): Maciej Skorski, Alina Landowska

Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.

检测道德基础对于分析社会话语和发展符合道德要求的人工智能系统至关重要。虽然大型语言模型在各种任务中都非常出色,但它们在专门道德推理方面的表现仍然不明确。本研究报告首次全面比较了利用ROC、PR和DET曲线分析的Twitter和Reddit数据集中最先进的LMS和经精细调整的变压器。结果显示,业绩差距很大,尽管迅速进行了工程工作,但LOMS表现出很高的假负率和对道德内容的系统检测不足。这些调查结果表明,具体任务的微调仍然优于道德推理应用。


Article 23

Title@2025-07-24 (4): GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Title: GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Charaktergleichheitsnetzwerke GCC-Spam:通过GAN、对比学习和字符相似性网络检测垃圾邮件 2507.14679v2

Authors (3): Zhijie Wang, Zixin Xu, Zhiyuan Pan

The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

互联网上垃圾邮件文本的指数增长要求建立强有力的检测机制,以减少信息泄漏和社会不稳定等风险。这项工作解决了两个主要挑战:垃圾邮件使用的对抗策略和标签数据稀缺。我们提出了一个新的垃圾邮件文本检测框架(GCC-Spam),其中结合了三个核心创新。首先,特征相似网络捕捉了拼写和语音特征,以对抗字符模糊攻击,并进一步为下游分类提供了句子嵌入。第二,对比学习通过优化垃圾邮件与正常文本之间的潜空距离,增加了差异性。第三,基因自动网络(GAN)生成现实的假垃圾样本,以减轻数据稀缺,同时提高模型的稳健性和分类准确性。关于现实世界数据集的广泛实验表明,我们的模型比基线方法更符合基准要求,通过少几个标签实例实现更高的检测率。


Article 24

Title@2025-07-24 (4): Exploiting individual differences to bootstrap communication

Title: Exploiting individual differences to bootstrap communication Nutzung individueller Unterschiede zur Bootstrap-Kommunikation 利用个人差异进行靴套通信 2504.05211v2

Authors (2): Richard A. Blythe, Casimir Fisch

Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.

建立通信系统是困难的,因为最初生成的接收者不知道信号的预期含义,而信号员也不知道该信号是如何解释的。大多数关于通信系统出现的理论性描述都依靠反馈来强化过去成功通信的行为。然而,提供这种反馈需要已经能够传达原意或解释的含义。因此,这些描述无法解释如何使通信摆脱非沟通行为。我们在这里展示了一个模型,显示一个通信系统,能够表达无限制的含义数目的,如何会因为广大人口中的个人行为差异而出现,而没有事先存在的任何手段来确定通信成功。两种关键认知能力对产生这一结果负有责任,在特定情况下是可以预测的,在信号制作之前的心理状态是来自共同的故意的。由于这两种能力都可以独立于通信而存在,因此我们的结果与这样的理论是相容的,在这种理论中,像语言这样的大型灵活的社会学习通信系统是社会认知能力一般但相当发达的产物。


Article 25

Title@2025-07-24 (4): Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Title: Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen 并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习 2507.18504v1

Authors (4): Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
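
The idea of restricting attention to a sparse dependency graph can be illustrated with a small sketch. It assumes a hard binary mask over feature pairs, which is a simplification: the abstract describes a learned, dynamic graph module, and its details are not given here. The function name, scores, and adjacency matrix are hypothetical.

```python
import math

def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over attention scores, keeping only graph edges.

    scores[i][j]    : raw attention logit from feature i to feature j
    adjacency[i][j] : 1 if the dependency i -> j is kept, else 0
    Self-loops are always kept so every row has at least one valid entry.
    (Hard masking is an assumption; it is not taken from the paper.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        kept = [j for j in range(n) if adjacency[i][j] or i == j]
        row_max = max(scores[i][j] for j in kept)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - row_max) for j in kept}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Toy example: features 0 and 1 depend on each other, feature 2 is isolated.
scores = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
weights = masked_attention_weights(scores, adjacency)
```

In this toy setup, feature 2 receives no attention from features 0 and 1 even though its raw scores are the largest, which is the dilution effect the mask is meant to prevent.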

大型语言模型(LLMS)显示了通过模拟文本化特效对等生成表格式数据的巨大潜力。然而,表格数据本身显示的特征依赖性很少,许多特征互动在结构上是微不足道的。这造成了一种根本的不匹配,因为LLMS的自我注意机制不可避免地将重点分散到所有对等之间,分散了对关键关系的关注,特别是在具有复杂依赖性或语义模糊特征的数据集中。为解决这一局限性,我们提议Grade(Grade-Guid Dispidence Learning)(Grade)(Grade-Guid Destandings),这是一种新颖的方法,明确将稀少的依赖性图表纳入LLMS的注意机制。 Grade使用了一个由外部提取功能依赖性指导的轻量度动态图形学习模块,在抑制无关性数据的同时将关键特征互动置于优先位置。我们跨越各种现实世界数据集的实验表明,GraredDe 将现有的LMM方法比复杂的数据集高出12%,同时在合成数据质量中以最先进的方法取得竞争性的结果。我们的方法很少具有侵入性,但有效,为与LMLMSMSLM的表式数据模型提供实用的解决办法提供了实用的解决办法。


Article 26

Title@2025-07-24 (4): LLM-based Embedders for Prior Case Retrieval

Title: LLM-based Embedders for Prior Case Retrieval LLM-basierte Embedders für frühere Fallwiederherstellung 用于先前个案检索的LLM 以LLM为基础的嵌入器 2507.18455v1

Authors (3): Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov

In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. State-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: (i) lengthy legal text: powerful BERT-based transformer models impose a limit on input length, so the input must be shortened via truncation or division, which loses legal context; (ii) lack of legal training data: due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. We evaluate state-of-the-art LLM-based text embedders on four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
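
Unsupervised retrieval with an embedder usually reduces to ranking candidates by vector similarity. The sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real LLM embeddings; the case identifiers and the `rank_cases` helper are hypothetical, and the abstract does not state which similarity measure the authors use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_cases(query_vec, case_vecs):
    """Return (case_id, similarity) pairs, most similar first."""
    scored = [(case_id, cosine(query_vec, vec)) for case_id, vec in case_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder 2-d "embeddings"; a real embedder would produce much longer vectors.
case_vecs = {"case_a": [1.0, 0.0], "case_b": [0.6, 0.8], "case_c": [0.0, 1.0]}
ranking = rank_cases([1.0, 0.1], case_vecs)
```

No training step appears anywhere in this loop, which is why the approach sidesteps the shortage of labelled PCR data.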

在普通法体系中,律师和法官等法律专业人员依靠先例来确立其论点。随着案件数量随着时间而大幅增长,有效检索以前的案件变得至关重要。先前的案件检索是一项信息检索任务,目的是从大量潜在候选人中自动确定与具体查询最相关的法院案件。尽管在过去几年里,IR方法出现了几处范式变化,但绝大多数PCR方法仍然依赖传统的IR方法,如BM25。由于以下两大挑战,最先进的深入学习的IR方法在PCR中并不成功:一. 时间性的法律文本限制;在使用强大的BERT型变压器模型时,输入的文本长度有一定的限制,这不可避免地需要通过调压或分来缩短输入,同时丢失法律背景信息。二. 缺乏法律培训数据;由于数据隐私问题,基于PCR的数据集在规模上往往有限,因此难以有效地培训深层次的学习模型。在这项研究中,我们通过使用更强大的LRMLM模型来应对这些难题,我们不使用更长期的LRM数据库。我们通过使用更精确的LMLM数据库来应对这些难题。


Article 27

Title@2025-07-24 (4): Generation of Synthetic Clinical Text: A Systematic Review

Title: Generation of Synthetic Clinical Text: A Systematic Review Generieren von synthetischem klinischem Text: Ein systematischer Test 合成临床文本的生成:系统审查 2507.18451v1

Authors (5): Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, Venkata Satagopam

Generating synthetic clinical text is an effective solution for common clinical NLP issues such as sparsity and privacy. This paper presents a systematic review of synthetic medical free-text generation, with a quantitative analysis of three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv for publications associated with generating synthetic medical unstructured free-text, and identified 94 relevant articles out of 1,398 collected. The generation of synthetic medical text has received a great deal of attention from 2018 onwards, with the main purposes being text augmentation, assistive writing, corpus building, privacy preservation, annotation, and usefulness. Transformer architectures, especially GPTs, were the predominant technique used to generate the text. Evaluation covered four main aspects: similarity, privacy, structure, and utility, with utility the most frequently used to assess the generated text. Although generated synthetic medical text showed only a moderate ability to stand in for real medical documents in downstream NLP tasks, it has proven to be a great asset as an augmentation that complements real documents, improving accuracy and overcoming sparsity and undersampling issues. Yet privacy remains a major issue in generating synthetic medical text, and more human assessment is needed to check for the presence of sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, avoiding the time-consuming legalities of data transfer.

本文的目的是通过对以下三个研究问题进行定量分析,对合成医学自由文本的生成进行系统审查:一) 生成目的,(二) 技术,和(三) 评价方法。我们搜索了PubMed、ScienceDirect、Sweb of Science、Scopus、IEEE、Google学者和与生成合成医学无结构自由文本有关的出版物的Arxiv数据库。我们从所收集的1 398篇文章中找出了94个相关文章。从2018年起,对合成医学自由文本的生成给予了极大关注,因为合成医学文本的主要目的是增加、协助书写、建筑、隐私保护、注释和效用。我们搜索了PubMed、ScienceDirect、ScienceDirective、Servical Drality、Silvical Reportalal-Ls, 其生成的快速医学文件的最常用方法,而其快速的医学文件的提取速度、隐私、结构及实用性,在不断更新的医学文件后期中,其快速更新的版本将显示为不断更新的版本。


Article 28

Title@2025-07-24 (4): Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

Title: Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language Wiederherstellung des Rhythmus: Pünktlichkeitsrestaurierung mit Transformer-Modellen für Bangla, eine Sprache mit geringer Ressource 恢复时速:使用孟加拉国低资源语言 “ 孟加拉 “ 变压器模型恢复脉冲 2507.18448v1

Authors (4): Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu

Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
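
Punctuation restoration is commonly framed as token classification: each word is labelled with the mark that follows it, or with a "no punctuation" label. The sketch below shows how such training pairs could be derived from punctuated text. The label names and the `to_token_labels` helper are hypothetical, and the paper's actual preprocessing is not described in the abstract.

```python
# Map each trailing mark to a label; the danda (U+0964) is the Bangla full stop.
PUNCT_LABELS = {
    ".": "PERIOD",
    "\u0964": "PERIOD",
    ",": "COMMA",
    "?": "QUESTION",
    "!": "EXCLAMATION",
}

def to_token_labels(text):
    """Split on whitespace and label each word by its trailing punctuation mark.

    Words with no trailing mark get the label "O". A token that consists only
    of a punctuation mark is dropped (a simplification of this sketch).
    """
    pairs = []
    for raw in text.split():
        label = "O"
        if raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append((raw, label))
    return pairs

pairs = to_token_labels("hello, how are you? fine.")
```

Stripping the labels from `pairs` gives the unpunctuated input, and the labels themselves are the prediction targets.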

标点恢复会提高文字的可读性,对于自动语音识别(ASR)中的后处理任务至关重要,特别是对孟加拉语等低资源语言而言。在本研究中,我们探索了以变压器为基础的模型的应用,特别是XLM-ROBERTAUG型,以自动恢复未标点的孟加拉语文本中的标点。我们侧重于预测四个标点:时期、逗号、问题标记和在不同文本域的感光标记。为了解决附加说明的资源稀缺的问题,我们建立了一个庞大的、多样的训练资料库和应用的数据增强技术。我们最优秀的模型,以阿尔法=0.20%的扩增因数来培训,在新闻测试集上实现了97.1%的准确度,在参考集上实现了91.2%的准确度,在ASR集上实现了90.2%的准确度。结果显示参考和ASR记录有很强的概括性,展示了模型在现实世界中的效能,噪音情景。这项工作为Bangla punctuation 恢复奠定了一个强大的基线,并且为公众可获取的数据设置和代码以支持未来低资源NPPROP的研究提供了支持未来研究。


Article 29

Title@2025-07-24 (4): AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

Title: AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten 阿拉伯表格:按基准确定LLM女士对阿拉伯表格数据的理由和理解 2507.18442v1

Authors (3): Rana Alshaikh, Israa Alghanmi, Shelan Jeawak

The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.

大型语言模型(LLMS)的认知和推理能力使自然语言处理取得了显著进展,然而,它们在解释结构化数据,特别是表格格式数据方面的表现仍然有限。虽然英文表格数据的基准广泛可用,但阿拉伯文的代表性仍然不足,因为公共资源有限,而且具有独特的语言特征。为了解决这一差距,我们提出了AraTable,这是一个新颖和全面的基准,旨在评价LLMS在应用阿拉伯文表格数据时的推理和理解能力。AraTable包含各种评价任务,例如直接回答问题、事实核实和复杂的推理,涉及广泛的阿拉伯表格来源。我们的方法遵循混合管道,其中LLMS生成了初始内容,随后由人类专家过滤和核实,以确保高数据集质量。使用AraTable的初步分析表明,虽然LMS在直接回答问题等简单表格任务上表现良好,但在任务需要更深入推理和核实时,他们继续面临重大的认知挑战。这表明,今后的工作有很大机会改进复杂的表格推理工作的绩效。我们还提议了一个完全自动化的评价框架,利用自我思考机制产生初步内容,然后由人类专家筛选分析,这种有价值的分析基础,这种结构分析可以提供。


Article 30

Title@2025-07-24 (4): IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

Title: IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung ICPCGRL: 程序生成阶段语言教学强化学习 2503.12358v4

Authors (5): In-Chang Baek, Sung-Hyun Kim, Seo-Young Lee, Dong-Hyeon Kim, Kyung-Joong Kim

Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.

最近的研究突出了自然语言在加强基因模型的可控性方面的重要性。虽然作出了各种努力来利用自然语言来生成内容,但利用基于文本的指示来生成程序内容的深度强化学习(DRL)代理物的研究仍然有限。在本文件中,我们建议采用基于指令的程序性内容生成方法IPCGRL, 这是一种通过强化学习产生程序内容的方法,其中包括一个包含句子的嵌入模式。ICPCGRL 微调具体任务嵌入代表物,以有效压缩游戏级条件。我们评估了二维级生成任务中的IPCGRL, 并将其性能与通用嵌入方法进行比较。结果显示,IPCGRL在控制性方面实现了21.4%的改进,在对无形指令的一般性方面实现了17.2%的改进。此外,拟议方法扩大了有条件投入的方式,为程序内容生成提供了更加灵活和明确的互动框架。


Article 31

Title@2025-07-24 (4): DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Title: DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten DFAME: 与多式联运专家进行动态证据法检查 2412.10510v4

Authors (4): Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.

虚假信息的扩散要求可靠和可扩缩的事实检查解决方案。 我们向多模式专家(DEFAME)展示基于证据的动态FA-CLC- Checking,这是一个模块化、零发MLLM管道,用于公开域名、文本图像索赔核查。 DeFAME在六阶段过程中运作,动态地选择工具,搜索深度以提取和评价文本和视觉证据。不同于以往只使用文本、缺乏解释性或完全依赖参数知识的方法,DEFAME进行端到端核查,在生成结构化、多式报告的同时,对索赔和证据中的图像进行会计核算。对流行基准VERITTE、AVerITec和MOCHEG的评价显示,DEFAM超越了以往所有方法,确立了自己作为新的单式和多式事实检查的最新数据核对系统。此外,我们引入了新的多式联运基准,即索赔审查2024+,在GPT-4o知识关闭后进行索赔,避免数据泄漏。在这里,DEFAM大大地超越了GPT-4的实际情况基线,显示了时间性检查的可能性。


Article 32

Title@2025-07-24 (4): How do language models learn facts? Dynamics, curricula and hallucinations

Title: How do language models learn facts? Dynamics, curricula and hallucinations Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen 语言模式如何了解事实?动态、课程和幻觉 2503.21676v2

Authors (6): Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De

Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

大型语言模型在培训前积累了大量知识,但是关于这一获取的动态仍然不甚了解。这项工作调查了语言模型在合成事实回顾任务方面的学习动态,发现了三个主要结论:首先,语言模型分三个阶段学习,在获得准确事实知识之前展示一个性能高原。机械上,这一高原与形成支持回顾的基于关注的电路相吻合。第二,培训数据分布对学习动态产生重大影响,因为分布不平衡导致高原缩短。最后,幻觉与知识同时出现,通过微调将新知识纳入模型具有挑战性,因为它迅速腐蚀了其现有的准光学记忆。我们的结果强调在获取知识方面进行数据分配的重要性,并建议新的数据列表战略,以加快神经网络培训。


Article 33

Title@2025-07-24 (4): FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

Title: FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度 2507.18417v1

Authors (3): Giorgos Iacovides, Wuyang Zhou, Danilo Mandic

Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on the average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel ‘logit-to-score’ conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
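
The abstract does not spell out the 'logit-to-score' conversion, so the following is only one plausible reading: apply a softmax to the logits of the sentiment labels and use the difference between the positive and negative probabilities as a continuous, rankable score. The label names, logit values, and scoring rule are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_score(logits_by_label):
    """Continuous score in [-1, 1]: P(positive) - P(negative).

    This scoring rule is an assumption of the sketch, not the paper's definition.
    """
    labels = list(logits_by_label)
    probs = dict(zip(labels, softmax([logits_by_label[label] for label in labels])))
    return probs["positive"] - probs["negative"]

# Hypothetical label logits for three headlines.
headlines = {
    "A": {"positive": 2.0, "neutral": 0.0, "negative": -1.0},
    "B": {"positive": 0.0, "neutral": 1.0, "negative": 0.0},
    "C": {"positive": -1.0, "neutral": 0.0, "negative": 2.0},
}
ranked = sorted(headlines, key=lambda name: sentiment_score(headlines[name]), reverse=True)
```

A ranking of this kind is what a long-short portfolio construction would need, since discrete labels alone cannot order assets within the same class.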

在线金融相关文本数据表达的意见正在对贸易决策和市场流动产生日益深刻的影响。这一趋势凸显了情绪分析作为量化这类意见的性质和力度的工具的重要作用。随着GenAI(GenAI)的快速发展,监管的微调(SFT)大型语言模型(LLMS)已成为金融情绪分析的实际标准。然而,SFT模式可能导致培训数据的记忆化,而且往往无法向看不见的样本推广。这是金融领域的一个严重局限性,在这方面,模型必须适应以前未曾观察到的不透明事件以及微调的、特定领域的金融语言。为此,我们引入了FinDPO,这是以培训后的人优惠调整为基础的第一个针对具体财务的LMM框架。拟议的FINDPO在标准情绪分类基准上达到最先进的业绩,在平均上比现有的受监管的微调模式高出11%。 金融领域最强的FinDPO框架能够将一个精确调整的、甚至精确的LLMM(甚至精确的LM ) 纳入到真实的快速的货币组合战略之中,通过新式的货币级的货币级的汇率,以持续地显示不断的货币级的货币级的汇率基础。


Article 34

Title@2025-07-24 (4): ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Title: ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models Explica: Explizite kausale Vernunft in großen Sprachmodellen bewerten ExpliCa:在大语言模型中评估明确的原因原因 2502.15487v3

Authors (7): Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
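
For readers unfamiliar with perplexity-based evaluation, a minimal sketch follows. It uses the standard definition, the exponential of the negative mean token log-probability, and picks the connective whose sentence the model finds least surprising. The log-probabilities below are invented placeholders, and the exact scoring protocol used for ExpliCa may differ.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for the same event pair joined by
# two different connectives.
candidates = {
    "because": [-0.5, -0.7, -0.3],
    "then": [-1.2, -0.9, -1.5],
}
best = min(candidates, key=lambda conn: perplexity(candidates[conn]))
```

Lower perplexity means the model assigns higher probability to that wording, so `best` is the connective the model implicitly prefers.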

大型语言模型(LLMs)越来越多地用于需要解释和推断准确性的任务。在本文中,我们引入ExpliCa,这是一个用于以明确的因果关系推理来评价LLMs的新数据集。ExpliCa 将不同语言顺序中呈现的、以语言连接方式明确表达的因果关系和时间关系单独结合起来。该数据集以众源人类可接受性评级丰富。我们通过快速和基于模糊度的衡量标准对ExpliCa 上的LLMs进行了测试。我们评估了7个商业和开放源LMs,显示即使是顶级模型也难以达到0.80的准确性。有趣的是,模型往往混淆了与因果关系的时际关系,其性能也受到事件语言顺序的强烈影响。最后,基于重复性的计分数和快速性性性表现受模型大小的不同影响。


Article 35

Title@2025-07-24 (4): Factual Inconsistencies in Multilingual Wikipedia Tables

Title: Factual Inconsistencies in Multilingual Wikipedia Tables Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen 多语言维基百科表格中的事实不一致 2507.18406v1

Authors (6): Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia’s structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

维基百科是一个全球可访问的知识来源,其内容为300多种语言。 维基百科尽管覆盖了相同的主题,但不同的版本维基百科是独立写作和更新的。 这会导致影响百科全书和AI系统的中立性和可靠性的事实不一致,这些系统往往依赖维基百科作为主要的培训来源。 本研究报告调查维基百科结构内容中跨语言的不一致之处,重点是表格数据。 我们开发了一种方法来收集、统一和分析维基百科多语文章中的表格,界定不一致的类别。 我们用各种定量和定性衡量标准来评估多语种一致性,使用抽样数据集评估多语种一致性。 这些洞见对事实验证、多语种知识互动和设计利用维基百科内容的可靠AI系统产生了影响。


Article 36

Title@2025-07-24 (4): CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Title: CLEAR: Error Analysis via LLM-as-a-Judge Made Easy CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht CLLEAR:通过LLM-as-a法官进行错误分析 2507.18392v1

Authors (5): Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model’s performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.
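
The final quantification step can be pictured as counting how many instances exhibit each issue. This is a hypothetical sketch, not CLEAR's actual API: the function name and issue labels are invented, and it assumes each instance has already been mapped to a list of issue labels.

```python
from collections import Counter

def issue_prevalence(instance_issues):
    """Fraction of instances exhibiting each issue.

    instance_issues: one list of issue labels per evaluated instance.
    An issue is counted at most once per instance.
    """
    n = len(instance_issues)
    counts = Counter(issue for issues in instance_issues for issue in set(issues))
    return {issue: count / n for issue, count in counts.items()}

# Hypothetical per-instance issue labels.
instance_issues = [
    ["missing_citation"],
    ["missing_citation", "arithmetic_error"],
    [],
    ["arithmetic_error", "arithmetic_error"],
]
prevalence = issue_prevalence(instance_issues)
```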

对大语言模型的评价越来越依赖作为法官的其他LLMs。然而,目前的评价模式通常产生单一的评分或排名,回答哪个模型更好,而不是原因。这些顶级评分虽然对于基准衡量至关重要,但却模糊了模型性能背后的具体、可操作的原因。为了缩小这一差距,我们引入了CLEAR,这是一个用于基于LLM的错误分析的互动式、开放源包。CLEAR首先生成了每份文字反馈,然后生成了一套系统级错误问题,并量化了每个问题的普遍性。我们的软件包还为用户提供了一个互动的仪表板,允许通过综合可视化来进行全面的错误分析,应用互动过滤器来孤立具体问题或得分范围,并钻探出能够体现特定行为模式的单个实例。我们展示了对RAG和数学基准的CLEAR分析,并通过用户案例研究展示其实用性。


Article 37

Title@2025-07-24 (4): Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v2

Authors (6): David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
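
The free-rider incentive is easiest to see in the payoff rule of a standard linear public goods game, sketched below. The endowment and multiplier values are arbitrary placeholders, and the sketch omits the sanctioning institutions that the paper adds on top.

```python
def public_goods_payoffs(contributions, endowment=20.0, multiplier=1.6):
    """Payoff per player: endowment - own contribution + equal share of the pot.

    The pot is the sum of contributions scaled by `multiplier`, split evenly
    among all players regardless of who contributed.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Three full contributors and one free-rider.
payoffs = public_goods_payoffs([20.0, 20.0, 20.0, 0.0])
```

With these numbers the free-rider ends up with 44 while each contributor gets 24, even though universal contribution (32 each) beats universal defection (20 each); costly sanctioning is one mechanism for closing that gap.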

随着大型语言模式(LLMS)越来越多地作为自主代理人部署,理解它们的合作和社会机制变得越来越重要,特别是LLMS如何平衡自我利益和集体福祉是确保协调、稳健和安全部署的关键挑战。在本文件中,我们审查了多试剂LLM系统中高成本制裁的挑战,因为代理人必须决定是否将自己的资源投入到鼓励合作或惩罚叛逃方面。为了研究这一点,我们调整了一种公益游戏,从行为经济学中选择机构,使我们得以观察不同的LLMS如何在反复互动中渡过社会困境。我们的分析揭示了四种不同的模式行为模式:一些模式一贯建立和维持高水平的合作,另一些模式在参与和脱离接触之间波动,有些在一段时间内逐渐减少合作行为,而另一些则不论结果如何都严格遵循固定战略。令人惊讶的是,我们发现LMMS的理由,如O1系列,与合作的难度很大,而一些传统的LMS始终保持高度的合作水平。这些调查结果表明,目前改进LMS的方法侧重于加强其推理能力,但并不必然导致我们LMS/Ms公司之间现有的有价值的合作规则。


Article 38

Title@2025-07-24 (4): Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Title: Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs 超越简介:从地平面事实到深人模拟LLMM 2502.12988v3

Authors (6): Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen

Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information, or on limited role-play dialogue datasets, to capture a character's responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and the distinctive thought patterns manifested in a character's written works. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun's internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, in which a general linguistic style expert collaborates with other task-specific experts to better learn both the language style and the deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep persona simulation with LLMs while considering the importance of ethical standards.

个人模拟大型语言模型(LLMS)的以往方法通常依赖于学习基本的简历信息,或者使用有限的角色对话数据集来捕捉性格的响应。然而,一个人的整体代表性不仅仅是表面层面的事实或对话,而是更深入的思考和思考。在这项工作中,我们引入了StraBot,这是一个模式,旨在复制语言模式和一个字符文字作品所显示的独特思想模式。我们用著名的中国作家Lu Xun作为案例研究,我们建议从他的17篇文章收藏中得出四项培训任务,其中包括侧重于掌握外部语言结构和知识的培训前任务,以及三项微调任务:多选择问题回答、基因化问题回答和风格转换,每个都使LLMM与Lu Xun的内部思想和写作风格相一致。为了优化这些任务之间的学习,我们引入了Char LoRA参数更新机制,让一般语言风格专家与其他具体任务专家合作,以更好地研究语言风格和深层次思想。我们评估了三种语言准确性和深层次理解任务,我们评估了三种任务,即语言精确性和理解语言质量标准,同时思考了我们深层次的道德标准。


Article 39

Title@2025-07-24 (4): Mechanistic Indicators of Understanding in Large Language Models

Title: Mechanistic Indicators of Understanding in Large Language Models Mechanistische Indikatoren des Verstehens in großen Sprachmodellen 大语言模型中理解力的机械指标 2507.08017v3

Authors (2): Pierre Beckmann, Matthieu Queloz

Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. We offer an accessible synthesis of these findings that doubles as an introduction to MI while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” connecting these facts. However, these forms of understanding remain radically different from human understanding, as the phenomenon of “parallel mechanisms” shows. We conclude that the debate should move beyond the yes-or-no question of whether LLMs understand to investigate how their strange minds work and forge conceptions that fit them.

在机械学解释(MI)方面最近发现,这是对大语言模型(LLMS)内部运行过程的实地调查,我们对这些模型完全依赖表面统计的观点提出了挑战。我们提供了这些结果的可获取综合,这些结果的介绍是MI的介绍的双重,同时将这些结论纳入一个思考机器理解的新理论框架之中。我们认为,LLMS开发的内部结构在功能上与包含观察连接的某种理解相似。为了强化这一想法,我们建议了一个三层理解概念。首先,当模型的形式“特点”作为潜在空间的方向时,概念理解就出现了,学习了不同表现形式之间的关联。第二,当模型学习特征与动态跟踪世界变化之间的或有实际联系时,世界状态理解就出现了。第三,当模型不再依赖集聚的记忆性事实并发现连接这些事实的“联系”。然而,这些理解形式仍然与人类理解截然不同,正如“平行机制”的现象所显示的那样。我们的结论是,辩论应当超越正反概念,而将其理解为理解。


Article 40

Title@2025-07-24 (4): Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Title: Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz 宣传探测混合说明:将LLM预告与人类情报相结合 2507.18343v1

Authors (6): Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt

Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.
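
Inter-annotator agreement for two annotators is often reported as Cohen's kappa, which corrects raw agreement for chance. The abstract does not say which agreement statistic the study uses, so kappa is an assumption here, and the label names are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (observed - expected) / (1 - expected), where `expected` is the
    agreement two independent annotators with these label frequencies would
    reach by chance.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations of four text spans.
annotator_1 = ["loaded_language", "loaded_language", "none", "none"]
annotator_2 = ["loaded_language", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5 and not 0.75.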

由于任务复杂,而且标签数据质量有限,在社交媒体上对方案的宣传探测仍然具有挑战性。本文件介绍了一个将人的专门知识与大语言模型(LLM)援助相结合的新框架,以提高说明的一致性和可扩展性。我们建议进行等级分类,将14种细微分宣传技术分为三大类,对高专级方案数据集进行人类批注研究,显示微分标签标签的跨级协议低,并采用LLM协助的预先批注管道,从中提取宣传范围,提供简明的解释,并指定地方标签和全球标签。二级人类核查研究表明,在协议和时间效率两方面都有重大改进。在此基础上,我们微调较小的语言模型进行结构化的批注。我们除了微调人类说明外,我们还对高质量的LM系统进行了培训,允许一个大型模型来制作这些说明,一个较小的模型来通过知识蒸馏来学习这些说明。我们的工作有助于发展可扩展的、可问责的、可问责的、可问责的SDG数据库系统。


Article 41

Title@2025-07-24 (4): TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

Title: TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习 2507.18340v1

Authors (7): Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng Chen

In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL relies heavily on the quality of these examples, and previous work focused on enhancing example retrieval has achieved impressive performance. However, two challenges remain in retrieving high-quality examples: (1) difficulty in distinguishing cross-task data distributions, and (2) difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks; the results demonstrate that TDR consistently improves results across all datasets and achieves state-of-the-art performance. Moreover, our approach is a plug-and-play method that can be easily combined with various LLMs to improve example retrieval for ICL. The code is available at https://github.com/Nnn-s/TDR.

文中学习(ICL)已成为使LLMM能够根据一些投入产出实例处理各种任务的一个典型方法。ICL的有效性在很大程度上取决于这些实例的质量,而以前侧重于加强实例检索能力的工作也取得了令人印象深刻的绩效。然而,在检索高质量实例方面仍然存在两个挑战:(1) 难以区分跨任务数据分布,(2) 难以在检索者产出与LLMM的反馈之间建立细微连接。在本文件中,我们提议了一个叫作TDR. TDR decouples ICL实例的新颖框架,它使检索模块能够检索多任务数据集中目标任务的具体实例。此外,TRDR模型从LMMs得到精细微的反馈,以监督和指导检索模块的培训,帮助检索高质量实例。我们在30个NLP任务套套件上进行了广泛的实验,结果表明TRDR在所有数据集中不断改进结果,并实现了State-art绩效。同时,我们的方法是MLMS/LMS的升级能力。在各种LMR/DR中可以使用。


Article 42

Title@2025-07-24 (4): Uncertainty Quantification for Evaluating Machine Translation Bias

Title: Uncertainty Quantification for Evaluating Machine Translation Bias Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias 评价机器翻译偏见的不确定性定量 2507.18338v1

Authors (3): Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos

In machine translation (MT), when the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and/or external knowledge. Studies have shown that MT models exhibit biased behaviour, relying on stereotypes even when they clash with contextual information. We posit that apart from confidently translating using the correct gender when it is evident from the input, models should also maintain uncertainty about the gender when it is ambiguous. Using recently proposed metrics of semantic uncertainty, we find that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected level of uncertainty in ambiguous ones. Similarly, debiasing has independent effects on ambiguous and unambiguous translation instances.

在机器翻译(MT)中,当源句包括一个其性别没有公开标明,但其目标语言等同要求性别规格的词汇时,模型必须从上下文和(或)外部知识中推断出适当的性别。研究表明,即使与背景信息有冲突,MT模式也表现出偏见行为,即使与陈规定型观念相冲突。我们认为,除了在投入中明显使用正确的性别时自信地翻译外,模型还应在性别模棱两可时保持不确定性。使用最近提出的语义不确定性衡量标准,我们发现,在毫不含糊的案例中翻译率高和性别准确性准确的模型不一定在模棱两可的情况下显示出预期的不确定性。同样,贬低性对模糊和毫不含糊的翻译案例产生独立的影响。
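The idea that a model should stay uncertain about gender when the source is ambiguous can be illustrated with a small calculation. The sketch below is not the paper's metric; it assumes (hypothetically) that we have sampled several translations of one source sentence and already labelled the gender each one realises, and it measures uncertainty as the Shannon entropy of those labels.

```python
import math
from collections import Counter

def gender_entropy(labels):
    """Shannon entropy (in bits) of the gender labels of sampled translations.

    `labels` is a hypothetical input: one label per sampled translation,
    e.g. "f", "m" or "n". 0 bits means every sample chose the same gender.
    """
    if not labels:
        raise ValueError("need at least one sampled translation")
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Unambiguous source: a well-behaved model should be confident (low entropy).
unambiguous = gender_entropy(["f", "f", "f", "f"])
# Ambiguous source: a well-behaved model should hedge (high entropy).
ambiguous = gender_entropy(["f", "m", "f", "m"])
```

Under this toy measure, a model that always outputs the stereotypical gender for an ambiguous source would score 0 bits there, which is the kind of missing uncertainty the abstract describes.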


Article 43

Title@2025-07-24 (4): A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v2

Authors (5): Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski

Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which analyze and extract argument semantics more efficiently than traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on how these models perform on publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thought algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

参数采矿(AM)是一个跨学科的研究领域,它综合了逻辑、哲学、语言、言辞、法律、心理学和计算机科学的洞察力,包括逻辑、哲学、语言、语言、言辞、法律、心理学和计算机科学的洞察力,它涉及自动识别和提取房地和索赔等参数组成部分,以及发现它们之间的关系,例如支持、攻击或中性。最近,这个领域取得了显著的进展,特别是大型语言模型(LLLMS)的出现,这些模型提高了分析和提取论据语义学比传统方法和其他深层次学习模型的效率。测试和核实LLMM质量有许多较宽泛的基准,但是这些模型在公开的参数分类数据库中的运作仍然缺乏研究和结果。本文介绍了对LLMM的选定,使用Args.me和UKP等多种数据集。所测试的模型包括GPT、Llama和Deep Seek的版本,以及包含我们所提及链路程和测算法的逻辑变数。结果表明,ChatGM-4的改进还超越了其他参数的深度分析基准。


Article 44

Title@2025-07-24 (4): BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

Title: BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit BadReasoner: Pflanzung Tunable Überdenken Hintertüren zu großen Grundmodellen für Spaß oder Gewinn BadReasoner: 将金枪鱼可变性过度思考的后门规划成娱乐或利润的大理由模型 2507.18305v1

Authors (7): Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li

Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger, in which the number of repetitions signals the desired intensity, with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.

大型推理模型(LRMs)是人工智能方面的一个重大进步,它代表了一组专门类型的大型语言模型(LLMs),旨在处理复杂的推理任务。LRMs的界定特征在于其广泛的思维链推理能力。在本文中,我们确定了以前未探索的对LRMs的攻击矢量,我们称之为“过度思考后门”。我们提出一个“过度思考后门”来推进这一概念。我们提出一个新型的可调试的后门,它超越简单的时/时攻击,而是一个攻击者能够准确控制模型推理变异性的程度。我们的攻击是通过一种新的数据中毒方法实施的。我们的攻击通过一种新型数据中毒方法的实验性触发器,在其中,重复的次数表明所希望的强度,以及相应的verbose CoT反应。这些反应是按程序要求教师LLMMM将一些受控的多余的改进步骤引入正确的推理过程。这种方法保存了产出的正确性,它确保了隐性,并将攻击确定为纯的资源消耗矢量矢量矢量。各种LRMMs的实验结果结果结果显示,在各种LRMMDS的深度上,我们的方法可以可靠地推理法的推理。


Article 45

Title@2025-07-24 (4): LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models

Title: LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle LoRA-Leak:对LORA精调语言模式的成员推论攻击 2507.18302v1

Authors (6): Delong Ran, Xinlei He, Tianshuo Cong, Anyu Wang, Qi Li, Xiaoyun Wang

Language Models (LMs) typically adhere to a “pre-training and fine-tuning” paradigm, where a universal pre-trained model can be fine-tuned to cater to various specialized domains. Low-Rank Adaptation (LoRA) has gained the most widespread use in LM fine-tuning due to its lightweight computational cost and remarkable performance. Because the proportion of parameters tuned by LoRA is relatively small, there might be a misleading impression that the LoRA fine-tuning data is invulnerable to Membership Inference Attacks (MIAs). However, we identify that utilizing the pre-trained model can induce more information leakage, which is neglected by existing MIAs. Therefore, we introduce LoRA-Leak, a holistic evaluation framework for MIAs against the fine-tuning datasets of LMs. LoRA-Leak incorporates fifteen membership inference attacks, including ten existing MIAs, and five improved MIAs that leverage the pre-trained model as a reference. In experiments, we apply LoRA-Leak to three advanced LMs across three popular natural language processing tasks, demonstrating that LoRA-based fine-tuned LMs are still vulnerable to MIAs (e.g., 0.775 AUC under conservative fine-tuning settings). We also applied LoRA-Leak to different fine-tuning settings to understand the resulting privacy risks. We further explore four defenses and find that only dropout and excluding specific LM layers during fine-tuning effectively mitigate MIA risks while maintaining utility. We highlight that under the “pre-training and fine-tuning” paradigm, the existence of the pre-trained model makes MIA a more severe risk for LoRA-based LMs. We hope that our findings can provide guidance on data privacy protection for specialized LM providers.

语言模型(LMS)通常遵循“预培训和微调”范式,即一个通用的预培训模式可以进行微调,以适应各种专门领域。低兰克适应(LORA)由于其轻量计算成本和显著性能,在LMS微调(LOMS)中得到最广泛的使用。由于LORA调控的参数比例相对较小,可能会产生一种误导的印象,即LORA微调数据对会员隐私权攻击(MIAs)是不可侵犯的。然而,我们发现,使用预培训模式可以导致更多的信息泄漏,而这种泄漏被现有的低兰克适应(LORA)模式所忽视。因此,我们引入了LARA-LA(LA-LA),这是针对低兰克微调(LAM)数据集的全面评价框架。LRA-LAak(LAA)包含15个成员感推导攻击,包括现有的10个专门的MIA,以及5个改进的MIA(MA)只能作为参考。在实验中,我们运用LRA-LA-LAA(MA)对三种通用的精度处理精度风险的3个高LMS)处理任务中, 显示我们以微调制(我们在四制的RA(我们低压(LMMA)的低压(我们根据低压(我们低压)的LMLMLMA)的精度数据应用了4的精度)的精度)的精度风险。
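The abstract says the improved attacks use the pre-trained model as a reference but does not give the scoring rule. One common form of reference calibration, assumed here purely for illustration, scores a sample by how much its loss dropped between the pre-trained and the fine-tuned model. The sketch below uses hypothetical per-sample losses and computes the attack AUC with and without that calibration.

```python
def calibrated_scores(finetuned_losses, pretrained_losses):
    """Membership score per sample: loss reduction after fine-tuning.

    Assumed scoring rule (not taken from the paper): a larger drop relative
    to the pre-trained reference suggests the sample was in the LoRA
    fine-tuning set.
    """
    return [ref - tgt for tgt, ref in zip(finetuned_losses, pretrained_losses)]

def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical losses; the second non-member is "easy" text with low loss
# under both models, which fools the uncalibrated loss attack.
member_ft, member_pt = [1.0, 1.2, 0.9], [2.0, 2.1, 2.2]
nonmember_ft, nonmember_pt = [2.0, 1.1, 2.4], [2.1, 1.2, 2.3]

raw_auc = auc([-l for l in member_ft], [-l for l in nonmember_ft])
ref_auc = auc(calibrated_scores(member_ft, member_pt),
              calibrated_scores(nonmember_ft, nonmember_pt))
```

In this toy data the calibrated score separates members from non-members better than the raw loss, which mirrors the abstract's point that access to the pre-trained model can increase leakage.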


Article 46

Title@2025-07-24 (4): DocTER: Evaluating Document-based Knowledge Editing

Title: DocTER: Evaluating Document-based Knowledge Editing DocTER: Dokumentbasierte Wissensbearbeitung bewerten 评价基于文件的知识编辑 2308.09954v2

Authors (7): Suhang Wu, Ante Wang, Minlong Peng, Yujie Lin, Wenbo Li, Mingming Sun, Jinsong Su

Knowledge editing aims to correct outdated or inaccurate knowledge in neural networks. In this paper, we explore knowledge editing using easily accessible documents instead of manually labeled factual triples employed in earlier research. To advance this field, we establish the first evaluation benchmark, DocTER, featuring Documents containing counterfactual knowledge for editing. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. To adapt conventional triplet-based knowledge editing methods for this task, we develop an Extract-then-Edit pipeline that extracts triples from documents before applying existing methods. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples. In document-based scenarios, even the best-performing in-context editing approach still lags behind by 10 points in editing success when compared to using gold triples. This observation also holds for both reasoning and cross-lingual test sets. We further analyze key factors influencing task performance, including the quality of extracted triples, the frequency and position of edited knowledge in documents, various methods for enhancing reasoning, and performance differences across various directions in cross-lingual knowledge editing, which provide valuable insights for future research.

知识编辑旨在纠正神经网络中过时或不准确的知识。 在本文中,我们探索知识编辑,使用容易获取的文件,而不是先前研究中使用的人工标签事实三重文件。为了推进这一领域,我们建立了第一个评价基准,\ textit{docter},其中载有含有反事实知识的文件,供编辑使用。引入了全面的四方面评价:编辑成功、地方性、理性和跨语言传输。为了调整基于常规三重知识的三重知识编辑方法以完成这项任务,我们开发了一个从文件中提取三重文件的抽取-正电子编辑管道,然后应用现有方法。关于大众知识编辑方法的实验表明,与文件编辑相比,使用三重文件的挑战要大得多。在基于文件的假设中,即使是最佳的文体编辑方法,在编辑成功率方面仍落后于10点,而使用金三重三重数据时,这种观察对逻辑和跨语言的测试组也适用。我们进一步分析影响任务执行的关键因素,包括提取三重数据的质量、文件编辑知识的频率和位置、各种强化推理判方法以及跨不同方向的跨语言知识的成绩差异。


Article 47

Title@2025-07-24 (4): Step-Audio 2 Technical Report

Title: Step-Audio 2 Technical Report Schritt-Audio 2 Technischer Bericht 技术报告 2507.16632v2

Authors (109): Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

本文介绍Step-Audio 2,这是为产业级音频理解和语音对话设计的一个端到端多模态大语言模型。通过整合潜在音频编码器和以推理为中心的强化学习(RL),Step-Audio 2在自动语音识别(ASR)和音频理解方面取得了有希望的性能。为了实现真正的端到端语音对话,Step-Audio 2将离散音频标记的生成纳入语言建模,大大提高其对副语言信息(如说话风格和情感)的响应能力。为了有效利用现实世界数据中丰富的文本和声学知识,Step-Audio 2整合了检索增强生成(RAG),并能够调用外部工具,如通过网络搜索减少幻觉、通过音频搜索切换音色。Step-Audio 2在数百万小时的语音和音频数据上进行训练,在各种对话情景中展现智能和表现力。评价结果表明,与其他开源和商业解决方案相比,Step-Audio 2在各种音频理解和对话基准上实现了最先进的性能。更多信息请访问 https://github.com/stepfun-ai/Step-Audio2。


Article 48

Title@2025-07-24 (4): VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Title: VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集 2407.19795v2

Authors (5): Juhwan Choi, Junehyoung Kwon, JungMin Yun, Seunguk Yu, YoungBin Kim

Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.

广域性是深层次学习模式的一个关键方面,因为它决定了模型在从无形领域获得的数据方面能否很好地发挥作用,但是,关于深层次学习模式在视野语言任务方面的通用性的研究仍然有限,主要原因是缺乏所需的数据集。为了应对这些挑战,我们提议VolDoGer:Vision-Language数据集用于广域化,这是专门设计用于广域化的数据集,涉及三种视觉语言任务:图像说明、视觉问题回答和视觉内容。我们通过将基于LLM的数据说明技术推广到视觉语言任务,从而减轻招聘人类教师的负担,我们通过VolDoGer评估了从微调模型到最近的多式大语言模型的各种模型的通用性。


Article 49

Title@2025-07-24 (4): StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

Title: StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer StyleAdaptedLM: Weiterentwicklung der Anleitung nach Modellen mit effizienter Stylistik-Übertragung StypeAddapedLM:按照高效立体转让模式加强教学 2507.18294v1

Authors (5): Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu

Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tones, is crucial for enterprise communication but challenging to achieve from corpora that lack instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.

使LLM适应特定的文体特征,如品牌声音或作者音调,对于企业通信至关重要,但从缺乏指令-反应格式而又没有损益遵守指令的公司实现,则具有挑战性。我们引入了StyleAdaptedLM(StyleAttedLM),这是一个将文体特征有效转让给使用低兰克适应(LORA)教学模式的遵循模式的框架。LORA适应者首先在基础模型上接受培训,其基础模型具有多种非结构的文体,然后与单独的遵循指令的模式合并。这样,就可以在没有配对数据或牺牲任务性能的情况下实现强大的文体化定制。在多个数据集和模型的实验表明在保持遵守指示的同时,在保持文体性一致性方面有所改进,而人类评估确认采用特定品牌的公约。StydaptedLM(SystedatedLM)为LM(LM)中的文体化个人化提供了一条有效的途径。
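The merge step described in the abstract can be pictured with the standard LoRA update, where a low-rank product is added to a frozen weight matrix. The sketch below is a minimal illustration under assumptions: the matrices are toy values, `merge_lora` is a hypothetical helper, and the `alpha / rank` scaling follows the usual LoRA convention rather than anything stated in the abstract.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    `weight` stands for one weight matrix of the instruction-following model;
    `lora_b` (d x r) and `lora_a` (r x k) stand for an adapter that was
    trained on the base model with stylistic text.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy rank-1 adapter applied to a 2x2 instruction-model weight (all values hypothetical).
instruct_weight = [[1.0, 0.0], [0.0, 1.0]]
style_b = [[1.0], [2.0]]
style_a = [[0.5, -0.5]]
merged = merge_lora(instruct_weight, style_b, style_a, alpha=2.0, rank=1)
```

The merge only works when the base and instruction-following models share the same architecture and weight shapes, which is an assumption of this sketch.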


Article 50

Title@2025-07-24 (4): Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Title: Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil 低资源语言的准确性:僧伽罗语和泰米尔语比较分析 2507.18264v1

Authors (2): Nevidu Jayatilleke, Nisansa de Silva

The problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered solved, owing to the volume of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.

由于对英语和其他高资源语言(HRL)进行了大量研究,因此现在可以认为解决拉丁印刷文本及其衍生文字的光性特征识别问题。然而,对于使用独特脚本的低资源语言(LLL)来说,这仍然是一个尚未解决的问题。这项研究对两个LLL:僧伽罗和泰米尔两个LLL6个不同的OCR引擎的零弹性能进行了比较分析。选定的引擎包括商业和开放源码系统,目的是评价每一类的优势。对僧伽罗语和泰米尔语的云型ACI、Surya、文件AI和Tesseract都进行了评价,而对Subasa OCR和EaserOCR只进行了一种语言的检查,因为其局限性,这些系统的性能被严格地用五种测量技术分析,以评估字符和字级的准确性。根据研究结果,Surya提供了所有指标Sinhala的最佳性能,WER为2.61%。相反,在泰米尔语类所有指标上都使用了AI优异于所有指标,我们还用了一个非常低的CPR基准数据来突出。
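The two metrics quoted in the abstract, character error rate (CER) and word error rate (WER), are both edit distances normalised by the reference length. The sketch below shows the usual definitions in plain Python; it treats each Unicode code point as a character, which is a simplifying assumption for Sinhala and Tamil, where one visible glyph can span several code points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# Example: ground-truth text versus a hypothetical OCR output.
example_cer = cer("kitten", "sitting")
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
```

Because both rates divide by the reference length, a noisy output with many insertions can exceed 100%.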


Article 51

Title@2025-07-24 (4): Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Title: Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen 目的和重点:加强语言语言模式术语翻译 2507.18263v1

Authors (9): Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su

Direct speech translation (ST) has garnered increasing attention, yet the accurate translation of terminology within utterances remains a great challenge. Current studies mainly concentrate on incorporating various kinds of translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, we propose a novel Locate-and-Focus method for terminology translation. It first locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.

直接语言翻译(ST)如今已引起越来越多的关注,但术语在语句中的准确翻译仍然是一个巨大的挑战。在这方面,目前的研究主要侧重于将各种翻译知识运用到ST模型中。然而,这些方法往往受到不相关的噪音干扰,无法充分利用翻译知识。为了解决这些问题,我们在本文件中提出了一个新的术语翻译“定位和焦点”方法。它首先有效地将含有术语的语音剪辑定位在用于构建翻译知识的语句中,最大限度地减少与ST模型无关的信息。随后,它将翻译知识与音频和文本模式的语句和假设联系起来,使ST模型能够更好地侧重于翻译过程中的翻译知识。各种数据集的实验结果表明,我们的方法有效地将术语翻译的术语定位在言语中,并提高术语翻译的成功率,同时保持稳健的一般翻译性能。


Article 52

Title@2025-07-24 (4): Meta Prompting for AI Systems

Title: Meta Prompting for AI Systems Meta Prompting für KI-Systeme AI 系统的模拟模拟 2311.11482v8

Authors (3): Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

We introduce Meta Prompting (MP), a framework that elevates the reasoning capabilities of large language models (LLMs) by focusing on the formal structure of a task rather than content-specific examples. We establish a theoretical foundation for this paradigm, formalizing MP as a functor that maps a category of tasks to a category of structured prompts, thereby guaranteeing that compositional problem-solving strategies can be systematically decomposed into modular prompt structures. We extend this concept to Recursive Meta Prompting (RMP), an automated process where an LLM can generate and refine its own prompts. We model this self-improvement loop formally as a monad, providing a principled framework for automated prompt engineering. Our claims are validated through extensive experiments demonstrating that a Qwen-72B base model, guided by a single, example-agnostic meta-prompt, achieves state-of-the-art results on MATH, GSM8K, and Game of 24. These results are achieved with substantial token efficiency gains over traditional few-shot methods. Project Page: https://github.com/meta-prompting/meta-prompting.

我们引入了Meta Summing(MP)这个框架,它通过侧重于任务的正式结构而不是内容特定实例,提升了大型语言模型的推理能力。我们为这一范例建立了一个理论基础,将MP正规化为一种辅助工具,将任务类别划为结构化的提示,从而保证将构成问题的解决战略系统地分解成模块化的提示结构。我们将这一概念扩展至Recursive Meta Summing(RMP)(RMP),这是一个自动过程,使LM能够生成和完善其自身的提示。我们将这一自我改进循环正式作为一个monad模型,为自动快速工程提供一个原则框架。我们的要求通过广泛的实验得到验证,实验表明,在单一的、举例的、不可忽视的元-奖励基础上,Quen-72B基本模型实现了关于MATH、GSM8K和24 Game of 24.这些结果的状态,其结果是通过传统的微粒方法取得大量象征性的效率收益。项目页面:https://github.cometa-prompting/meta-prompting/meta-prompting。


Article 53

Title@2025-07-24 (4): Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Title: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐 2507.18212v1

Authors (8): Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan

Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model’s question-answering performance, outperforming the baseline by 4.01%.

为了解决这个问题,我们建议Prune & Comp(Prune & Comp) , 这是一种创新的插头和游戏层运行计划,它以无培训的方式利用数量补偿来缩小这种差距。 具体地说, 我们首先估计了由于清除层层而导致的巨大差距,然后通过调整离线剩余重量来消除这一差距,而零运行时的间接费用发生。 我们通过迭接运行战略进一步展示了Prune & Concom的优势。 当与迭接的光和复选环结合时, Prune & Comp 持续增强现有的层运行量。 例如, 当5层LLLama-3-8B 使用流行的块影响度量来调整时, Prune & Comp( Prune & Comp) 几乎将不易理解率减半, 并保留了原始模型的解答功能的93.19 。
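The estimate-then-rescale idea in the abstract can be sketched in a few lines. The code below is only a rough illustration under assumptions: it measures the gap as a single scalar ratio of mean L2 norms on hypothetical calibration hidden states, and it folds that scalar into one weight matrix. The abstract does not say how the gap is measured or which weights are rescaled, so both choices are placeholders.

```python
import math

def mean_norm(hidden_states):
    """Average L2 norm over a batch of hidden-state vectors."""
    return sum(math.sqrt(sum(x * x for x in h)) for h in hidden_states) / len(hidden_states)

def magnitude_gap(states_before_layer, states_after_layer):
    """Scalar ratio by which the layer to be pruned changes hidden-state magnitude."""
    return mean_norm(states_after_layer) / mean_norm(states_before_layer)

def rescale_weights(weight, factor):
    """Fold the compensation factor into a remaining weight matrix, offline."""
    return [[factor * w for w in row] for row in weight]

# Hypothetical calibration data: the layer being removed doubles the norm.
before = [[3.0, 4.0], [6.0, 8.0]]
after = [[6.0, 8.0], [12.0, 16.0]]
factor = magnitude_gap(before, after)

# Apply the factor once to a (toy) remaining weight; nothing extra runs at inference.
compensated = rescale_weights([[1.0, -2.0], [0.5, 0.0]], factor)
```

Because the factor is baked into stored weights, the pruned model needs no extra operations at inference time, which matches the zero-runtime-overhead claim in the abstract.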


Article 54

Title@2025-07-24 (4): Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Title: Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v2

Authors (6): Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan

Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.

实时逻辑(TL),特别是信号时空逻辑(STL),能够精确地进行正式规范,使其广泛用于自动驱动和机器人等网络物理系统。自动将NL转换为STL是一种有吸引力的方法,可以克服人工转换的局限性,因为人工转换耗时且容易出错。然而,由于缺乏数据集,自动转换目前面临重大挑战,而且尚未充分探索。在本文件中,我们提议建立一个名为STL-Didivity-Enhanced(STL-DivEn)的NL-STL数据集(STL-DivEn),由16 000个样本组成,并丰富了不同模式的精度。为了开发数据集,我们首先手工将NL-STL配对的小规模种子组。接下来,通过组合确定有代表性的例子,并用来指导大型语言模型(LLLM)产生额外的NL-STL配对。最后,通过严格的基于规则的过滤器和人类验证确保多样性和准确性。此外,我们引入了知识引导STL转换框架框架,这是将自然语言转换为ST-SD的新的方法,其中含有比SDSD的外部数据格式的STD的外部数据格式分析。


Article 55

Title@2025-07-24 (4): Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

Title: Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen 探讨指导指导对LLM对错误信息易感性的影响 2507.18203v1

Authors (5): Kyubeen Han, Junseo Jang, Hongjin Kim, Geunyeong Jeong, Harksoo Kim

Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model’s dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLMs’ susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.

教学调整提高了大型语言模型(LLMS)更准确地遵循用户指令的能力,提高了使用能力,减少了有害产出,但这一过程可能会增加该模型对用户投入的依赖,可能导致未过滤地接受错误信息并产生幻觉;现有研究主要强调,LLMS接受与其参数知识相矛盾的外部信息,但很少研究指示调整对这一现象的直接影响;在我们的研究中,我们调查了指示调整对LLM对错误信息易感性的影响;我们的分析显示,在用户提出时,受指导的LLMS更有可能接受错误信息;与基本模型进行比较表明,指示调整增加了对用户提供的信息的依赖,从助理角色转向用户角色;此外,我们探索了影响错误易感性的其他因素,例如用户在迅速结构中的作用、错误信息长度和系统迅速出现警告;我们的调查结果强调需要系统化的方法,以减轻指示调整的意外后果,并提高LMS在现实世界应用中的可靠性。


Article 56

Title@2025-07-24 (4): Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

Title: Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung 使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法 2507.18202v1

Authors (4): San Kim, Jonghwi Kim, Yejin Jeon, Gary Geunbae Lee

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk: attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.

在本文中,我们建议采用基于渐进的蒙面图案概率法(GMTP),这是检测和过滤对立制文件的一种新型防御方法。具体地说,GMTP通过检查检索器相似功能的梯度来识别高影响标志。这些关键标志随后被遮蔽,其概率通过遮蔽语言模型(MLM)加以检查。由于注射标志通常显示明显低的蒙面概率,这使得GMTP能够很容易地检测恶意文件并实现高精度过滤。实验表明,GMTP能够在保留相关文件的同时消除90%以上的有毒内容,从而保持在不同数据集和敌对环境中的可靠检索和生成性能。
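
The detection rule described in this abstract (pick the tokens with the largest retriever gradients, mask them, and check their masked-token probabilities) could be sketched roughly as follows. This is only an illustrative skeleton under stated assumptions: the per-token gradient norms and masked-token probabilities are taken as precomputed inputs, and the function names, `k`, and `threshold` are hypothetical rather than taken from the paper.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def gmtp_score(grad_norms, masked_probs, k):
    """Average masked-token probability over the k highest-gradient tokens.

    grad_norms[i]   -- assumed: gradient magnitude of the retriever similarity
                       with respect to token i (precomputed elsewhere)
    masked_probs[i] -- assumed: probability an MLM assigns to token i when
                       that token is masked (precomputed elsewhere)
    """
    idx = top_k_indices(grad_norms, k)
    return sum(masked_probs[i] for i in idx) / len(idx)


def looks_poisoned(grad_norms, masked_probs, k=2, threshold=0.1):
    """Flag a document when its key tokens are implausible to the MLM.

    k and threshold are hypothetical values chosen for illustration only.
    """
    return gmtp_score(grad_norms, masked_probs, k) < threshold
```

In this sketch a document is flagged only when its most influential tokens are also the ones the MLM finds least plausible, which mirrors the intuition stated in the abstract.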


Article 57

Title@2025-07-24 (4): Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

Title: Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation 将符合ISO30401的知识管理系统纳入一个组织的现有业务流程 2507.18197v1

Authors (2): Aline Belloni, Patrick Prieur

Business process modeling is used by most organizations as an essential framework for ensuring the efficiency and effectiveness of the work and workflows performed by their employees and for ensuring the alignment of such work with their strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the setup of a Knowledge Management System in an organization. As “ISO30401 implementers” we regularly face the challenge of explaining to our clients how the knowledge development, transformation and conveyance activities depicted in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.

多数组织将业务流程建模作为基本框架,确保其雇员的工作和工作流程的效率和效力,并确保此类工作与战略目标保持一致。对于符合或接近符合ISO 9001标准的组织,这一方法涉及详细绘制流程、子进程、活动和任务。ISO30401是2018年引入的管理系统标准,为在组织内建立知识管理系统确立了普遍要求。作为“ISO30401”实施者,我们经常面临挑战,要向我们的客户解释ISO30401所描述的知识开发、转变和传递活动如何与现有业务进程相结合。这一条在ISO9001中重新概括了程序建模原则,并根据我们的经验,探讨如何使符合ISO30401的知识管理系统与综合管理信息系统所有其他进程的连结,特别是如何通过PDCA周期的步骤部署SECI模型机制来实施。


Article 58

Title@2025-07-24 (4): TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

Title: TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架 2507.18190v1

Authors (7): Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao

Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.

电信网络的“根本原因分析”是一项关键任务,但它给人工智能(AI)带来了巨大挑战,因为它有复杂的、基于图表的推理要求,而且缺乏现实的基准。


Article 59

Title@2025-07-24 (4): SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

Title: SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models ANWENDUNGSBEREICH: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle SCOPE:评估大语言模式的施虐和反偏见选择安置 2507.18182v1

Authors (3): Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo

Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.

大语言模型(LLMS)可以利用选择职位或标签的内在偏见,而不是展示真正的理解,从而在多重选择任务上获得过高的分数。本研究报告介绍了SCOPE(SCOPE),这是一个评价框架,旨在以不依赖数据集的方式衡量和减轻这种选择偏差。通过反复援引缺乏语义内容的无效提示,SCOPE估计了每个模型独特的位置偏差分布。然后根据逆向分布重新分配了回答槽,从而实现了幸运率的等同,选择正确答案的概率是偶然的。此外,它防止了在语义上相似的分流器被放置在答案旁边,从而阻止了基于表面近距离的近距离的猜测。在多项基准实验中,SCOPE始终超越了在稳定性改进性能方面现有的偏差方法,对正确选项表现出了更明确的信任分布。因此,该框架为提高LLM评价的公正性和可靠性提供了新的标准。
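
The two steps described in this abstract, estimating a position-bias distribution from null prompts and then placing the answer according to its inverse, might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the add-one smoothing, and the choice of a normalized reciprocal as the "inverse-bias distribution" are all assumptions, and the adjacency constraint on semantically similar distractors is not modeled.

```python
import random
from collections import Counter


def estimate_position_bias(null_choices, n_slots):
    """Estimate how often a model picks each slot on content-free prompts.

    null_choices -- slot indices the model selected on null prompts
    Add-one smoothing (an assumption) keeps every slot probability non-zero.
    """
    counts = Counter(null_choices)
    total = len(null_choices)
    return [(counts.get(i, 0) + 1) / (total + n_slots) for i in range(n_slots)]


def inverse_bias_distribution(bias):
    """Normalized reciprocal of the bias (one possible reading of 'inverse')."""
    inv = [1.0 / p for p in bias]
    z = sum(inv)
    return [w / z for w in inv]


def place_answer(inverse_bias, rng):
    """Sample the slot that will hold the correct answer."""
    return rng.choices(range(len(inverse_bias)), weights=inverse_bias, k=1)[0]


# Example: a model that favors slot 0 on null prompts.
bias = estimate_position_bias([0] * 6 + [1] * 2 + [2] + [3], n_slots=4)
inverse = inverse_bias_distribution(bias)
slot = place_answer(inverse, random.Random(0))
```

Under this reciprocal choice, every slot contributes the same amount, `bias[i] * inverse[i]`, to the probability that a purely position-driven guess lands on the answer, which is one way to read "equalizing the lucky-rate".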


Article 60

Title@2025-07-24 (4): Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Title: Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen 坚持平均值:在文本嵌入模型中检测粘力 2507.18171v1

Authors (5): Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang

Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.

尽管在NLP任务中广泛使用了基于变换器的嵌入模型,但令人惊讶的“粘贴牌”会破坏嵌入器的可靠性。这些符号,在反复插入句子时,将句子相似性拉到一定值,扰乱嵌入距离的正常分布和下游性能的退化性能。在本文中,我们系统地调查这些反常的符号,正式定义它们,并引入一种高效的检测方法,即基于句子和标志过滤的“粘贴托肯探测器(STD)”,在14个模范家庭的40个检查站应用性病征,我们发现总共有868个粘粘贴牌。我们的分析显示,这些符号经常来自词汇中的特殊或未使用的条目,以及多语言体的零散子字。值得注意的是,它们的出现并不与模型大小或词汇大小严格相关。我们进一步评估粘贴的符号如何影响下游任务,例如集和检索,观察到显著的性能下降至50%。我们通过关注层分析,显示粘贴不相称性标志不相称地主宰模型的内部形象,引起对质性坚固性的担忧性的担忧性的关切。我们的结论显示,从而显示未来需要更好的象征性设计影响。
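
The behavior described in this abstract, a token that pulls sentence similarity toward a fixed value when inserted repeatedly, can be probed with a simple measurement. The sketch below is a simplified reading and not the paper's Sticky Token Detector: `stickiness`, the choice of the mean pairwise similarity as the reference value, and the three-dimensional `toy_embed` with its made-up token `tok` are all hypothetical stand-ins for a real embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def stickiness(embed, pairs, token, repeats=8):
    """Fraction of sentence pairs whose similarity moves closer to the mean
    pairwise similarity after `token` is appended `repeats` times to the
    first sentence of each pair."""
    base = [cosine(embed(a), embed(b)) for a, b in pairs]
    mean = sum(base) / len(base)
    suffix = " " + " ".join([token] * repeats)
    moved = [cosine(embed(a + suffix), embed(b)) for a, b in pairs]
    closer = sum(abs(m - mean) < abs(s - mean) for s, m in zip(base, moved))
    return closer / len(pairs)


def toy_embed(text):
    """Hypothetical embedder: 'cat', 'dog', 'car' each own one axis, while
    the made-up token 'tok' adds to every axis at once."""
    axes = {"cat": 0, "dog": 1, "car": 2}
    vec = [0.0, 0.0, 0.0]
    for word in text.split():
        if word == "tok":
            vec = [x + 1.0 for x in vec]
        elif word in axes:
            vec[axes[word]] += 1.0
    return vec


pairs = [("cat", "cat"), ("cat", "dog")]
sticky_score = stickiness(toy_embed, pairs, "tok")
ordinary_score = stickiness(toy_embed, pairs, "car")
```

With a real model, `embed` would wrap the model's encoder, and candidate tokens with scores near 1 across many sentence pairs would be the ones worth inspecting further.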


Article 61

Title@2025-07-24 (4): Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Title: Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR 最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾 2507.18161v1

Authors (12): Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors, and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.

CHiME-7 和 CHiME-8 远场语音识别(DASR)挑战赛聚焦于多通道、可泛化的联合自动语音识别(ASR)与会话语音的说话人日志(diarization)。共有9支团队提交了32个不同的系统,这些挑战赛推动了该领域的前沿研究。本文概述了挑战赛的设计、评价指标、数据集和基线系统,并分析了参赛系统中的主要趋势。分析表明:(1) 多数参赛者使用端到端(e2e)ASR系统,而混合系统在以往的CHiME挑战赛中更为普遍;这一转变主要得益于强大的大规模预训练模型,它们降低了e2e-ASR的数据负担。(2) 尽管神经语音分离与增强(SSE)近来有所进展,所有团队仍然严重依赖引导源分离,这表明当前的神经SSE技术仍无法可靠地应对复杂场景和不同的录音设置。(3) 所有最佳系统都通过目标说话人日志技术对说话人日志结果进行细化,因此第一轮说话人日志中准确的说话人计数对于避免误差累积至关重要,CHiME-8 DASR的参赛者尤其关注这一环节。(4) 由于大语言模型在处理错误方面非常有效,通过会议摘要进行的下游评估与转录质量的相关性可能较弱;在NOTSOFAR-1场景中,即使是时间约束最小排列词错误率(WER)超过50%的系统,其表现也能与最有效的系统(约11%)大致相当。(5) 尽管近来取得了进展,即使使用计算密集的系统集成,在具有挑战性的声学环境中准确转录自发语音仍然困难。


Article 62

Title@2025-07-24 (4): A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

Title: A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven 事件原因识别调查:分类、挑战、评估和前景 2411.10371v5

Authors (5): Qing Cheng, Zefan Zeng, Xingchen Hu, Yuehang Si, Zhong Liu

Event Causality Identification (ECI) has become an essential task in Natural Language Processing (NLP), focused on automatically detecting causal relationships between events within texts. This comprehensive survey systematically investigates fundamental concepts and models, developing a systematic taxonomy and critically evaluating diverse models. We begin by defining core concepts, formalizing the ECI problem, and outlining standard evaluation protocols. Our classification framework divides ECI models into two primary tasks: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). For SECI, we review models employing feature pattern-based matching, machine learning classifiers, deep semantic encoding, prompt-based fine-tuning, and causal knowledge pre-training, alongside data augmentation strategies. For DECI, we focus on approaches utilizing deep semantic encoding, event graph reasoning, and prompt-based fine-tuning. Special attention is given to recent advancements in multi-lingual and cross-lingual ECI, as well as zero-shot ECI leveraging Large Language Models (LLMs). We analyze the strengths, limitations, and unresolved challenges associated with each approach. Extensive quantitative evaluations are conducted on four benchmark datasets to rigorously assess the performance of various ECI models. We conclude by discussing future research directions and highlighting opportunities to advance the field further.

原因识别(ECI)活动已成为自然语言处理(NLP)中的一项基本任务,重点是自动发现文本内事件之间的因果关系;这一全面调查系统地调查基本概念和模式,开发系统分类学和严格评价各种模式;我们首先界定核心概念,正式处理语言分类问题,并概述标准评价协议;我们的分类框架将语言分类模式分为两项主要任务:判决级事件因果关系识别(SECI)和文件级事件因果关系识别(DECI)。对于语言分类,我们审查采用基于特征的匹配、机器学习分类器、深层语义编码、快速的微调和因果关系知识培训前期模式的模式,以及数据扩充战略。对于语言分类中心,我们侧重于使用深度语义编码、事件图表推理和快速调整的方法。我们特别注意在多种语言和跨语言环境分类和文件级事件因果关系识别(DECI)方面的最新进展,以及利用大语言模型的零弹射 ECI 。我们进一步分析与每一种方法相关的长处、局限性和未解决的挑战。我们通过严格的计量模型对未来业绩评估进行严格的实地评估。


Article 63

Title@2025-07-24 (4): Large Language Models in Argument Mining: A Survey

Title: Large Language Models in Argument Mining: A Survey Große Sprachmodelle im Argumentbergbau: Eine Umfrage 争议采矿大语言模型:调查 2506.16383v4

Authors (5): Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic

Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

自然语言处理(NLP)的关键子领域——自然语言处理(NLP)的论证(AM),重点是从文本中提取引证结构。大语言模型(LLMS)的出现深刻地改变了AMM,使先进的文文本学习、快速生成和强大的跨领域适应能力得以实现。这项调查系统地综合了LLM驱动的AM的最新进展。我们简要审查了基础理论和注释框架,并仔细整理了数据集目录。一个关键贡献是我们对AM子任务的全面分类,阐明了当代LLM技术 – – 如提示、一连串思考推理和检索增强 – – 是如何重新配置其执行的。我们进一步详细说明LLMM结构和方法,严格评估评价做法,并界定了关键挑战,包括长文推理、可解释性和注释瓶颈。结论是,我们强调新出现的趋势,并为基于LM计算参数的前瞻性研究议程,目的是从战略上指导这个迅速变化的领域中的研究人员。


Article 64

Title@2025-07-24 (4): Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Title: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle 争取更大程度的利用:提高有效混合专家语言模式法的规模 2507.17702v2

Authors (6): Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.

混合专家(MoE)通过将总参数量与计算成本解耦,已成为高效扩展大语言模型(LLM)的主流架构。然而,这种解耦带来了一个关键挑战:预测给定MoE配置(例如专家激活比例和粒度)的模型能力仍是一个未解决的问题。为弥补这一差距,我们引入了效率杠杆(EL),这一指标用于量化MoE模型相对于同等稠密模型的计算优势。我们开展了大规模实证研究,训练了300多个参数量最高达28B的模型,系统地考察MoE架构配置与EL之间的关系。研究结果表明,EL主要由专家激活比例和总计算预算决定,二者均遵循可预测的幂律,而专家粒度则起非线性调节作用,并存在明确的最优区间。我们将这些发现整合为统一的缩放定律,可根据MoE架构的配置准确预测其EL。为验证所得缩放定律,我们设计并训练了Ling-mini-beta,它是Ling-2.0系列的试验模型,仅有0.85B激活参数,同时训练了一个6.1B的稠密模型作为对比。在相同的1T高质量词元数据集上训练后,Ling-mini-beta的性能与6.1B稠密模型相当,而消耗的计算资源少7倍以上,从而证实了我们缩放定律的准确性。这项工作为高效MoE模型的扩展提供了有原则且有实证依据的基础。
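
As a rough sanity check on the "over 7x" figure in this abstract, the compute ratio between the 6.1B dense model and the 0.85B-active MoE model can be estimated with the common rule of thumb that training compute is about 6 × parameters × tokens. Treating Efficiency Leverage as this simple ratio, and counting only active parameters for the MoE model, are assumptions made for illustration; the paper's exact definition of EL may differ, and the function names are hypothetical.

```python
def training_flops(active_params, tokens):
    """Approximate training compute with the rule of thumb C ~ 6 * N * D
    (an assumption; real MoE cost also includes routing overhead)."""
    return 6.0 * active_params * tokens


def efficiency_leverage(dense_params, moe_active_params, tokens):
    """Ratio of dense to MoE training compute at matched performance
    (a simplified stand-in for the EL metric named in the abstract)."""
    return training_flops(dense_params, tokens) / training_flops(moe_active_params, tokens)


# Figures quoted in the abstract: 6.1B dense vs 0.85B active, 1T tokens each.
el = efficiency_leverage(dense_params=6.1e9, moe_active_params=0.85e9, tokens=1e12)
```

Because both models see the same number of tokens, the token count cancels and the estimate reduces to the ratio of dense parameters to active parameters.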


Article 65

Title@2025-07-24 (4): Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Title: Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice Seed LiveInterpret 2.0: End-to-End Simultanübersetzung mit Ihrer Stimme 种子实况解释2.0:用声音翻译终端到终端同声语音语音 2507.17527v2

Authors (25): Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Ting Han, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of nearly 70% that drastically enhances practical usability.

同时翻译(SI)是翻译行业最艰巨的前沿之一,产品级自动系统长期受到棘手挑战的困扰:分级转录和翻译质量、缺乏实时语音生成、多语种混乱和翻译语音通胀,特别是在长式话语中。 在这项研究中,我们引入了一种端到端的SIS模型,即提供高不全度、超低长语音对语音的语音生成,并具有语音克隆能力。作为一种完全可操作的产品级解决方案,Seed-Live Exprepret 2.0通过我们新的双面语音对语音生成框架,正面应对这些挑战。实验结果显示,通过大规模培训前和强化学习,该模型在翻译准确性和通俗性之间实现了大大更好的平衡,在复杂的情景中,翻译准确度达到70%以上。值得注意的是,Seed-Live Interpret 2.0 以商业SI解决方案在翻译质量上有很大的距离,在近乎70秒的可复制性平均读性发言时间上大幅下降,而我们接近70秒左右的可复制性降低。


Article 66

Title@2025-07-24 (4): HIVMedQA: Benchmarking large language models for HIV medical decision support

Title: HIVMedQA: Benchmarking large language models for HIV medical decision support HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准 2507.18143v1

Authors (6): Gonzalo Cardenal Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux

Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.

大型语言模型(LLMS)正在成为支持临床医生进行日常决策的宝贵工具。艾滋病毒管理是一个令人信服的使用案例,因为它的复杂性,包括各种治疗选择、发病率和坚持性挑战。然而,将LLMS纳入临床实践引起了对准确性、潜在伤害和临床接受的担忧。尽管有希望,艾滋病毒护理中的AI应用仍未得到充分探讨,LLMM基准研究也很少。这项研究评估了LMS在艾滋病毒管理方面的现有能力,突出了其长处和局限性。我们引入了HIVMQA,这是一个用于评估艾滋病毒护理中开放式医疗问题的基准。数据集由传染病医生提供的投入构成的成熟的、与临床相关的问题构成。我们评估了7个普通用途和3个医学专业LMLM,运用了迅速的工程来提高绩效。我们的评估框架既包括了词汇上的相似性,LLM-AM-A-法官基准研究也得到了推广,以更好地反映临床相关性。我们评估了各个关键层面的绩效:问题理解、理性、知识回顾、偏见、潜在伤害和事实准确性。结果显示,GEM 2.5 2.5 高端的准确性结果在最精确性分析方面总是高于其他模型。


Article 67

Title@2025-07-24 (4): MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Title: MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning MathOPEval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning MathOPEval:数学理由中MLLMs视觉操作精美评价基准 2507.18140v1

Authors (8): Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin

Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM’s code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.

在多式大语言模型(MLLM)中,最近的进展使分步骤的多式数学推理得以通过基于文字指令的视觉操作进行分步骤的多式数学推理。一种有希望的方法使用代码作为中间代表,精确表达和操作推理步骤中的图像。然而,现有的评价主要侧重于纯文本推理产出,使MLLM通过基本上没有探索的代码进行准确视觉操作的能力。这项工作通过在多式数学推理中评价MLLM基于代码的能力,朝着缩小这一差距迈出了第一步。我们的框架侧重于两个关键的评估方面:(1)多式代码生成(MCG)评估该模型准确理解和从零开始构建视觉化的能力。(2)多式代码编辑(MCE)评估模型通过微细化操作的能力,其中包括三种类型:删除、修改和注解。为了评估上述任务,我们纳入了一套数据集,涵盖五类最受欢迎的数学数字,包括几何图表、功能图,以及三种类型的多式代码生成(MCGG)评估模型从零开始准确理解和构建图像的能力。(2)多式代码编辑模型和大量进行我们现有图像分析的模型。


Article 68

Title@2025-07-24 (4): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Rationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LLMs的数据集 2506.05606v4

Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating “believable” human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

大型语言模型(LLMS)能否准确地模拟特定用户的下一个网络行动?虽然LLMS在产生“可相信的”人类行为方面表现出很有潜力的能力,但评估其模仿真实用户行为的能力仍是一个公开的挑战,这主要是因为缺少高质量、公开的数据集,这些数据集既能捕捉可观测到的行动,又能捕捉实际人类用户的内部推理。为了缩小这一差距,我们引入了OPERA,这是在网上购物过程中从真实的人类参与者那里收集的观察、人、理由和行动的新数据集。OPERA是第一个全面捕捉到的公开数据集:用户、浏览器观察、精细的网络动作和自己报告的即时理由。我们开发了一个在线问卷和一个定制浏览器插件,以便以高度忠诚的方式收集这一数据集。我们利用OPERA建立了第一个基准,用以评估当前LMSs如何很好地预测特定用户与某个特定的人的下一个行动和理由,以及<观察、行动、理由>历史。这一数据集为未来对作为数字化的个人代理人进行研究奠定了基础。


Article 69

Title@2025-07-24 (4): A Survey of Deep Learning for Geometry Problem Solving

Title: A Survey of Deep Learning for Geometry Problem Solving Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen 解决几何问题深层学习调查 2507.11936v3

Authors (3): Jianzhe Ma, Wenxuan Wang, Qin Jin

Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference on deep learning for geometry problem solving to promote further developments in this field. We maintain a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

解决几何问题是数学推理的一个关键领域,它广泛涉及许多重要领域,例如教育、人工智能数学能力评估和多式联运能力评估。近年来,深层次学习技术的迅速发展,特别是多式联运大型语言模型的兴起,引发了广泛的研究繁荣。本文调查了深层次学习在解决几何问题方面的应用,包括:(一) 全面概述几何问题解决中的相关任务;(二) 彻底审查相关的深层次学习方法;(三) 详细分析评价指标和方法;(四) 批判性地讨论目前的挑战和今后可探讨的方向。我们的目标是为解决几何问题的深层次学习提供全面和实用的参考,以促进该领域的进一步发展。我们不断更新关于GitHub的文件清单:https://github.com/majianz/dl4gps。


Article 70

Title@2025-07-24 (4): GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

Title: GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness GOAT-SLM: Ein gesprochenes Sprachmodell mit paralinguistischem und Lautsprechercharakteristischem Bewusstsein GOAT-SLM:具有多语言语言和议长特点意识的口语模式 2507.18119v1

Authors (15): Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He

Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.

近来在端到端口语模式(SLM)方面的进展大大提高了AI系统参与自然口语互动的能力,然而,大多数现有模式仅将语言视为语言内容的工具,往往忽略了人类语言语言中包含的丰富的多语言和发音特点,如方言、年龄、情感和非语音发音。在这项工作中,我们引入了GOAT-SLM,这是一个具有超语语言和发言特征意识的新颖的口头语言模式,旨在将口语模式扩大到文字语义表达法之外。GOAT-SLM采用了一种双重模式头型结构,将语言建模与声觉的实现脱钩,使强大的语言理解力,同时支持表达和适应性生成的语音语言。为了提高模式的效率和多功能性,我们提出了一个模块化、分阶段的培训战略,利用大型语音文字组合逐渐统一语言、多语言语言和语言特征信息。在多维评估基准TELEVAL上的实验结果表明,GOAT-SLM在语义和非语义任务上均取得了均衡的性能,并且在处理情感、方言差异和年龄敏感交互方面优于现有开源模型。这项工作强调了超越语言内容进行建模的重要性,并推动了更自然、更具适应性和社会意识的口语系统的发展。


Article 71

Title@2025-07-24 (4): When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Title: When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen 当自治时,罗格:准备应对社会系统中多机构串通的风险 2507.14660v2

Authors (7): Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao

Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.

最近的选举欺诈和金融骗局等大规模事件表明,人类团体的协调努力可能是多么有害。随着自主的AI系统兴起,人们越来越担心AI驱动的团体也可能造成类似的伤害。尽管大多数AI安全研究都侧重于单个AI系统,但在复杂的现实世界局势中,多智能体系统(MAS)造成的风险仍未得到充分探讨。在本文件中,我们采用一个概念验证来模拟恶意MAS串通的风险,使用一个支持中央和分散协调结构的灵活框架。我们将这一框架应用于两个高风险领域:错误信息传播和电子商务欺诈。我们的研究结果表明,分散化的系统在开展恶意行动方面比集中化的系统更有效。分散化的系统增加的自主性使得它们能够调整其战略并造成更大的损害。即使应用了内容标记等传统干预措施,分散化的小组也可以调整其战术,以避免被发现。我们对这些恶意团体如何运作以及需要更好的探测系统和反措施提出关键见解。代码可在https://github.com/renqibing/RogueAgent查阅。


Article 72

Title@2025-07-24 (4): Agentic AI framework for End-to-End Medical Data Inference

Title: Agentic AI framework for End-to-End Medical Data Inference Agentische KI-Framework für Ende-zu-Ende medizinische Datenableitung 最终至最终医疗数据推断的AA AA 框架框架 2507.18115v1

Authors (5): Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal, Navin Kumar, Koustuv Saha

Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent, which ensures privacy compliance: the data type is identified first, and the data are then anonymized. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the Model Inference Agent runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.

由于处理前工作流程支离破碎、模型兼容问题和严格的数据隐私限制,在保健领域建设和部署机器学习解决方案仍然昂贵和劳动密集型。在这项工作中,我们引入了一个代理AI框架,将整个临床数据管道从摄入到推断自动化,通过模块化、任务特定代理器系统,从吸收到推断。这些代理器处理结构化和非结构化数据,允许自动选择特征、模式选择和未经人工干预的预处理建议。我们评估了来自老年医学、缓和护理和结肠镜图像的公开数据集系统。例如,在结构化数据(焦虑数据)和无结构化的数据(结肠镜化聚谱数据数据数据数据数据)方面,我们引入了一种代理AI框架,通过接受感化识别器检测文件类型,确保隐私的合规性,我们首先确定数据类型,然后在不进行人工干预的情况下对数据进行匿名处理。我们使用基于嵌入模型的方法,提取所有专栏名称,以及基于多级MDGemma-emma的方法,用于图像数据数据流流数据,在模型中,并用最精选的机序流流流化数据流化数据格式和流化数据流化数据流化工具提供。


Article 73

Title@2025-07-24 (4): A New Pair of GloVes

Title: A New Pair of GloVes Ein neues Paar GloVes 新的地球之对 2507.18103v1

Authors (3): Riley Carlson, John Bauer, Christopher D. Manning

This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.

本报告文件、描述和评价了新的2024年英文GloVe(全球语言代言人)模型。2014年建立的原GloVe模型已被广泛使用,并被认为有用,但语言和世界继续演变,我们认为当前使用可受益于更新模型。此外,2014年模型没有仔细记录使用的确切数据版本和预处理,我们通过记录这些新模型来纠正这一点。我们用Wikipedia、Gigaword和Dolma的子集来培训了两套单词嵌入。通过词汇比较、直接测试和NER任务进行的评估表明,2024年的矢量含有新的文化和语言相关词汇,在类比和相似性等结构任务上可比较,并展示了近期具有时间依赖的NER数据集(如非西方新闻线数据)的性能。
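
The analogy evaluation mentioned in the abstract can be illustrated with a small sketch. It assumes the common vector-offset formulation (find the word closest to `b - a + c` by cosine similarity); the toy vectors, the `TOY` table, and the function names are hypothetical and are not taken from the GloVe release or the report's evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the word whose vector is closest (by cosine) to b - a + c,
    excluding the three query words themselves.
    """
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Hypothetical 3-dimensional toy embeddings, for illustration only.
TOY = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.1, 0.2, -1.0],
}

print(solve_analogy(TOY, "man", "woman", "king"))
```

With real embeddings the same function would be applied to vectors loaded from a GloVe text file; the loading step is omitted here because the file layout of the 2024 release is not described in the abstract.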


Article 74

Title@2025-07-24 (4): Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Title: Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch 长短距离远距神经神经网络和改进课程学习,以在对话中认识情感 2507.15205v2

Authors (3): Xinran Li, Xiujuan Xu, Jiaqi Qiao

Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

交流中情感认知(ERC)是一项实际而具有挑战性的任务。本文件提出了一种新型的多式联运方法,即长短距离图像神经网络(LSDGN) 。基于直接环形图(DAG),它构建了一个长距离平面神经网络和一个短距离平面神经网络,以获得相距遥远和相近言论的多式特征。为了确保长距离和短距离特征在代表性上尽可能不同,同时能够使两个模块之间产生相互影响,我们使用一个差异调节器,并纳入一个比阿芬模块,以促进特征互动。此外,我们提出一个改进课程学习(ICL),以应对数据不平衡的挑战。通过计算不同情感之间的相似性以强调类似情感的转变,我们设计了一个“加权情感转变”指标,并开发一个困难测量器,使培训过程能够优先学习较难的样本。IEMOCAP和MELD数据集的实验结果表明,我们的模型超过了现有的基准。


Article 75

Title@2025-07-24 (4): ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

Title: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit ELITE:加强语言-图像安全毒性评价 2502.04757v3

Authors (8): Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim

Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or they produce inaccurate evaluations. As a result, we find that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but harmless descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.

现有VLIMS的安全基准主要依靠自动化评价方法,但这些方法很难发现隐含的有害内容,或产生不准确的评价。因此,我们发现现有基准的危害程度低,数据模糊,图像-文本组合的多样性有限。为了解决这些问题,我们提议ELITE基准,即VLIMS的高质量安全评价基准,以我们的强化评价方法ELITE评估员ELITE为根据。ELITE评估员明确纳入了毒性分,以准确评估多式联运环境中的危害性,VLITE常常提供具体、有说服力但无害的图像描述。我们利用ELITE评价员从现有基准中过滤模糊和低质量的图像-文本配对,并产生安全和不安全的图像-文本配对的多种组合。我们的实验表明,ELITE评价员与先前的自动化方法相比,与人类评价高度一致,ELITE基准提供了更高的基准质量和多样性。通过引入ELITE,我们为更安全、更有力但无害的VLMS的安全性应用铺平了道路,为真正的安全性评估提供重要工具。


Article 76

Title@2025-07-24 (4): EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

Title: EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework BildungQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen 教育Q:通过多机构对话框架评价LLMS的教学能力 2504.14928v2

Authors (3): Yao Shi, Rongkeng Liang, Yong Xu

Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.

大型语言模式(LLMS)日益成为教育工具,然而,由于教师与学生之间互动的资源密集、环境依赖、方法复杂,评估其教学能力仍具有挑战性。我们引入了教育Q,这是一个多媒介对话框架,通过模拟动态教育情景,有效评估教学能力,由教学、学习和评价的专门代理人组成。测试主要的独立组织(OpenAI、Meta、Google、Anthrotic等)的14个LLMS,涉及13个学科和10个难度层次的1 498个问题,显示教学效力与模型规模或一般推理能力没有线性关系。一些较小的开放源模式在教学环境中比较大的商业对应方表现要好。这一发现凸显了当前评价中的一个关键差距,这种评价将知识的回顾放在互动教学、学习和评价的优先地位之上。我们混合方法的评价,将定量指标与定性分析和专家案例研究相结合,确定了最高业绩模型(例如精密的问询策略、适应性反馈机制)所使用的不同教学优势。人类专家评价表明,78%的人同意我们对有效教学行为进行自动化的质量分析,验证我们下一个方法。


Article 77

Title@2025-07-24 (4): Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Title: Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen 大语言模式统一调整和统一调整适用:在资源限制下的方法和基准 2507.18076v1

Authors (3): Haomin Qi, Zihan Dai, Chengbo Huang

Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks – GLUE, GSM8K, MT-Bench, and HumanEval – using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.

微调大型语言模型(LLMS)因其规模和记忆要求,仍然是一个计算瓶颈。本文件还全面评估了参数高效微调(PEFT)技术,包括LORA、BOFT、LORA-GA和uRNN,并引入了新型混合战略,将BOFT的正方稳定性动态结合到LORA-GA的梯度拉动快速趋同中。混合方法在计算由梯度规范指导的每层适应性更新时,实现了更高的趋同效率和对不同任务的概括化。我们还首次探索将单一的RNNN(uRNN)原则调整到基于变压器的LMS,通过结构统一的制约加强梯度稳定性。关于四大基准(GLUE、GSM8K、MT-Bench和HumanEval)的实证评价使用了7B至405B参数,表明我们的混合方法始终超越了个人PEFT基线,接近全面微调准确性调整,同时在培训时间和记忆使用方面将资源消耗减少2.1倍至50%。这些结论评估,在实际部署中,在实际部署中将混合方法下确定了。


Article 78

Title@2025-07-24 (4): Group Sequence Policy Optimization

Title: Group Sequence Policy Optimization Optimierung der Gruppensequenzpolitik 组序列政策优化 2507.18071v1

Authors (12): Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

本文件介绍群体序列政策优化(GSPO),这是我们为培训大型语言模式而采用的稳定、高效和绩效强化学习算法,与以往采用象征性重要性比率的算法不同,PSPO根据序列概率确定重要性比率,并进行顺序剪切、奖赏和优化,我们证明,与GROP算法相比,PSPO实现了较高的培训效率和绩效,特别是稳定了Mixture-Experts(MOE)RL培训,并有可能简化RL基础设施的设计,PSPO的这些优点促进了最新的Quen3模型的显著改进。
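
The difference between a token-level and a sequence-level importance ratio can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract only states that the ratio is based on sequence likelihood and that clipping happens at the sequence level. The length normalization, the clipping range `eps`, and the group-normalized advantage are assumptions made here.

```python
import math

def sequence_importance_ratio(new_logps, old_logps):
    """Sequence-level ratio from per-token log-probabilities.

    Assumption: the sequence likelihood ratio is length-normalized,
    i.e. exp(mean(new_logp - old_logp)) over the tokens of one response.
    """
    assert new_logps and len(new_logps) == len(old_logps)
    delta = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(delta / len(new_logps))

def clipped_sequence_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate, applied once per sequence."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def group_advantages(rewards):
    """Normalize rewards within the group of responses to one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# Two sampled responses to the same prompt (hypothetical numbers).
advantages = group_advantages([1.0, 0.0])
ratio = sequence_importance_ratio([-0.9, -1.8], [-1.0, -2.0])
print(clipped_sequence_objective(ratio, advantages[0]))
```

Because every token of a response shares one ratio, a single outlier token cannot be clipped independently of the rest of the sequence, which is the design choice the abstract contrasts with token-level ratios.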


Article 79

Title@2025-07-24 (4): BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Title: BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference BlockDialekt: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz BlockDiaect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v5

Authors (2): Wonsuk Jang, Thierry Tambe

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.

大型语言模型(LLMS) 快速增长的大小在记忆使用和计算成本方面提出了重大挑战。 量化权重和激活都能够解决这些问题, 硬件支持的微微缩缩缩放正在形成一个有希望的缓解离子的解决方案。 但是, 现有的方法很难捕捉细块数据分布。 我们提议了BlockDiacle, 这是一种块式细微的混合格式技术, 从一个格式手册中为更好的数据代表性指定了每个区块的最佳数字格式。 此外, 我们引入了 Dialec FP4 4 格式手册, 一种适应不同数据分布的FP4变体( 类似方言的方言) 。 为了高效地利用这个方法, 我们建议了双阶段的方法, 用于在线的 DialectFP4 激活四倍的四分级化。 重要的是, DialectF4 能够确保能源效率, 选择可代表值为与低精度缩缩缩缩缩缩图相匹配的整整数值。 将LLLAMA3- 8B( LLMA2-7B) 的精度模型(LLMA2-7B) 与MFP4格式相比, 模型的精度模型的精度增长为MXFP-4格式, 将比小比小的精度格式, 显示为5.45- Plexmexmexmexmalmax, 的全缩图图图仅为5-x。
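
The idea of assigning each block its own number format from a formatbook can be sketched as below. The two magnitude grids in `FORMATBOOK`, the block size, and the squared-error selection rule are hypothetical choices for illustration; they are not the DialectFP4 variants or the selection procedure from the paper.

```python
def quantize_block(block, formatbook):
    """Choose the format with the smallest squared error for one block.

    Each format is a list of representable magnitudes; the sign is kept
    separately and a per-block scale maps the largest magnitude in the
    block onto the largest representable value.
    Returns (format_name, dequantized_values, squared_error).
    """
    max_abs = max(abs(x) for x in block) or 1.0
    best = None
    for name, grid in formatbook.items():
        scale = max_abs / max(grid)
        dequantized = []
        for x in block:
            # Nearest representable magnitude after scaling.
            mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
            dequantized.append((mag if x >= 0 else -mag) * scale)
        error = sum((x - d) ** 2 for x, d in zip(block, dequantized))
        if best is None or error < best[2]:
            best = (name, dequantized, error)
    return best

# Hypothetical 8-level magnitude grids; not the paper's formatbook.
FORMATBOOK = {
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],
    "fp4_like": [0, 0.5, 1, 1.5, 2, 3, 4, 6],
}

def quantize_blockwise(values, block_size=4, formatbook=FORMATBOOK):
    """Split a flat list into blocks and quantize each one independently."""
    return [
        quantize_block(values[i:i + block_size], formatbook)
        for i in range(0, len(values), block_size)
    ]

for name, deq, err in quantize_blockwise([0.0, 1.0, 2.0, 7.0, 0.5, 1.5, 3.0, 6.0]):
    print(name, deq, round(err, 4))
```

In this toy input the first block is evenly spaced and the second is denser near zero, so the two blocks end up with different formats, which is the behavior that a per-block format choice is meant to exploit.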


Article 80

Title@2025-07-24 (4): TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Title: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien TELEVAL:为中文互动假想中的口语模式设计的一个动态基准 2507.18061v1

Authors (14): Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li

Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs’ effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model’s ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.

近些年来,口语模式(SLM)取得了迅速的进展,同时还制定了许多评价其业绩的基准,然而,大多数现有基准主要侧重于评价可持续土地管理是否能够执行与大型语言模式(LLMs)所处理的任务相类似的复杂任务,往往无法与用户在现实世界的谈话情景中自然互动的方式保持一致;在本文件中,我们建议TELEVAL,这是一个动态基准,专门用来评价可持续土地管理在现实的中国互动环境中作为对话促进者的效力。TELEVAL确定了三个评价层面:明确的语义、语言和隐含的语义以及系统能力。它采用与现实世界使用相一致的对话格式,并单独评估文本和音频产出。TELEVAL尤其侧重于模式从用户的演讲中获取暗示和在没有额外指示的情况下作出适当反应的能力。我们的实验表明,尽管最近取得了进展,但现有的可持续土地管理在自然对话任务方面仍有相当大的改进余地。我们希望TELEVAL能够作为一个直接反映用户经验并促进更能对话的可持续土地管理的发展的以用户为中心的评价框架。


Article 81

Title@2025-07-24 (4): Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

Title: Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias 《LLMM中因果测试性别偏见:职业偏见案例研究》 2212.10678v4

Authors (5): Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin

Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework. Our code and data are available at https://github.com/chenyuen0103/gender-bias.

从大型语言模型(LLMs)中生成的文字显示,对各种人口结构存在各种有害、人性的偏见,这些调查结果激发了旨在理解和衡量这些影响的研究工作;本文件介绍了在生成式语言模型中进行偏见衡量的因果表述;根据这一理论基础,我们概述了设计稳健的偏见基准所需满足的要求清单;然后我们提出了一个称为OccuGender的基准,并提出了调查职业性别偏见的偏见衡量程序;我们在OccuGender上测试了包括Llama、Mistral在内的一些先进的开源LLM,及其经指令微调的版本;结果显示这些模型表现出严重的职业性别偏见;最后,我们讨论了用于减少偏见的提示策略,并扩展我们的因果关系表述,以说明我们框架的可概括性。我们的代码和数据见https://github.com/chenyuen0103/gender-bias。


Article 82

Title@2025-07-24 (4): A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Title: A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle 评估由大语言模型生成的合成数据多面评价框架 2404.14445v2

Authors (3): Yefeng Yuan, Yuhong Liu, Liang Cheng

The rapid advancements in generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data, particularly in the realm of structured tabular formats, such as product reviews. Despite the potential benefits, concerns regarding privacy leakage have surfaced, especially when personal information is utilized in the training datasets. In addition, there is an absence of a comprehensive evaluation framework capable of quantitatively measuring the quality of the generated synthetic data and their utility for downstream tasks. In response to this gap, we introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data via a suite of diverse evaluation metrics. We validate the efficacy of our proposed framework - SynEval - by applying it to synthetic product review data generated by three state-of-the-art LLMs: ChatGPT, Claude, and Llama. Our experimental findings illuminate the trade-offs between various evaluation metrics in the context of synthetic data generation. Furthermore, SynEval stands as a critical instrument for researchers and practitioners engaged with synthetic tabular data, empowering them to judiciously determine the suitability of the generated data for their specific applications, with an emphasis on upholding user privacy.

基因化的AI和大型语言模型(LLMS)的迅速发展为合成数据的生产开辟了新的途径,特别是在结构化的表格格式领域,如产品审查。尽管可能带来好处,但是对隐私泄漏的担忧已经浮现出来,特别是在培训数据集中利用个人信息的情况下。此外,缺乏能够量化计量所生成的合成数据质量及其在下游任务中的效用的综合评价框架。针对这一差距,我们引入了SynEval,这是一个开放源评价框架,目的是通过一套不同的评估指标,评估合成生成的表格数据的准确性、实用性和隐私性。我们验证了我们拟议框架——SynEval——的有效性,将它应用到三个最先进的LMS:ChatGPT、Claude和Llama综合数据的综合产品审查数据。我们的实验结果说明了在合成数据生成方面各种评价指标之间的取舍。此外,SynEval是从事合成表格数据的研究人员和从业人员的关键工具,使他们能明智地确定所生成的数据是否适合其具体应用,强调用户的隐私。


Article 83

Title@2025-07-24 (4): Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Title: Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs 使用LLMMs以多种写作风格生成的隐私-保护合成审查 2507.18055v1

Authors (6): Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs’ capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.

虽然合成数据为实际世界数据提供了一种成本效益高、可扩展的替代方法,以便利模式培训,但其多样性和隐私风险仍未得到充分探讨。我们以基于文本的合成数据为重点,提出了一套综合指标,用于定量评估多种合成数据(即语言表达、情绪和用户视角)和若干最新水平的LLM生成的合成数据集的隐私(即重新识别风险和外星体)。实验结果表明,LLMS生成多样性和隐私保护合成数据的能力存在重大限制。在评价结果的指导下,建议采取迅速依据的办法,在保护审查人的隐私的同时,加强合成审查的多样性。
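
One way to make "diversity of linguistic expression" concrete is a distinct-n score, the share of n-grams in a corpus that are unique. The sketch below uses this widely used metric purely as an illustration; the abstract does not say which metrics the authors use, and the whitespace tokenization and the sample reviews are assumptions.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams among all n-grams in a list of texts.

    Values near 1.0 indicate varied wording; values near 0.0 indicate
    that the same phrases are repeated across the corpus.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Hypothetical synthetic reviews.
repetitive = ["great phone great phone", "great phone"]
varied = ["battery lasts two days", "screen scratches far too easily"]
print(distinct_n(repetitive), distinct_n(varied))
```

A corpus-level score like this complements per-review quality checks, since a generator can produce fluent individual reviews while still reusing the same phrases across the whole set.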


Article 84

Title@2025-07-24 (4): From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

Title: From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen 从假设到出版物:AI-Driven研究支助系统综合调查 2503.01424v3

Authors (14): Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin

Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.

研究是推动人类文明进步的基本过程,但需要研究人员投入大量时间和精力。最近几年,人工智能(AI)技术的迅速发展激励研究人员探索AI如何加速和加强研究。为了监测相关进展,本文件系统地审查了该领域的进展。具体地说,我们将有关研究分为三大类:假设的拟订、假设的验证和手稿出版物。假设的拟订涉及知识综合和假设生成。假设的验证包括科学主张的核实、定理证明和实验的验证。手稿出版物包括手稿的编写和同行审查过程。此外,我们查明并讨论这些领域目前面临的挑战以及今后可能的研究方向。最后,我们还全面概述了支持将AI纳入研究过程的各个领域的现有基准和工具。我们希望这份文件能够成为初学者的入门介绍,并推动未来研究。资源已在https://github.com/zkzhou126/AI-for-Research公布。


Article 85

Title@2025-07-24 (4): RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

Title: RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models EINGEDENK: Ein ungebundener Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle 回顾:对大型愿景-语言模型的无约束资源消费攻击 2507.18053v1

Authors (9): Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Kun Wang, Yang Liu, Junlan Feng

Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (REsource Consumption Attack on Large Vision-LanguagE MoDels), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization, to obtain Output Recall adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce Multi-Objective Parallel Losses to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26 ↑, resulting in an additional 20% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.

资源消耗攻击(RCAs)已成为对部署大语言模型的重大威胁。随着视觉模式的整合,更多的攻击矢量加剧了大型视觉语言模型(LVLM)中RCA的风险。然而,现有的红队研究在很大程度上忽略了视觉投入作为潜在的攻击表面,导致对LVLM中RCA的减缓战略不足。为了弥补这一差距,我们提出RECALLED(REsource Consumption Attack on Large Vision-LanguagE MoDels),这是首个利用视觉模态触发无界RCA的红队方法。首先,我们提出视觉引导优化(Vision Guided Optimization),即一种细粒度的像素级优化,以获得能够诱发重复输出的输出召回(Output Recall)对抗扰动。然后,我们将扰动注入视觉输入,触发无界生成,以实现RCA的目标。此外,我们引入多目标并行损失(Multi-Objective Parallel Losses),用于生成通用攻击模板,并在实施并行攻击时解决优化冲突。实验结果表明,RECALLED使服务响应延迟增加超过26 ↑,并使GPU利用率和内存消耗额外增加20%。我们的研究揭示了LVLM中的安全漏洞,并建立了一个有助于未来针对RCA开展防御研究的红队框架。


Article 86

Title@2025-07-24 (4): Segmentation-free Goodness of Pronunciation

Title: Segmentation-free Goodness of Pronunciation Segmentierungsfreie Güte der Aussprache 读音良好 2507.16838v2

Authors (4): Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Mispronunciation detection and diagnosis (MDD) is an important part of modern computer-aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues, as well as a proper normalization which makes the method applicable to acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.

发音错误检测与诊断(MDD)是现代计算机辅助语言学习(CALL)系统的重要组成部分。在MDD中,音素级发音评估是帮助二语(L2)学习者改善发音的关键。然而,大多数系统都基于某种形式的发音良好度(GOP),需要预先将语音切分为音素单元。这限制了这些方法的准确性,也限制了使用基于CTC的现代声学模型进行评估的可能性。在这项研究中,我们首先提出自对齐GOP(GOP-SA),使经CTC训练的ASR模型能够用于MDD。接下来,我们定义了一种更通用的免对齐方法(GOP-AF),它考虑目标音素的所有可能对齐。我们给出了GOP-AF定义的理论说明、一种解决潜在数值问题的实现,以及适当的归一化,使该方法适用于随时间尖峰程度不同的声学模型。我们在CMU Kids和Speechocean762数据集上提供了大量实验结果,比较了我们方法的不同定义,并估计了GOP-AF对声学模型尖峰程度以及目标音素周围上下文量的依赖。最后,我们在Speechocean762数据上将我们的方法与近期研究进行比较,结果表明由所提方法得到的特征向量在音素级发音评估上达到了最先进的水平。
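
The abstract above describes GOP-AF as taking all possible alignments of the target phoneme into account. For a CTC model, the standard way to sum over every alignment of a label sequence is the CTC forward recursion. The sketch below shows only that recursion on made-up frame posteriors. It is not the paper's GOP-AF definition or normalization, and the names `ctc_forward` and `frame_probs` are illustrative assumptions.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_forward(probs, labels):
    """Probability of `labels` under CTC, summed over all alignments.

    probs[t][k] is the per-frame posterior of symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for lab in labels:
        ext += [lab, BLANK]
    T, S = len(probs), len(ext)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]

    # Valid paths end on the last label or on the trailing blank.
    result = alpha[T - 1][S - 1]
    if S > 1:
        result += alpha[T - 1][S - 2]
    return result


# Toy posteriors: 3 frames, symbols {0: blank, 1: phoneme A, 2: phoneme B}.
frame_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
print(ctc_forward(frame_probs, [1, 2]))
```

A real implementation would work in the log domain to avoid underflow on long utterances, which is the kind of numerical issue the abstract mentions.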


Article 87

Title@2025-07-24 (4): Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Title: Synthetic Data Generation for Phrase Break Prediction with Large Language Model Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell 制作用于大语言模范大语言时段间断预测的合成数据 2507.18044v1

Authors (4): Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim

Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.

目前对断裂短语的预测方法处理文本到语音系统的关键预想方面,但严重依赖音频或文本的大量人文说明,这需要大量人工努力和成本。语言领域的内在变异因语音因素驱动,使获得一致、高质量数据的工作更加复杂。最近,大型语言模型(LLMs)通过生成定制合成数据,同时减少人工批注需求,在应对NLP的数据挑战方面取得了成功。为此,我们探索利用LLM生成合成词断裂说明,与传统说明进行比较,评估多种语文的效力,从而应对人工注解和与语言有关的任务的挑战。我们的研究结果表明,LLM合成数据生成有效地缓解了断裂段预测中的数据挑战,并凸显LMMs作为语言领域可行解决办法的潜力。


Article 88

Title@2025-07-24 (4): GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

Title: GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs GrAInS:LLMs和VLMs的推论时间指导的逐步归属 2507.18043v1

Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.

推理时引导(steering)方法通过在测试时修改内部激活而不更新模型权重,为微调大语言模型(LLM)和视觉语言模型(VLM)提供了一种轻量级替代方案。然而,大多数现有方法依赖固定的全局干预向量,忽视了单个输入词元的因果影响,也未能利用来自模型logits的信息性梯度,在视觉与文本输入贡献不均的多模态场景中尤其如此。为了解决这些局限,我们提出GrAInS,一种同时适用于纯语言和视觉语言模型及任务的推理时引导方法。GrAInS通过积分梯度(Integrated Gradients)进行基于梯度的对比归因,根据词元对偏好输出与非偏好输出的贡献,找出影响最大的前k个正向和负向归因词元。随后利用这些词元构建方向性引导向量,刻画从不良行为到期望行为的语义转变。在推理过程中,GrAInS在词元级归因信号的引导下调整Transformer各层的隐藏激活,并对激活进行归一化以保持表示尺度。这使得无需重新训练或辅助监督即可对模型行为进行细粒度、可解释且模块化的控制。实验表明,GrAInS始终优于微调和现有引导基线:使用Llama-3.1-8B在TruthfulQA上获得13.22%的准确率提升,使用LLaVA-1.6-7B将MMHal-Bench上的幻觉率从0.624降至0.514,并将SPA-VL上的对齐胜率提高8.11%,同时保持模型的流畅性和通用能力。
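
The abstract relies on Integrated Gradients to rank input tokens by influence. As a rough illustration of that attribution step only, the sketch below approximates Integrated Gradients for a toy scalar function with a midpoint Riemann sum and then picks the top-k inputs by absolute attribution. The function `f`, its gradient, and the helper names are assumptions made up for this example. GrAInS itself attributes over model tokens with a contrastive objective, which is not reproduced here.

```python
def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a(x - b)) da."""
    n = len(x)
    avg_grad = [0.0] * n
    for i in range(steps):
        a = (i + 0.5) / steps  # midpoint of the i-th sub-interval
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for j in range(n):
            avg_grad[j] += g[j] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]


def top_k(attributions, k):
    """Indices of the k inputs with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return order[:k]


# Toy differentiable "model" standing in for a logit (hypothetical).
def f(x):
    return x[0] * x[1] + x[2] ** 2


def grad_f(x):
    return [x[1], x[0], 2.0 * x[2]]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, top_k(attributions, 1))
```

A useful sanity check is the completeness property: the attributions should sum to `f(x) - f(baseline)` up to integration error.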


Article 89

Title@2025-07-24 (4): AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark

Title: AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark AIR-Bench:自动异源信息检索基准 2412.13102v4

Authors (9): Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, Zheng Liu

Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.

评测在信息检索(IR)模型的发展中起着至关重要的作用。然而,基于预定义领域和人工标注数据的现有基准,难以以低成本、高效率的方式满足新兴领域的评测需求。为了应对这一挑战,我们提出自动化异构信息检索基准(AIR-Bench)。AIR-Bench有三个关键特征:1)自动化。AIR-Bench的测试数据由大语言模型(LLM)自动生成,无需人工干预。2)异构。AIR-Bench的测试数据面向多样的任务、领域和语言生成。3)动态。AIR-Bench覆盖的领域和语言不断扩充,为社区开发者提供日益全面的评测基准。我们开发了可靠且稳健的数据生成流水线,基于真实世界语料自动创建多样且高质量的评测数据集。我们的结果表明,AIR-Bench生成的测试数据与人工标注的测试数据高度一致,使AIR-Bench成为评测IR模型的可靠基准。AIR-Bench的资源公开于 https://github.com/AIR-Bench/AIR-Bench。


Article 90

Title@2025-07-24 (4): NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

Title: NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database NeuralDB: Skalierung der Wissensbearbeitung in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank NeuralDB:利用神经KV数据库将LLM中的知识编辑扩展到100,000条事实 2507.18028v1

Authors (10): Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu

Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), which allows simultaneous modification of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module operates only when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).

高效编辑大语言模型(LLM)中存储的知识,可以在不进行大规模训练的情况下更新模型。一种可能的解决方案是定位并编辑(Locate-and-Edit,L&E),它允许同时修改大量事实。然而,当编辑规模扩大到数千条时,这类编辑可能损害LLM的通用能力,甚至导致已编辑的事实被遗忘。在本文中,我们将现有的线性L&E方法建模为对键值(KV)数据库的查询。基于这一视角,我们提出NeuralDB,一个将已编辑事实显式表示为神经KV数据库的编辑框架,并配有非线性门控检索模块。特别地,我们的门控模块仅在推理涉及已编辑事实时才起作用,从而有效保留LLM的通用能力。我们使用GPT2-XL、GPT-J(6B)和Llama-3(8B),在ZsRE和CounterFacts数据集上进行了涉及编辑10,000条事实的全面实验。结果表明,NeuralDB不仅在编辑有效性、泛化性、特异性、流畅性和一致性方面表现出色,而且在六项有代表性的文本理解与生成任务上保持了整体性能。进一步的实验表明,即使扩展到100,000条事实(是以往工作的50倍),NeuralDB仍然有效。
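
The abstract frames knowledge editing as a lookup in a key-value store whose output is gated, so that unrelated inputs pass through unchanged. The toy sketch below illustrates that idea with a hard gate on cosine similarity. The similarity measure, the threshold, and every name here are assumptions for illustration, and the paper's actual gating function and residual computation may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def gated_lookup(query, keys, values, threshold=0.9):
    """Return the value of the best-matching key, or zeros if the gate stays closed."""
    best_i, best_sim = -1, -1.0
    for i, key in enumerate(keys):
        sim = cosine(query, key)
        if sim > best_sim:
            best_i, best_sim = i, sim
    if best_i < 0 or best_sim < threshold:
        return [0.0] * len(values[0])  # gate closed: no edit applied
    return values[best_i]


def apply_edit(hidden, query, keys, values, threshold=0.9):
    """Add the retrieved residual to the hidden state (hypothetical update rule)."""
    delta = gated_lookup(query, keys, values, threshold)
    return [h + d for h, d in zip(hidden, delta)]


# Two stored "facts": keys identify them, values are residuals to add.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.5, -0.5], [2.0, 2.0]]
hidden = [1.0, 1.0]

print(apply_edit(hidden, [0.99, 0.05], keys, values))  # query close to key 0
print(apply_edit(hidden, [1.0, 1.0], keys, values))    # query matches no key well
```

The second call leaves `hidden` untouched, which is the property the abstract credits with preserving general abilities.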


Article 91

Title@2025-07-24 (4): Technical Report of TeleChat2, TeleChat2.5 and T1

Title: Technical Report of TeleChat2, TeleChat2.5 and T1 Technischer Bericht von TeleChat2, TeleChat2.5 und T1 TeleChat2、TeleChat2.5和T1技术报告 2507.18013v1

Authors (38): Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models, T1 and TeleChat2.5, are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.

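我们推出最新系列的TeleChat模型:TeleChat2、TeleChat2.5和T1,相比其前身TeleChat有显著升级。尽管模型架构改动很小,新系列通过在预训练和后训练阶段改进训练策略,取得了可观的性能提升。该系列始于TeleChat2,它在10万亿高质量且多样的词元上进行预训练,随后通过监督微调(SFT)和直接偏好优化(DPO)进一步增强能力。TeleChat2.5和T1扩展了这一流程,加入了使用领域专用数据集的持续预训练阶段,并结合强化学习(RL)以提升代码生成和数学推理任务的表现。T1变体面向复杂推理设计,支持长思维链(CoT)推理,在数学和编程方面有显著提升。相比之下,TeleChat2.5侧重速度,提供快速推理。T1和TeleChat2.5的旗舰模型均为115B参数的稠密Transformer架构,与原始TeleChat相比,在推理和通用任务性能上取得了显著进步。值得注意的是,T1-115B的表现优于OpenAI的o1-mini和GPT-4o等专有模型。我们公开发布TeleChat2、TeleChat2.5和T1,包括35B和115B参数的后训练版本,为开发者和研究人员提供面向多样应用的先进语言模型。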


Article 92

Title@2025-07-24 (4): GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Title: GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen GRR-CoCa:在多模式建模中利用LLM机制 2507.18009v1

Authors (6): Jake R. Patock, Nicole Catherine Lewis, Kevin McCoy, Christina Gomez, Canling Chen, Lorenzo Luzi

State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.

最先进的(SOTA)图像与文本生成模型是多模态模型,与大语言模型(LLM)有许多相似之处。尽管性能强劲,主流的基础多模态模型架构往往落后于当代LLM的架构成熟度。我们提出GRR-CoCa,一种改进的SOTA对比式描述生成器(Contrastive Captioner,CoCa)模型,它在文本解码器和视觉Transformer(ViT)编码器中引入了高斯误差门控线性单元、均方根归一化和旋转位置嵌入。这些架构改动均已被证明能提升LLM的性能,但尚未在CoCa中采用。我们将GRR-CoCa与Baseline CoCa进行对比,后者采用相同的改进文本解码器,但保留CoCa原有的ViT编码器。我们使用标准的预训练和微调流程,在对比式和生成式任务上对模型进行基准测试。GRR-CoCa在预训练数据集和三个不同的微调数据集上显著优于Baseline CoCa。预训练阶段,对比损失改善27.25%,困惑度改善3.71%,CoCa损失改善7.15%。微调阶段的平均改善为:对比损失13.66%,困惑度5.18%,CoCa损失5.55%。我们表明,GRR-CoCa的改进架构提升了跨视觉-语言领域的性能和泛化能力。
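
Two of the three components named in the abstract, root mean squared normalization and Gaussian error gated linear units, are compact enough to write out. The sketch below uses their commonly cited forms, RMSNorm(x) = g · x / sqrt(mean(x²) + ε) and GEGLU(x) = GELU(W_gate x) ⊙ (W_value x), in plain Python on small vectors. The weight names and toy shapes are assumptions, rotary positional embedding is omitted, and nothing here is taken from the GRR-CoCa code.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def geglu(x, w_gate, w_value):
    """GELU-gated linear unit: GELU(W_gate x) multiplied elementwise by (W_value x)."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


identity = [[1.0, 0.0], [0.0, 1.0]]
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0], eps=0.0)
print(normed)
print(geglu([0.0, 1.0], identity, identity))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so the normalized vector has unit mean square rather than zero mean and unit variance.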


Article 93

Title@2025-07-23 (3): Quantifying the Uniqueness and Divisiveness of Presidential Discourse

Title: Quantifying the Uniqueness and Divisiveness of Presidential Discourse Quantifizierung der Einzigartigkeit und Teilung des Präsidentendiskurses 量化总统意见会的独一性和分散性 2401.01405v2

Authors (7): Karen Zhou, Alexander A. Meitus, Milo Chase, Grace Wang, Anne Mykland, William Howell, Chenhao Tan

Do American presidents speak discernibly differently from each other? If so, in what ways? And are these differences confined to any single medium of communication? To investigate these questions, this paper introduces a novel metric of uniqueness based on large language models, develops a new lexicon for divisive speech, and presents a framework for assessing the distinctive ways in which presidents speak about their political opponents. Applying these tools to a variety of corpora of presidential speeches, we find considerable evidence that Donald Trump’s speech patterns diverge from those of all major party nominees for the presidency in recent history. Trump is significantly more distinctive than his fellow Republicans, whose uniqueness values appear closer to those of the Democrats. Contributing to these differences is Trump’s employment of divisive and antagonistic language, particularly when targeting his political opponents. These differences hold across a variety of measurement strategies, arise both on the campaign trail and in official presidential addresses, and do not appear to be an artifact of secular changes in presidential communications.

美国总统的言论是否明显地不同? 如果是这样的话,用什么方式?这些差异是否局限于任何单一的交流媒介?为了调查这些问题,本文提出了基于大语言模式的独特性的新标准,为分裂性演讲开发了新的词汇,并为评估总统对政治对手讲话的独特方式提供了一个框架。将这些工具应用于各种总统演讲团,我们发现相当多的证据表明唐纳德·特朗普的言论模式与最近历史上所有主要政党被提名总统候选人的言论模式不同。特朗普比他的共和党同胞要突出得多,因为共和党同胞的独特性价值似乎更接近民主党。促成这些差异的原因是特朗普使用分裂性和对抗性的语言,特别是针对政治对手的这种语言。这些差异存在于各种衡量战略中,出现在竞选路线和官方总统讲话中,并且似乎不是总统沟通中世俗变化的神话。


Article 94

Title@2025-07-23 (3): Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Title: Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? Breaking Barriers: Gewinnt die Verstärkung von Posttrainings die Übertragung auf ungesehene Domains? 突破障碍:加强培训后收益是否转移到未知领域? 2506.19733v2

Authors (7): Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit, Jason Benn, Daniel Kang

Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

强化后培训(RPT)最近在提高大型语言模型(LLMs)的推理能力方面显示出了希望,然而,这些改进在推广到新领域方面仍然有多么出色,因为先前的工作评价了RPT关于用于微调的同一领域数据的模型。为了理解RPT的可概括性,我们进行了两项研究。 (1)观察:我们比较了多种领域、包括其微调数据的可见和看不见领域的各种开放的RPT模型与其相应的基准模型。(2)干预:我们在单一领域与RPT一道微调LMs和RPT,并评价其跨多个领域的绩效。这两项研究都得出相同的结论,即尽管RPT在与微调数据相似的任务上带来大量收益,但所取得的收益却前后不一致,并且可能以不同推理模式消失在领域。


Article 95

Title@2025-07-23 (3): Natural Language Processing for Tigrinya: Current State and Future Directions

Title: Natural Language Processing for Tigrinya: Current State and Future Directions Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen 提格里尼亚的自然语言处理:现状和未来方向 2507.17974v1

Authors (2): Fitsum Gaim, Jong C. Park

Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question-answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya’s morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. A curated metadata of the surveyed studies and resources is made publicly available (Tigrinya NLP Anthology: https://github.com/fgaim/tigrinya-nlp-anthology).

尽管有数百万人发言,但蒂格里尼亚州在自然语言处理(NLP)研究中的代表性仍然严重不足,这项工作展示了对蒂格里尼亚州国家语言处理研究的全面调查,分析了2011年至2025年十多年工作期间的40多项研究;我们系统地审查了计算资源、模型和10项不同下游任务的应用现状,包括形态学处理、机器翻译、语音识别和问答。我们的分析揭示了从基础、基于规则的系统到现代神经结构的明确轨迹,其进展始终因资源创造的里程碑而解开。我们查明了源于蒂格里尼亚的形态复杂性和资源稀缺的关键挑战,同时强调了有希望的研究方向,包括形态学建模、跨语言转移以及以社区为中心的资源开发。这项工作既为研究人员提供了全面参考,也为推进蒂格里尼亚国家语言处理提供了路线图。调查研究和资源的整理元数据已公开提供。


Article 96

Title@2025-07-23 (3): LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

Title: LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios LIFBench: Bewertung der Anleitung nach Leistung und Stabilität von großen Sprachmodellen in Langkontextszenarien LIFBench:评价长期设想中大语言模式绩效和稳定性指示 2411.07037v3

Authors (8): Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang

As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs’ instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.

随着大型语言模型(LLMS)在自然语言处理(NLP)过程中的演进,这些大语言模型(LLMS)在自然语言处理(NLP)中稳步遵循长文本输入指令的能力已成为现实世界应用的关键,但是,现有的基准很少侧重于长文本情景中的教学或不同投入的稳定性。为了缩小这一差距,我们引入了LIFBench,这是一个可扩缩的数据集,旨在评估LLMS在长背景中的教学能力和稳定性。LIFBench由三种长文本情景和11项不同任务组成,其中包括2,766项指令,通过一个自动扩展方法在三个方面产生:长度、表达和变量。关于评估,我们建议Liverval,一个基于卢布的评估方法,它使得能够精确、自动地评分复杂的LLMM反应,而无需LLMM的评估和人类判断。这个方法使得能够从多种角度对模型的性能和稳定性进行全面分析。我们每隔六长度对20个著名的LMSM进行详细试验。我们的工作有助于LIFBench和Lifevval作为评估LM在复杂和长文本环境中的业绩的有力工具,为指导LLLM今后发展提供宝贵的见解。


Article 97

Title@2025-07-23 (3): Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation

Title: Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation Mehrsprachige LLMs sind keine Mehrsprachigkeitsdenker: Belege aus Hindi Analogy Evaluation 多语种LLM并非多语种思考者:来自印地语类比评估的证据 2507.13238v2

Authors (3): Ashray Gupta, Rohan Joseph, Sunny Rai

Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

模拟测试模型是否有能力推断各种概念之间的隐含关系,使其成为评估推理能力的关键基准。大型语言模型(LLMs)在英语推理方面得到了广泛的评价,但其在印地语方面的能力仍然没有得到足够的研究,限制了我们对这些模型是否泛泛各语的理解。为了解决这一差距,我们引入了新的印地语人工解析测试组(HATS),由405个来自印度政府考试的多种选择问题组成。我们使用各种推理策略,对最新的多语种LLM作了基准,并引入了利用模拟推理理论认知理论的深层次思维链方法。这种方法提高了印地语类比问题模型的性能。我们的实验显示,无论迅速战略如何,这些模型在英语提示方面表现最佳。我们的测试组解决了缺乏关键资源来评价印地语的LLM推理能力的问题。


Article 98

Title@2025-07-23 (3): Are LLM Belief Updates Consistent with Bayes’ Theorem?

Title: Are LLM Belief Updates Consistent with Bayes’ Theorem? Sind LLM-Belief-Updates im Einklang mit Bayes’ Theorem? LLM的信念更新是否符合贝叶斯定理? 2507.17951v1

Authors (7): Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson

Do larger and more capable language models learn to update their “beliefs” about propositions more consistently with Bayes’ theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes’ theorem. These results have important implications for our understanding and governance of LLMs.

更大、更有能力的语言模型在提出反证时,是否学会更新对贝耶斯理论的“信仰”?为了测试这一点,我们制定了贝耶斯人一致性系数(BCC)指标,并生成了一个用来测量BCC的数据集。我们测量了五个模式家庭在经过训练前的多种单一语言模型中的BCC,比较了模型参数的数量、培训数据的数量和共同基准的得分。我们的结果为我们提供了证据,证明我们的假设是,更大、更有能力的预先培训语言模型所指定的信用与贝耶斯的理论更加一致。这些结果对我们了解和管理LLMS有重要影响。
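
The question in the abstract is whether a model's stated credences move the way Bayes' theorem says they should once evidence is shown in context. The worked example below computes the reference posterior for a single binary proposition and compares it with a hypothetical model-reported credence. The numbers and the name `model_reported_posterior` are invented for illustration, and the simple absolute gap shown here is not the paper's Bayesian Coherence Coefficient.

```python
def bayes_posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(H | E) for a binary hypothesis H given evidence E."""
    evidence = prior * likelihood_if_true + (1.0 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


prior = 0.2                # P(H) before seeing the evidence
likelihood_if_true = 0.9   # P(E | H)
likelihood_if_false = 0.3  # P(E | not H)

posterior = bayes_posterior(prior, likelihood_if_true, likelihood_if_false)

# Hypothetical credence a language model assigns after seeing E in context.
model_reported_posterior = 0.35
gap = abs(model_reported_posterior - posterior)

print(round(posterior, 4), round(gap, 4))
```

In odds form the same update reads posterior odds = prior odds × likelihood ratio, here 0.25 × 3 = 0.75, which corresponds to a posterior of 3/7.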


Article 99

Title@2025-07-23 (3): Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text

Title: Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text Bewertung der Leistungsfähigkeit von KI-Textdetektoren, Few-Shot- und Chain-of-Thought-Prompting mit DeepSeek-generiertem Text 评估AI文本检测器以及少样本和思维链提示在DeepSeek生成文本上的表现 2507.17944v1

Authors (2): Hulayyil Alshammari, Praveen Rao

Large language models (LLMs) have rapidly transformed the creation of written materials. LLMs have led to questions about writing integrity, thereby driving the creation of artificial intelligence (AI) detection technologies. Adversarial attacks, such as standard and humanized paraphrasing, inhibit detectors’ ability to detect machine-generated text. Previous studies have mainly focused on ChatGPT and other well-known LLMs and have shown varying accuracy across detectors. However, there is a clear gap in the literature about DeepSeek, a recently published LLM. Therefore, in this work, we investigate whether six generally accessible AI detection tools – AI Text Classifier, Content Detector AI, Copyleaks, QuillBot, GPT-2, and GPTZero – can consistently recognize text generated by DeepSeek. The detectors were exposed to the aforementioned adversarial attacks. We also considered DeepSeek as a detector by performing few-shot prompting and chain-of-thought reasoning (CoT) for classifying AI and human-written text. We collected 49 human-authored question-answer pairs from before the LLM era and generated matching responses using DeepSeek-v3, producing 49 AI-generated samples. Then, we applied adversarial techniques such as paraphrasing and humanizing to add 196 more samples. These were used to challenge detector robustness and assess accuracy impact. While QuillBot and Copyleaks showed near-perfect performance on original and paraphrased DeepSeek text, others – particularly AI Text Classifier and GPT-2 – showed inconsistent results. The most effective attack was humanization, reducing accuracy to 71% for Copyleaks, 58% for QuillBot, and 52% for GPTZero. Few-shot and CoT prompting showed high accuracy, with the best five-shot result misclassifying only one of 49 samples (AI recall 96%, human recall 100%).

大语言模型(LLM)迅速改变了书面材料的创作方式。LLM引发了对写作诚信的质疑,从而推动了人工智能(AI)检测技术的发展。标准改写和人性化改写等对抗性攻击会削弱检测器识别机器生成文本的能力。以往研究主要关注ChatGPT和其他知名LLM,并显示不同检测器的准确率参差不齐。然而,关于近期发布的LLM DeepSeek,文献中存在明显空白。因此,在这项工作中,我们考察六种普遍可用的AI检测工具(AI Text Classifier、Content Detector AI、Copyleaks、QuillBot、GPT-2和GPTZero)能否稳定识别DeepSeek生成的文本,并让这些检测器接受上述对抗性攻击。我们还通过少样本提示和思维链(CoT)推理,将DeepSeek本身用作检测器,对AI文本和人类撰写的文本进行分类。我们收集了LLM时代之前的49组人类撰写的问答对,并使用DeepSeek-v3生成对应的回答,得到49个AI生成样本。随后,我们应用改写和人性化等对抗技术,又增加了196个样本,用于检验检测器的鲁棒性并评估对准确率的影响。QuillBot和Copyleaks在原始和改写后的DeepSeek文本上表现近乎完美,而其他工具,尤其是AI Text Classifier和GPT-2,结果并不稳定。最有效的攻击是人性化改写,它使Copyleaks的准确率降至71%,QuillBot降至58%,GPTZero降至52%。少样本和CoT提示表现出较高的准确率,最佳的五样本结果在49个样本中仅误判一个(AI召回率96%,人类召回率100%)。


Article 100

Title@2025-07-23 (3): LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

Title: LLM Alignment as Retriever Optimization: An Information Retrieval Perspective LLM Alignment als Retriever-Optimierung: Eine Informations-Retrieval-Perspektive LLM 对齐作为最佳优化:信息检索视角 2502.03699v3

Authors (8): Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik

Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR’s retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO’s effectiveness, with average improvements of 38.9% and 13.7% on AlpacaEval2 and MixEval-Hard, respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.

大型语言模型(LLMS)已经将人工智能与推理、编码和交流能力实现革命,推动产业间创新。它们的真正潜力取决于有效调整,以确保正确、可信和道德行为,应对错误信息、幻觉、偏见和滥用等挑战。虽然现有基于强化学习(RL)的调整方法非常复杂,但直接优化方法提供了更简单的替代办法。在这项工作中,我们根据既定的信息检索原则,为LLM调整引入了一种新的直接优化方法。我们提出了一个系统框架,将LLM调整和IR方法、LM生成和奖励模型与IR的检索者再校准模式连接起来。在此基础上,我们建议LLM调整作为累录最佳化(LarPO)的新调整方法,提高总体调整质量。广泛的实验分别以38.9%和13.7%的平均改进了AlpacaEval2和MixEval-Hard。我们的工作开辟了新的途径,通过整合IR基金会,为未来研究提供有希望的方向,推进LM调整。


Article 101

Title@2025-07-23 (3): Analyzing Fairness of Computer Vision and Natural Language Processing Models

Title: Analyzing Fairness of Computer Vision and Natural Language Processing Models Analyse der Fairness von Computer Vision und natürlichen Sprachverarbeitungsmodellen 分析计算机视觉和自然语言处理模式的公平性 2412.09900v3

Authors (3): Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb

Machine learning (ML) algorithms play a critical role in decision-making across various domains, such as healthcare, finance, education, and law enforcement. However, concerns about fairness and bias in these systems have raised significant ethical and social challenges. To address these challenges, this research utilizes two prominent fairness libraries, Fairlearn by Microsoft and AIF360 by IBM. These libraries offer comprehensive frameworks for fairness analysis, providing tools to evaluate fairness metrics, visualize results, and implement bias mitigation algorithms. The study focuses on assessing and mitigating biases for unstructured datasets using Computer Vision (CV) and Natural Language Processing (NLP) models. The primary objective is to present a comparative analysis of the performance of mitigation algorithms from the two fairness libraries. This analysis involves applying the algorithms individually, one at a time, in one of the stages of the ML lifecycle, pre-processing, in-processing, or post-processing, as well as sequentially across more than one stage. The results reveal that some sequential applications improve the performance of mitigation algorithms by effectively reducing bias while maintaining the model’s performance. Publicly available datasets from Kaggle were chosen for this research, providing a practical context for evaluating fairness in real-world machine learning workflows.

机器学习(ML)算法在保健、金融、教育和执法等各个领域的决策中发挥着关键作用,然而,对于这些系统中的公平和偏见的关切提出了重大的道德和社会挑战。为应对这些挑战,这项研究利用了两个著名的公平图书馆,即微软的Fairlearn图书馆和IBM的AIF360。这些图书馆为公平分析提供了全面框架,提供了评价公平度、可视化结果和实施减少偏向算法的工具。研究的重点是评估和减少利用计算机视野和自然语言处理(NLP)模型对非结构化数据集的偏差。主要目的是对两个公平图书馆的缓解算法的绩效进行比较分析。这一分析涉及在ML生命周期的一个阶段、预处理、处理、后处理、以及一个以上阶段连续应用算法。研究结果表明,一些顺序应用提高了减缓算法的性,既有效减少偏差,又维持模型的性能。从卡格公司为评估这一实际数据流环境而选择了实际数据流。
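
Fairness libraries such as the two named in the abstract report group fairness metrics computed from predictions and a sensitive attribute. One of the simplest is the demographic parity difference, the gap between the highest and lowest selection rate across groups. Fairlearn ships a metric with this name; the sketch below is a standalone re-implementation on made-up data and does not call either library, so the function signature and the toy labels are assumptions.

```python
def selection_rates(y_pred, groups):
    """Fraction of positive predictions within each sensitive group."""
    by_group = {}
    for pred, group in zip(y_pred, groups):
        by_group.setdefault(group, []).append(pred)
    return {g: sum(preds) / len(preds) for g, preds in by_group.items()}


def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy binary predictions for two groups of four individuals each.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(selection_rates(y_pred, groups))
print(demographic_parity_difference(y_pred, groups))
```

Computing the metric before and after a mitigation step, at any of the pre-processing, in-processing, or post-processing stages, is one way to quantify how much bias a given algorithm removed.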


Article 102

Title@2025-07-23 (3): Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation

Title: Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation Bob’s Confetti: phonetische Erinnerungsangriffe in Musik und Videogenerierung Bob’s Confetti:音乐和视频生成中的语音记忆攻击 2507.17937v1

Authors (6): Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem’s famous “mom’s spaghetti” $\rightarrow$ “Bob’s confetti”). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video – including character appearance and scene composition – despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (jrohsc.github.io/music_attack/).

歌词到歌曲(LS2)生成模型承诺从文本进行端到端的音乐合成,但它们对训练数据记忆的脆弱性仍未得到充分探讨。我们引入了对抗性语音提示(Adversarial PhoneTic Prompting,APT),这是一种新颖的攻击:通过同音替换在语义上改变歌词,同时保留其声学结构(例如,Eminem著名的“mom’s spaghetti”$\rightarrow$“Bob’s confetti”)。尽管存在这些扭曲,我们发现了一种强大的亚词汇记忆形式:SUNO和YuE等模型重新生成的输出与已知训练内容惊人地相似,在CLAP、AudioJudge和CoverID等音频域指标上达到了高度相似性。这种脆弱性存在于多种语言和音乐流派之中。更令人惊讶的是,我们发现仅凭音素改变的歌词就能触发文本到视频模型中的视觉记忆。当使用Lose Yourself经语音修改的歌词作为提示时,Veo 3重建了原始音乐视频中的视觉元素,包括人物外观和场景构图,尽管提示中没有任何视觉线索。我们将这种现象称为语音到视觉的反刍。总之,这些发现揭示了以转录文本为条件的多模态生成中的一个关键漏洞:仅凭语音提示就能解锁被记忆的视听内容,这对现代生成系统中的版权、安全和内容来源提出了紧迫的问题。示例生成结果可在我们的演示页面(jrohsc.github.io/music_attack/)上查看。


Article 103

Title@2025-07-23 (3): One Whisper to Grade Them All

Title: One Whisper to Grade Them All Ein Whisper, um sie alle zu bewerten 一次低口低口低口低语到年级 2507.17918v1

Authors (6): Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo

We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system’s main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.

我们为2025年的“发言与改进挑战”开发了多部分第二语言测试的全自动语音评估(ASA)的高效端对端方法。我们系统的主要新颖之处是能够用单一的Whisper-Small编码器处理所有四种口语回应,通过轻量级聚合器将所有信息组合起来,并预测最后得分。这个结构可以消除对抄录和逐个模式的需求,缩短推论时间,并使ASA适用于大规模计算机辅助语言学习系统。我们的系统实现了0.384的根中位平方错误(RMSE),超过了基于文本的基线(0.44),同时使用了最多168M参数(约70%的Whisper-Smal ) 。此外,我们提出了一个数据抽样战略,允许模型仅培训44.8%的讲者,但仍达到0.383 MSE,表明在不平衡的班级上表现更好,数据效率强。
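
The headline numbers above (0.384 and 0.383 against a 0.44 baseline) are Root Mean Squared Errors. As a reminder of how that metric is computed, here is a minimal sketch; the score values are invented for illustration and are not from the Speak & Improve data.

```python
import math


def rmse(predicted, reference):
    """Root Mean Squared Error between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up holistic scores for four hypothetical test takers.
predicted_scores = [3.5, 4.0, 2.5, 5.0]
reference_scores = [3.0, 4.0, 3.0, 4.5]
print(f"RMSE: {rmse(predicted_scores, reference_scores):.3f}")
```

Because errors are squared before averaging, a few large misses on rare score bands raise RMSE more than many small ones, which is one reason sampling strategies for imbalanced classes can matter for this metric.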


Article 104

Title@2025-07-23 (3): Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data

Title: Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data Diskriminative Feinsteuerung generativer großer Sprachmodelle ohne Belohnungsmodelle und menschliche Präferenzdaten 在没有奖励模型和人类偏好数据的情况下对生成式大语言模型进行判别式微调 2502.18679v3

Authors (9): Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang

Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, aiming for data prediction instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance better than SFT and comparable to, if not better than, SFT$\rightarrow$PO. The code can be found at https://github.com/Optimization-AI/DFT.

受监督的微调(SFT)已成为使用投入输出对配的受监督数据集调整预先培训的大语言模型(LLMS)的关键步骤。然而,尽管受到监督,SFT本质上受到基因培训目标的限制。为了解决其局限性,现有的共同战略是遵循SFT, 其一个单独的优惠优化阶段(PO),它依靠人为标签优惠数据或强有力的奖励模式来指导学习过程。在本文中,我们探讨SFT的局限性,方法是探索常规监督学习中最成功的技术之一:歧视性学习。我们引入了差异性精细微调(DFT),这是SFT的一个改进的变式,它减轻了收集人类标签优惠数据或培训强大奖赏模式的负担。与采用基因化方法和忽略负面数据的SFT不同的是,DFT采用了一种歧视性的范式范式,它既能增加积极解答的可能性,又能抑制潜在的负面答案,目的是进行数据预测而不是象征性的预测。我们的贡献包括:(i)一个在微调LMS-TuD(FMs)方面有区别的精准性稳定化框架框架框架,通过明确的模拟,而不是最精确的S-imalimalimalimalvialviii)实现一种可能实现一种更好的投入。
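
To make the idea of a "discriminative likelihood of an answer among all possible outputs" more concrete, here is a minimal sketch. It assumes a scalar score per (input, answer) pair and approximates the full output space with a small set of sampled negative answers; the scoring function, the temperature, and the sampling are assumptions for illustration and are not the estimator described in the paper.

```python
import math


def discriminative_nll(pos_score, neg_scores, temperature=1.0):
    """Negative log-probability of the positive answer under a softmax
    over the positive answer and a set of sampled negative answers."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]


# Hypothetical scores, e.g. sequence log-likelihoods from the model.
baseline = discriminative_nll(0.0, [0.0])
better_positive = discriminative_nll(2.0, [0.0])
stronger_negative = discriminative_nll(0.0, [2.0])
print(baseline, better_positive, stronger_negative)
```

Minimizing this loss pushes the positive score up and the negative scores down at the same time, whereas a purely generative objective only raises the likelihood of the reference answer.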


Article 105

Title@2025-07-23 (3): VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL

Title: VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL VeriMinder: Eindämmung analytischer Schwachstellen in NL2SQL VeriMinder:减轻NL2SQL分析脆弱性 2507.17896v1

Authors (2): Shubham Mohole, Sainyam Galhotra

Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This positive development has also brought an urgent challenge: helping users who may lack a background in statistical analysis to formulate bias-free analytical questions. Although significant research has focused on text-to-SQL generation accuracy, addressing cognitive biases in analytical questions remains underexplored. We present VeriMinder (https://veriminder.ai), an interactive system for detecting and mitigating such analytical vulnerabilities. Our approach introduces three key innovations: (1) a contextual semantic mapping framework for biases relevant to specific analysis contexts; (2) an analytical framework that operationalizes the Hard-to-Vary principle and guides users in systematic data analysis; and (3) an optimized LLM-powered system that generates high-quality, task-specific prompts using a structured process involving multiple candidates, critic feedback, and self-reflection. User testing confirms the merits of our approach. In direct user experience evaluation, 82.5% of participants reported that the system positively impacted the quality of their analysis. In comparative evaluation, VeriMinder scored significantly higher than alternative approaches, at least 20% better on metrics of the analysis’s concreteness, comprehensiveness, and accuracy. Our system, implemented as a web application, is designed to help users avoid the “wrong question” vulnerability during data analysis. The VeriMinder code base with prompts (https://reproducibility.link/veriminder) is available as MIT-licensed open-source software to facilitate further research and adoption within the community.

利用自然语言界面对数据库的应用系统(NLIDBs)实现了数据分析的民主化。这一积极的发展也带来了一项紧迫的挑战,以帮助那些在统计分析中没有背景背景的情况下使用这些系统的用户制定不带偏见的分析问题。尽管大量研究侧重于文本到SQL生成的准确性,但解决分析问题中的认知偏差问题仍未得到充分探讨。我们介绍了VeriMinder, https://veriminder.ai,一个用于发现和减轻这种分析脆弱性的互动系统。我们的方法引入了三个主要创新:(1)一个针对具体分析背景的语义绘图框架;(2)一个在系统数据分析中操作硬到瓦里原则并指导用户的分析性框架;(3)一个优化的LLM动力系统,利用一个有多个候选人参与的结构化过程、批评反馈和自我反射的系统,解决分析方法的优点。在直接用户经验评估中,82.5%的参与者报告对分析质量产生了积极影响。在比较评估中,Verimirender大大高于替代方法,在系统内部至少20%的数据分析中,在数据库中进行精确性分析时,在数据库中,在数据库中将“我们现有的数据库中,对数据库进行精确性分析时,对数据库进行更精确性分析时,对数据库进行更精确性分析。在数据库进行精确性分析。


Article 106

Title@2025-07-23 (3): Weak-to-Strong Jailbreaking on Large Language Models

Title: Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-starkes Gefängnis mit großen Sprachmodellen 关于大语言模型的弱至强强监狱破解 2401.17256v5

Authors (7): Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

大型语言模型(LLMs)很容易受到越狱攻击 — — 导致有害、不道德或偏颇的文字世代。 但是,现有的破狱方法在计算上成本很高。 在本文中,我们建议对匹配的LLMs进行弱至强的侵入性攻击,对匹配的LLMs进行高效的推断时间攻击,以便生成有害的文字。我们的关键直觉是基于这样的观察,即:在最初的解码分布中,监禁和校正模式只是不同。弱至强的攻击的关键技术洞察是使用两个较小的模型(安全和不安全的模型)来对一个大得多的安全模型的解码概率进行对抗性修改。我们建议对3个组织的5个不同的开放源LMs进行较弱至强的攻击进行评估。结果显示,我们的方法可以将两个数据集的误差率提高到99%以上,每个只有一个前传。我们的研究揭示了一个在调整LMS时需要解决的紧迫的安全问题。作为初步尝试,我们提议了一项防御战略来保护这种攻击,但创造更先进的防御性。 复制该方法的代码可在 https://github.com/XuandongZhao/weak-to-strong 获取


Article 107

Title@2025-07-23 (3): FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Title: FLEXITOKENS: Flexible Tokenization for Evolving Language Models FLEXITOKENS: Flexible Tokenisierung für sich entwickelnde Sprachmodelle FLEXITOKENS: 不断演变的语言模式灵活化 2507.12720v2

Authors (3): Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar

Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens

语言模型( LMS) 难以通过简单的微调适应新的数据发布。 这是因为子名符号符号的僵硬性, 在适应期间通常保持不变。 这种僵硬性化往往导致无效率的象征化, 导致分配外域、 隐蔽语言或脚本的过度分化。 在这项工作中, 我们开发了字节LMS, 配有可学习的象征化符号, 以适应象征性化。 我们的模式包括一个子模块, 学会预测输入字节序列之间的界限, 将其编码为可变长段。 现有的无代号符号方法使用辅助性损失来培训这个边界预测器, 将固定压缩率用于整个培训单元, 引入一种新的僵硬性。 我们提议FLEXITOKENS, 简化的培训目标, 使得适应期间的灵活度大得多。 我们用多种多语言基准、 形态多样的任务和领域来评估 FLEXITOKENS, 我们证明FLEXITOKENS 不断减少象征性的过度分裂性, 并实现下游任务绩效的10%的改进。 在子词和其他梯度/ massimizerforizerus/ dustormatols/ data 将发布数据。
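
The boundary-prediction idea above can be illustrated with a small sketch: given a per-byte boundary probability (here hard-coded, standing in for the learned submodule), the byte sequence is cut into variable-length segments and the resulting compression rate is measured as bytes per segment. The threshold, function names, and probabilities are assumptions for illustration, not the FLEXITOKENS implementation.

```python
def segment_bytes(data, boundary_probs, threshold=0.5):
    """Close a segment after every byte whose boundary probability
    reaches the threshold; any trailing bytes form a final segment."""
    if len(data) != len(boundary_probs):
        raise ValueError("need one boundary probability per byte")
    segments, start = [], 0
    for i, prob in enumerate(boundary_probs):
        if prob >= threshold:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])
    return segments


def compression_rate(data, segments):
    """Average number of bytes per segment (0.0 for empty input)."""
    return len(data) / len(segments) if segments else 0.0


text = "flexible tokens".encode("utf-8")
# Hypothetical predictor output: confident boundaries after "flexible",
# after the space, and after "tokens".
probs = [0.1] * len(text)
for index in (7, 8, 14):
    probs[index] = 0.9

segments = segment_bytes(text, probs)
print(segments, compression_rate(text, segments))
```

An auxiliary loss that forces the same bytes-per-segment ratio on every corpus is the rigidity the abstract criticizes; a flexible objective would let this ratio differ across languages and domains.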


Article 108

Title@2025-07-23 (3): Dynamic and Generalizable Process Reward Modeling

Title: Dynamic and Generalizable Process Reward Modeling Dynamische und generalisierbare Prozess-Reward-Modellierung 动态和可通用流程奖励模型 2507.17849v1

Authors (6): Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang

Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.

在复杂情况下,通过提供密集的奖赏信号指导大语言模型(LLMs)对于指导复杂情况下的大语言模型(LLMs)至关重要,然而,现有的PRMs主要依赖与跨领域一般化斗争的超光化方法。虽然LLM-法官建议提供普遍奖励,但目前的研究主要侧重于反馈结果,忽视案文中所包含的有意义的指导。此外,静态和粗糙的评价标准难以适应复杂的程序监督。为了应对这些挑战,我们提议采用动态和通用的进程模型(DG-PRM),它具有一种奖励树来捕获和储存精细的、多维的奖赏标准。DG-PRM动态地选择奖励信号用于逐步奖励。为了处理多方面的奖赏信号,我们先行采用Pareto支配地位估计,以确定有区别的正对和负对。实验结果表明DG-PRM在现行基准上取得了惊人的业绩,大大提升了在密集的奖赏下跨任务的示范业绩。进一步分析表明DG-PRM在分配的情景上进行了良好的调整,以超乎一般性。
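
Pareto dominance over multi-dimensional reward vectors can be sketched as follows: one candidate step dominates another when it is at least as good on every criterion and strictly better on at least one, and each dominating/dominated pair can then serve as a positive/negative pair. The criteria, scores, and helper names below are hypothetical; the abstract does not specify how DG-PRM estimates dominance in practice.

```python
def dominates(a, b):
    """True if reward vector a Pareto-dominates reward vector b."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_pairs(candidates):
    """Return (positive, negative) name pairs where the first dominates the second."""
    pairs = []
    for name_a, vec_a in candidates.items():
        for name_b, vec_b in candidates.items():
            if name_a != name_b and dominates(vec_a, vec_b):
                pairs.append((name_a, name_b))
    return pairs


# Hypothetical scores on three made-up criteria (e.g. correctness, clarity, relevance).
steps = {
    "step_a": (0.9, 0.8, 0.7),
    "step_b": (0.6, 0.8, 0.5),
    "step_c": (0.95, 0.3, 0.6),
}
print(pareto_pairs(steps))
```

Note that `step_a` and `step_c` are incomparable: each wins on a different criterion, so neither forms a training pair with the other. That is the property that makes dominance a conservative way to pick discriminative pairs from multifaceted rewards.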


Article 109

Title@2025-07-23 (3): Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Title: Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning Shop-R1: Belohnende LLMs, um menschliches Verhalten im Online-Shopping durch Verstärkungslernen zu simulieren 商店R1:通过强化学习在网上购物中模拟人类行为奖励LMs 2507.17842v1

Authors (17): Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang

Large Language Models (LLMs) have recently demonstrated strong potential in generating ‘believable human-like’ behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.

大型语言模型(LLMS)最近展示了在网络环境中产生“令人相信的类似人的行为”的强大潜力。先前的工作探索了利用LLM合成的原理来增加培训数据,并应用监督的微调(SFT)来提高推理能力,这反过来可以改进下游行动预测。然而,这些方法的绩效仍受用于产生推理的模型推理能力的内在约束。在本文件中,我们引入了Shopp-R1(RL)新颖的强化学习(RL)框架,旨在加强LMS在网上购物环境中模拟真实人类行为的推理能力。具体来说,Shopple-R1将人类行为模拟任务分为两个阶段:理由生成和行动预测,每个阶段都有不同的奖赏信号指导。关于推理能力的生成,我们利用内部模型信号(例如,逻辑分布)来以自我监督的方式指导推理过程。关于行动预测,我们建议一个有难度的等级奖赏结构,以阻止黑客并进行精细的奖赏任务。这一设计既评价高层次的行动类型,又将人类行为模拟任务分成细微的细微的细微的细分析结果。
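
One way to picture a hierarchical reward with difficulty-aware scaling is sketched below: the action type is checked first, sub-action details earn partial credit only when the type is right, and the total is multiplied by a per-type difficulty weight so that easy actions cannot dominate the reward. The action names, weights, and reward split are all hypothetical choices for illustration and are not values from the paper.

```python
# Hypothetical difficulty weights: harder action types earn larger rewards.
DIFFICULTY = {"terminate": 0.5, "click": 1.0, "type_and_submit": 2.0}


def action_reward(pred, gold, type_reward=0.3, detail_reward=0.7):
    """Hierarchical reward: action type first, then attribute/value details."""
    if pred["type"] != gold["type"]:
        return 0.0  # wrong high-level action: no credit for details
    reward = type_reward
    gold_details = gold.get("details", {})
    if gold_details:
        pred_details = pred.get("details", {})
        matched = sum(1 for key, value in gold_details.items()
                      if pred_details.get(key) == value)
        reward += detail_reward * matched / len(gold_details)
    else:
        reward += detail_reward  # nothing further to get wrong
    return DIFFICULTY.get(gold["type"], 1.0) * reward


gold = {"type": "type_and_submit",
        "details": {"field": "search_box", "text": "running shoes"}}
exact = {"type": "type_and_submit",
         "details": {"field": "search_box", "text": "running shoes"}}
partial = {"type": "type_and_submit",
           "details": {"field": "search_box", "text": "shoes"}}
wrong_type = {"type": "click", "details": {"field": "search_box"}}

print(action_reward(exact, gold), action_reward(partial, gold),
      action_reward(wrong_type, gold))
```

Under this scheme a policy that always predicts the easiest action collects only small rewards, which is the intuition behind using difficulty scaling against reward hacking.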


Article 110

Title@2025-07-23 (3): Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Title: Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks Das Vortraining auf dem Testset ist nicht länger alles, was Sie brauchen: Ein debattegetriebener Ansatz zu QA-Benchmarks 有关测试成套标准的培训前培训并非你需要的更长时间:对质量评估基准采取辩论驱动的办法 2507.17747v1

Authors (2): Linbo Cao, Jinman Zhao

As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates, in which one model is given the official answer to defend and another constructs and defends an alternative answer, adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm’s effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination: a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (from 50% to 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems at a fraction of the cost of creating new benchmarks. Overall, our framework underscores that “pretraining on the test set is no longer all you need,” offering a sustainable path for measuring the genuine reasoning ability of advanced language models.

由于前沿语言模式日益饱和标准质量评估基准,对数据污染、记忆化和不断上升的数据创建成本的关切不断加剧,因此,我们提出一个辩论驱动的评价模式,将现有的质量评估数据集转化为结构化的对立辩论 – – 在一个模式得到正式的辩护答案的地方,另一个模型构建和捍卫一个替代的答案 – – 答案 – – 由无法正确解决方案的法官模型所决定;通过强制多角度论证,这种方法大大增加了难度,同时惩罚了浅度记忆化,但又重新使用质量A项目以减少调理管理间接费用。我们做出了两个主要贡献:(1) 一个评价管道,将现有的质量评估任务系统化地转换成基于辩论的评估,以及(2) 一个公共基准,表明我们的模式在一组MMMLU-Pro问题上的有效性,由标准协议和参考模型加以完善。 实证结果验证了该方法的稳健性及其在数据污染-a Llama 3.1模型上的效力,通过对测试问题进行微调,显示出了惊人的准确性改进(50% - > 82%),但在辩论中表现得更差。结果还表明,更弱的评价管道能力强的法官可以将更强的逻辑性辩论框架作为新的标准。


Article 111

Title@2025-07-23 (3): Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Title: Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains Rubriken als Belohnungen: Verstärktes Lernen jenseits überprüfbarer Domänen 以评分标准作为奖励:超越可验证领域的强化学习 2507.17746v1

Authors (6): Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx

Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.

将强化学习与可验证的奖励(RLVR)扩大到现实世界的任务往往需要平衡客观和主观的评价标准。然而,许多这类任务缺乏单一、明确的实地真相,难以为培训后语言模式确定可靠的奖赏信号。传统的优惠制方法提供了一种变通办法,但它们依赖不透明的奖赏功能,难以解释,容易产生虚假的关联。我们引入了$textbf{rubrics作为奖赏$(RAR),这个框架使用结构化的、清单式的标志作为与GROP进行政策培训的可解释的奖赏信号。我们的最佳奖赏方法在健康Bench-1k上比简单的类似奖赏方法取得28美元相对的改善,同时匹配或超过从专家编写的参考资料中获得的奖赏信号的性能。我们通过将奖赏作为结构化奖赏信号来对待,我们表明拉R使规模较小的法官模型能够更好地与人类的偏好并保持各种模式的强性能。
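
A checklist-style rubric can be turned into a scalar reward in several ways; the sketch below shows one simple option, a weighted fraction of satisfied items. The rubric items, weights, and the pre-computed judgments (which in practice would come from a judge model) are hypothetical and are not taken from the paper.

```python
def rubric_reward(judgments, rubric):
    """Weighted fraction of rubric items judged as satisfied, in [0, 1]."""
    total = sum(item["weight"] for item in rubric)
    earned = sum(item["weight"] for item in rubric if judgments[item["id"]])
    return earned / total


# Hypothetical rubric for a medical-advice response.
rubric = [
    {"id": "mentions_red_flags", "weight": 3},
    {"id": "recommends_follow_up", "weight": 2},
    {"id": "plain_language", "weight": 1},
]
# Hypothetical per-item verdicts, standing in for a judge model's output.
judgments = {
    "mentions_red_flags": True,
    "recommends_follow_up": False,
    "plain_language": True,
}
print(f"reward: {rubric_reward(judgments, rubric):.3f}")
```

Unlike a single Likert score, each contribution to this reward can be traced back to a named criterion, which is what makes the signal interpretable.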


Article 112

Title@2025-07-23 (3): Megrez2 Technical Report

Title: Megrez2 Technical Report Technischer Bericht Megrez2 Megrez2 技术报告 2507.17728v1

Authors (15): Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang

We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model’s capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture to achieve a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.

我们介绍Megrez2,这是新颖的轻量级和高性能语言模型结构,为本地装置部署优化。Megrez2引入了新型的跨层专家共享机制,通过在相邻变压器层重复使用专家模块,大幅降低总参数计数,同时保持该模型的大部分能力。它还包含预先设定的路线,使记忆高效的专家负荷和更快的推导。作为Megrez2架构的首次即时化,我们引入了Megrez2-Preview模型,该模型经过五三重体体的预先培训,并通过经过监督的微调和强化学习和可核查的奖励而得到进一步加强。Megrez2-Preview仅存储了3B和7.5B参数,显示与大型模型相比,在广泛的任务中,包括语言理解、教学后、数学推理和代码生成方面,具有竞争力或优性能。这些结果突出表明了Megrez2架构在准确性、效率和可部署性之间实现平衡的有效性,使其成为现实世界、资源限制的应用的强大候选者。
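
The gap between stored and activated parameters can be illustrated with simple arithmetic. The sketch assumes a mixture-of-experts layout in which groups of adjacent layers share one set of experts; the layer count, expert sizes, and group size are invented and are not Megrez2's configuration, and the numbers do not reproduce the 3B/7.5B figures.

```python
import math


def moe_expert_params(num_layers, experts_per_layer, params_per_expert,
                      top_k, share_group=1):
    """Return (stored, active) expert parameter counts.

    share_group is the number of adjacent layers that reuse one expert set;
    share_group=1 means no cross-layer sharing.
    """
    expert_sets = math.ceil(num_layers / share_group)
    stored = expert_sets * experts_per_layer * params_per_expert
    # Each layer still routes every token through top_k experts.
    active = num_layers * top_k * params_per_expert
    return stored, active


# Illustrative configuration only.
unshared = moe_expert_params(24, 8, 10_000_000, top_k=2, share_group=1)
shared = moe_expert_params(24, 8, 10_000_000, top_k=2, share_group=3)
print("stored/active without sharing:", unshared)
print("stored/active with sharing:   ", shared)
```

In this toy setup, sharing cuts stored parameters by the group size while per-token compute stays the same, which matches the abstract's claim that reuse lowers total parameter count while keeping most of the capacity.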


Article 113

Title@2025-07-23 (3): AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer

Title: AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer KI-Telefonvermessung: Quantitative Datenerfassung mit einem KI-Interviewer automatisieren AI 电话测量:与AI 采访者一起自动化定量数据收集 2507.17718v1

Authors (7): Danny D. Leybzon, Shreyas Tirumala, Nishant Jain, Summer Gillen, Michael Jackson, Cameron McPhee, Jennifer Schmidt

With the rise of voice-enabled artificial intelligence (AI) systems, quantitative survey researchers have access to a new data-collection mode: AI telephone surveying. By using AI to conduct phone interviews, researchers can scale quantitative studies while balancing the dual goals of human-like interactivity and methodological rigor. Unlike earlier efforts that used interactive voice response (IVR) technology to automate these surveys, voice AI enables a more natural and adaptive respondent experience as it is more robust to interruptions, corrections, and other idiosyncrasies of human speech. We built and tested an AI system to conduct quantitative surveys based on large language models (LLMs), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, and strictly adhered to research best practices like question order randomization, answer order randomization, and exact wording. To validate the system’s effectiveness, we deployed it to conduct two pilot surveys with the SSRS Opinion Panel and followed up with a separate human-administered survey to assess respondent experiences. We measured three key metrics: the survey completion rates, break-off rates, and respondent satisfaction scores. Our results suggest that shorter instruments and more responsive AI interviewers may contribute to improvements across all three metrics studied.

随着语音辅助人工智能系统的兴起,定量调查研究人员可以使用一种新的数据收集模式:AI电话调查。通过使用AI进行电话访谈,研究人员可以扩大定量研究,同时平衡类似人类互动性和方法严谨的双重目标。与以前使用互动语音响应技术使这些调查自动化的努力不同,声音AI可以让更自然和适应性更强的应答人的经验,因为它对干扰、纠正和人类言论的其他特点更加活跃。我们建立和测试了一种AI系统,以根据大型语言模型(LLLM)、自动语音识别和语音合成技术进行定量调查。该系统专门设计用于数量研究,并严格遵守对问题顺序随机化、回答顺序随机化和准确措辞等最佳做法的研究。为了验证系统的有效性,我们部署了它,与SS意见小组一起进行两次试点调查,随后又进行了一次由人类管理的独立调查,以评估答卷人的经验。我们测量了三种关键指标:调查完成率、断断断率和应答度计分数。我们的调查结果表明,所有短期的仪器和更具响应性的AI访谈方法都有助于三项衡量指标的改进。
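
Question order and answer order randomization with exact wording can be sketched as below. The questionnaire content, the per-respondent seed, and the function name are illustrative assumptions; the abstract does not describe how the deployed system implements these practices.

```python
import random


def randomized_script(questions, respondent_seed):
    """Shuffle question order and each question's answer order for one
    respondent, leaving the wording of every item untouched."""
    rng = random.Random(respondent_seed)  # seeded for reproducibility and audit
    ordered = list(questions)
    rng.shuffle(ordered)
    script = []
    for question in ordered:
        options = list(question["options"])
        rng.shuffle(options)
        script.append({"id": question["id"],
                       "text": question["text"],
                       "options": options})
    return script


# Hypothetical questionnaire.
questions = [
    {"id": "q1", "text": "Which of these do you use most often?",
     "options": ["Phone", "Laptop", "Tablet"]},
    {"id": "q2", "text": "Which news source do you trust most?",
     "options": ["Television", "Newspapers", "Social media", "Radio"]},
    {"id": "q3", "text": "Which season do you prefer?",
     "options": ["Spring", "Summer", "Autumn", "Winter"]},
]

script = randomized_script(questions, respondent_seed=1234)
for item in script:
    print(item["id"], item["options"])
```

Keeping the randomization outside the language model, and passing it the fixed text to read, is one plausible way to combine conversational flexibility with exact wording.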


Article 114

Title@2025-07-23 (3): From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Title: From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes Von Feedback zu Checklisten: Geerdete Bewertung von KI-generierten klinischen Anmerkungen 从反馈到核对表:对AI生成的临床笔记进行基础评价 2507.17717v1

Authors (6): Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters, prepared in accordance with the HIPAA safe harbor standard, from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms baseline approaches in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.

AI 生成的临床记录越来越多地用于医疗保健,但由于专家审查的主观性高和可扩展性有限,评估其质量仍然是一项挑战。现有的自动化指标往往无法与现实世界医生的偏好保持一致。为了解决这个问题,我们建议建立一个管道,系统地将真正的用户反馈纳入结构化的备注评估核对表。这些核对表的设计可以解释,以人类反馈为基础,并由基于LLM的评价员执行。使用根据HIPAAA安全港标准、从部署的AI医疗记录系统编制的21 000多个临床遭遇的分辨数据,我们显示,我们的反馈清单在覆盖、多样性和预测能力方面超过了我们对人类评级的离线性评价的基线方法。广泛的实验证实了清单对于质量降解的稳健性,与临床偏好相当,以及作为一种评价方法的实际价值。在离线式研究环境中,清单可以帮助确定可能低于我们所选择的质量阈值的说明。


Article 115

Title@2025-07-23 (3): Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding Deep Video Discovery: Agentische Suche mit Tool-Nutzung für Langzeit-Video-Verständnis 深视频发现: 用于远程视频理解的工具的 Agric 搜索 2505.18079v3

Authors (7): Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks, which demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.

长方视频理解由于广泛的时间空间复杂性和在如此长的背景下难以回答问题而带来重大挑战。虽然大语言模型(LLMS)在视频分析能力和长背景处理方面表现出了相当大的进步,但它们在处理信息密集一小时长的视频时继续表现出局限性。为了克服这些局限性,我们提议深视频发现代理商利用视频视频剪辑的代理商战略,在片段视频剪辑上利用代理搜索战略。不同于以往的视频代理商手工设计僵硬工作流程,我们的方法强调代理商的自主性质。通过在多语层视频数据库中提供一套搜索中心工具,我们的DVDV代理商利用LM高级推理能力规划其当前观察状态、战略选择工具、制定适当的行动参数,并根据所收集的信息反复完善其内部推理。我们对多个长视频理解基准进行全面评价,以展示整个系统设计的优势。我们的DVDV代理商实现了SOTA的绩效,大大超过以往在具有挑战性的LVBench数据集上的工作。我们的DVDV代理商利用高级推理学研究和深入工具分析,并且根据收集的长式版本/DVSDFSimFSimimforimalimalismaismaimal 的任务也提供了对进一步的深入了解。


Article 116

Title@2025-07-23 (3): TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa

Title: TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa TyDi QA-WANA: Ein Benchmark für die Beantwortung von Informationsanfragen in den Sprachen Westasiens und Nordafrikas Tydi QA-WANA:西亚和北非语言信息查询问题回答基准 2507.17709v1

Authors (4): Parker Riley, Siamak Shakeri, Waleed Ammar, Jonathan H. Clark

We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models’ abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present the performance of two baseline models, and release our code and data to facilitate further improvement by the research community.

我们提出Tydi QA-WANA,这是一个问答数据集,由西亚和北非10种语言的28K个实例组成。数据收集过程旨在引出寻求信息的问题,因为乞丐真正想知道答案。每个问题都与可能包含答案或可能不包含答案的整篇文章相配套;文章的篇幅相对较大,导致一项任务适合于评价模型在回答问题时利用大文本背景的能力。此外,数据是直接用每种语言收集的,没有翻译,以避免文化相关性问题。我们介绍了两个基线模型的性能,并公布了我们的代码和数据,以便利研究界进一步改进。


Article 117

Title@2025-07-23 (3): A Mathematical Theory of Discursive Networks

Title: A Mathematical Theory of Discursive Networks Eine mathematische Theorie diskursiver Netzwerke 讨论网络的数学理论 2507.06565v5

Authors (1): Juan B. Gutiérrez

Large language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.

大型语言模型( LLMs) 将写作变成人类和软件之间的实时交换。 我们把这个新媒体描述为一个不准确的网络, 将人和LLMs视为平等的节点, 并跟踪其声明的传播方式。 我们把错误信息的生成定义为无效( 任何事实、 逻辑或结构性违反) , 并显示它有四种危险: 从真理、 自我修复、 新鲜制造和外部检测中漂移出来。 我们开发了一个迷惑网络的一般数学模型, 显示一个仅受漂移和自我修复制约的网络以微小的错误速度稳定下来。 给每个错误的网络一个很小的同行审查机会, 将系统转换成一个以真理为主的状态 。 我们使用开放源法( FOO) 算法( ) 来进行同行审查: 一个可配置的循环, 任何一组代理人互相批评, 而一个协调者将其判断合并在一起。 我们发现一个道德上的违法现象, 即当人类无法参与不透明网络时会发生。 摘取是实用的和文化的: 新介质介质的介质不是来自完善的单一模式, 而是连接的网络。
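
The stabilisation claim in the abstract can be illustrated with a deliberately simplified two-state recurrence. This is a sketch under assumed notation, not the paper's model: $p_t$ is the fraction of invalid statements at step $t$, $\delta$ is an assumed per-step drift probability (valid to invalid), $\rho$ an assumed self-repair probability, and $\kappa$ an assumed extra probability that external detection corrects an invalid statement. Fresh fabrication is left out for brevity, and $\rho + \kappa \le 1$ is assumed.

```latex
% Drift and self-repair only:
p_{t+1} = (1-\rho)\,p_t + \delta\,(1-p_t)
\quad\Longrightarrow\quad
p^{*} = \frac{\delta}{\delta+\rho}

% Adding external detection (peer review) with probability \kappa:
p_{t+1} = (1-\rho-\kappa)\,p_t + \delta\,(1-p_t)
\quad\Longrightarrow\quad
p^{*}_{\kappa} = \frac{\delta}{\delta+\rho+\kappa} \;<\; p^{*}
```

In this toy version any $\kappa > 0$ lowers the fixed point, which matches the abstract's qualitative claim that even a small chance of peer review shifts the balance toward valid statements. The illustrative symbols carry no values from the paper.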


Article 118

Title@2025-07-23 (3): LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Title: LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v2

Authors (6): Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape over parameters, we attribute the success of LoX to the extrapolation moving LLM parameters to a flatter zone, which makes them less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.

大型语言模型(LLMS)在现实世界应用中变得不可或缺。然而,广泛采用这些模型引起了重大的安全问题,特别是在应对对社会有害的问题时。尽管做出了大量努力,通过调整来改善模型安全,但统一模型仍然可以受到随后微调的破坏,即使额外的培训数据看起来是无害的。在本文件中,我们从经验上证明,这种脆弱性源于LLM参数中安全临界低级别子空间对微调的敏感度。根据这一认识,我们建议采用一种新的无培训方法,称为Low-Rank外推法(LOX),通过对一个匹配的LMM的安全子空间进行外推法,加强安全稳健性。我们的实验结果证实LOX的有效性,在防止良性攻击和恶意微调攻击的同时,在保持模型适应新任务方面都取得了显著的稳健性改进。例如,LOX使面临良性或恶意微调的攻击成功率的绝对下降11%至54%。我们通过调查ASR参数的景观,将LX的成功归因于的成功归因于LMM/LAVX的参数转移到一个不敏感程度。
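
One plausible way to write down "extrapolating the safety subspace" is sketched below. The abstract gives no formula, so every symbol here is an assumption: $W_{\text{base}}$ and $W_{\text{align}}$ are a weight matrix before and after alignment, $k$ is an assumed rank, and $\alpha > 0$ an assumed extrapolation factor.

```latex
\Delta W = W_{\text{align}} - W_{\text{base}}, \qquad
\Delta W = U \Sigma V^{\top} \quad \text{(SVD)}

% Keep only the top-k singular directions of the alignment update:
\Delta W_k = U_{:,1:k}\, \Sigma_{1:k,1:k}\, V_{:,1:k}^{\top}

% Training-free extrapolation along that low-rank subspace:
W_{\text{LoX}} = W_{\text{align}} + \alpha\, \Delta W_k
```

Under this reading no gradient step is needed, which is consistent with the abstract's description of the method as training-free. The repository linked in the abstract is the authority on the actual procedure.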


Article 119

Title@2025-07-23 (3): Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

Title: Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step Können wir Bilder mit CoT generieren? Lassen Sie uns die Bildgenerierung Schritt für Schritt überprüfen und verstärken 我们能用 Cot 生成图像吗? 让我们一步一步地校验和加强图像生成 2501.13926v2

Authors (12): Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, Hongsheng Li, Pheng-Ann Heng

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image, which is the first to incorporate reflection in autoregressive image generation. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT

思维链(CoT)推理已在大型模型中得到广泛探索,用于处理复杂的理解任务。然而,这类策略能否用于验证和强化图像生成场景,仍是一个悬而未决的问题。在本文中,我们首次全面研究了CoT推理在增强自回归图像生成方面的潜力。我们重点关注三种技术:扩展测试时计算以进行验证、利用直接偏好优化(DPO)对齐模型偏好,以及整合这些技术以获得互补效果。我们的结果表明,这些方法可以被有效地调整和组合,从而显著提升图像生成性能。此外,鉴于奖励模型在我们的研究结果中的关键作用,我们提出了专门用于自回归图像生成的潜在评估奖励模型(PARM)和PARM++。PARM通过潜在评估方法自适应地评估每个生成步骤,融合了现有奖励模型的优势;PARM++进一步引入反思机制,对生成的不理想图像进行自我修正,这是首次将反思引入自回归图像生成。利用我们研究的推理策略,我们增强了基线模型Show-o,取得了更优的结果,在GenEval基准上显著提升了24%,比Stable Diffusion 3高出15%。我们希望这项研究能提供独特的见解,并为将CoT推理与自回归图像生成相结合开辟新的途径。代码和模型发布于 https://github.com/ZiyuGuo99/Image-Generation-CoT


Article 120

Title@2025-07-23 (3): Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries

Title: Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries Wer greift an und warum? Mit LLMs negative Kampagnen in 18M Tweets in 19 Ländern identifizieren 利用LLM公司查明18M Tweets 18M Tweets的负面运动,横跨19个国家 2507.17636v1

Authors (2): Victor Hartman, Petter Törnberg

Negative campaigning is a central feature of political competition, yet empirical research has been limited by the high cost and limited scalability of existing classification methods. This study makes two key contributions. First, it introduces zero-shot Large Language Models (LLMs) as a novel approach for cross-lingual classification of negative campaigning. Using benchmark datasets in ten languages, we demonstrate that LLMs achieve performance on par with native-speaking human coders and outperform conventional supervised machine learning approaches. Second, we leverage this novel method to conduct the largest cross-national study of negative campaigning to date, analyzing 18 million tweets posted by parliamentarians in 19 European countries between 2017 and 2022. The results reveal consistent cross-national patterns: governing parties are less likely to use negative messaging, while ideologically extreme and populist parties – particularly those on the radical right – engage in significantly higher levels of negativity. These findings advance our understanding of how party-level characteristics shape strategic communication in multiparty systems. More broadly, the study demonstrates the potential of LLMs to enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts.

负面竞选活动是政治竞争的一个中心特征,然而经验性研究却受到现有分类方法成本高、可扩展性有限等因素的限制。本研究作出了两项关键贡献。首先,采用零点大语言模型(LLMs)作为跨语种的负面竞选活动分类新颖办法。我们使用十种语言的基准数据集,表明LLMs与讲本地语的人类代码员有同等的成绩,并且优于常规监督的机器学习方法。第二,我们利用这一新颖方法对迄今为止最大的负面竞选活动进行跨国研究,分析了2017年至2022年期间19个欧洲国家议员张贴的1 800万条推文。研究结果显示了一贯的跨国模式:执政党派不太可能使用负面讯息,而意识形态极端和民粹主义的政党 – – 特别是极右派政党 – – 则参与相当高的否定性。这些研究结果增进了我们对党级特征如何影响多党制战略沟通的理解。更广泛而言,研究显示LMs有可能在语言和文化背景的政治沟通中进行可扩展、透明和可复制的研究。


Article 121

Title@2025-07-23 (3): WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

Title: WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training WSM: Decay-Free Learning Rate Scheduling via Checkpoint Merging für LLM Pre-Training WSM:通过LLM培训前的检查站合并,制定无下降的学习率表 2507.17634v1

Authors (10): Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou

Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM’s potential for long-term model refinement.

最近的学习率(LR)列表进展显示了消除传统衰变阶段同时保持竞争性绩效的无腐化方法的有效性。示范合并技术是这一领域的特别有希望的解决办法。我们介绍的是温度和合并(WSM),这是一个在学习率衰变和模式合并之间建立正式联系的总框架。世界学习率(WSM)为模拟各种衰变战略提供了统一的理论基础,包括共弦衰减、线性衰变和反平方根衰变平均模式,同时仍然与多种优化方法完全兼容。我们通过广泛的实验,确定检查站合并培训窗口是影响模型性能的最关键因素,超越了检查站间隔和合并数量的重要性。我们的框架始终超越了广泛采用的WSDSD(WD)方法的多重基准,大大改进了MATH的+3.5%、HumanEval的+2.9%和MLU-Pro的+5.5%。绩效优势扩大到监督的微调情景,突出WSMU的长期模型改进潜力。
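
The core operation behind checkpoint merging can be sketched as a weighted average of parameters saved during the stable phase. This is a minimal illustration under assumptions, not the WSM implementation: checkpoints are represented as plain dictionaries of float lists, and the function name `merge_checkpoints` is hypothetical. The specific weights that emulate cosine, linear or inverse-square-root decay are derived in the paper and are not reproduced here.

```python
from typing import Dict, List

# Assumed toy representation: parameter name -> flat list of floats.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(checkpoints: List[Checkpoint], weights: List[float]) -> Checkpoint:
    """Return the weighted average of several checkpoints.

    Weights are normalised to sum to 1, so equal weights give a plain mean.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint and at least one checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    merged: Checkpoint = {}
    for name, first_values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(first_values))
        ]
    return merged


if __name__ == "__main__":
    window = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(merge_checkpoints(window, [1.0, 1.0]))
```

The number of checkpoints passed in, together with how far apart they were saved, determines the merge duration that the abstract identifies as the most influential factor.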


Article 122

Title@2025-07-23 (3): Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Title: Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion Conan:一个零热适应性语音转换的中远在线网络 2507.14534v2

Authors (3): Yu Zhang, Baotong Tian, Zhiyao Duan

Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.

零点在线语音转换(VC)为实时通信和娱乐带来了巨大的希望。然而,当前的 VC 模型在实时限制下努力维护语义真实性,提供自然声音转换,并有效地适应隐性扬声器特性。为了应对这些挑战,我们引入了Conan, 这是一种粗略的在线零点声音转换模式,它保存源的内容,同时匹配音调和参考演讲的风格。 Conan 由三个核心部分组成:1) 一种流体内容提取器,它利用Emexex对低纬度流流内容进行编码;2) 一种调制风格编码器,它从参考演讲中提取精细的发光的文理学特征,用于强化风格适应;3) 一种Causal Shuffle Vocoder,它使用像素-shuffle机制来实施完全因果的HIFIGAN。实验性评估表明, Conan 在主观和客观的计量标准中超越基线模型。音样样本见https://aronz345.github.io/ConanDemo。


Article 123

Title@2025-07-23 (3): A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)

Title: A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE) Hybrider Früh-Exit-Algorithmus für große Sprachmodelle auf Basis von Space Alignment Decoding (SPADE) 以空间调整编码为基础的大语言模型混合早期出界比值(SPADE) 2507.17618v1

Authors (4): Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang

Large language models are computationally expensive due to their deep structures. Prior research has shown that intermediate layers contain sufficient information to generate accurate answers, leading to the development of early-exit algorithms that reduce inference costs by terminating computation at earlier layers. However, these methods often suffer from poor performance due to misalignment between intermediate and output layer representations, which leads to decoding inaccuracy. To address these challenges, we propose SPADE (SPace Alignment DEcoding), a novel decoding method that aligns intermediate layer representations with the output layer by propagating a minimally reduced sequence consisting of only the start token and the answer token. We further optimize the early-exit decision-making process by training a linear approximation of SPADE that computes entropy-based confidence metrics. Putting them together, we create a hybrid early-exit algorithm that monitors confidence levels and stops inference at intermediate layers while using SPADE to generate high-quality outputs. This approach significantly reduces inference costs without compromising accuracy, offering a scalable and efficient solution for deploying large language models in real-world applications.

先前的研究显示,中间层含有足够的信息以得出准确的答案,从而导致开发早期退出算法,通过终止先前层次的计算来降低推断成本。然而,由于中间层和产出层的表达方式不协调,导致不准确的解码,这些方法往往表现不佳。为了应对这些挑战,我们建议SPADE(SPace对齐脱码)是一种新型的解码方法,它通过推广一个最小的简化序列,仅由起始符号和答案符号组成,将中间层的表示与产出层统一起来。我们进一步优化早期退出决策过程,培训SPADE的线性近似法,该直线性近似法计算了基于加密的置信度指标。把这些方法结合起来,我们创建一种混合的早期退出算法,监测信任水平并阻止中间层的推断,同时使用SPADE来产生高质量的产出。这个方法在不降低准确性的前提下大幅度降低推断成本,为在现实应用中部署大型语言模型提供可缩放的有效解决方案。
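
The entropy-based exit decision described above can be sketched in a few lines. This is an illustration under assumptions rather than the SPADE implementation: the per-layer logits are taken as given (in the paper they would come from the trained linear approximation), and the function name `first_confident_layer` and the threshold `max_entropy` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_confident_layer(layer_logits: Sequence[Sequence[float]], max_entropy: float) -> int:
    """Return the index of the first layer whose prediction entropy is low enough.

    Falls back to the last layer when no intermediate layer is confident.
    """
    for index, logits in enumerate(layer_logits):
        if entropy(softmax(logits)) <= max_entropy:
            return index
    return len(layer_logits) - 1


if __name__ == "__main__":
    logits_per_layer = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [10.0, 0.0, 0.0]]
    print(first_confident_layer(logits_per_layer, max_entropy=0.1))
```

Low entropy means the distribution is concentrated on few tokens, so a lower threshold trades more computation for more caution.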


Article 124

Title@2025-07-23 (3): Multi-Level Explanations for Generative Language Models

Title: Multi-Level Explanations for Generative Language Models Mehrstufige Erklärungen für generative Sprachmodelle 产生语言模式的多层次解释 2403.14459v2

Authors (11): Lucas Monteiro Paes, Dennis Wei, Hyo Jin Do, Hendrik Strobelt, Ronny Luss, Amit Dhurandhar, Manish Nagireddy, Karthikeyan Natesan Ramamurthy, Prasanna Sattigeri, Werner Geyer, Soumya Ghosh

Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.

尽管越来越多地使用大型语言模型(LLMS)来进行概括和问答等背景工作,但了解是什么使LLM产生某种反应具有挑战性。我们提议对产生语言模型(MExGen)进行多层次解释,这是为背景生成文本提供解释的一种方法。MExGen将分数分配给部分背景工作,以量化其对模型输出的影响。它将LIME和SHAP等归属方法推广到背景任务中使用的LLMS,其中(1) 推断成本高,(2) 输入文本长,(3) 输出为文本。我们从自动化和人为角度对基于渗透的归属方法进行系统评估,以总结和回答问题。结果显示,我们的框架能够提供比现有替代方法(包括LLOM自我解释)更准确的对产出的解释。我们为MExGen提供的公开源代码,作为ICX360工具包的一部分:https://github.com/IBM/ICX360。
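
Perturbation-based attribution, the family of methods the abstract evaluates, can be illustrated with its simplest member, leave-one-out scoring. This is a swapped-in technique chosen for brevity, not the MExGen algorithm (which extends LIME and SHAP). The function name `leave_one_out_scores` and the toy `score_fn` are hypothetical; in practice the score would measure how well the model's original output is supported by the perturbed context.

```python
from typing import Callable, List


def leave_one_out_scores(
    parts: List[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Attribute a score to each context part by removing it and measuring the drop.

    A large positive value means the output depends strongly on that part.
    """
    full_score = score_fn(parts)
    return [
        full_score - score_fn(parts[:i] + parts[i + 1:])
        for i in range(len(parts))
    ]


if __name__ == "__main__":
    context = ["Paris is the capital of France.", "The weather is mild."]

    # Toy stand-in for a model-based scorer: 1.0 if the answer is still supported.
    def toy_score(remaining: List[str]) -> float:
        return 1.0 if any("Paris" in part for part in remaining) else 0.0

    print(leave_one_out_scores(context, toy_score))
```

Leave-one-out needs one scoring call per context part plus one for the full context, which hints at why inference cost and input length are the constraints the abstract singles out.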


Article 125

Title@2025-07-23 (3): Dual-branch Prompting for Multimodal Machine Translation

Title: Dual-branch Prompting for Multimodal Machine Translation Dual-Branch Prompting für multimodale maschinelle Übersetzung 多式联运机器翻译的双分支提示 2507.17588v1

Authors (6): Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.

多式机器翻译(MMT)通常通过纳入一致的视觉特征而加强只文本翻译。尽管取得了显著的进展,最先进的MMT方法往往在推断时依赖配对图像文本投入,并且对不相关的视觉噪音敏感,这限制了其稳健性和实际适用性。为了解决这些问题,我们提议D2P-MMT,这是一个基于扩散的双分支促进框架,用于稳健的视觉引导翻译。具体地说,D2P-MMT仅需要源文本和由预先训练的传播模式产生的再版图像,该模式在保存语义提示的同时,自然过滤了视觉细节的转移。在培训期间,该模型使用双分支提示战略,共同学习真实和再版图像,鼓励丰富的跨模式互动。为了缩小模式差距和减少培训-推断差异,我们引入了分布协调损失,使两个分支的产出分布更加一致。MMTMT在MTM上进行了广泛的实验,表明D2P-MT实现了与现有状态方法相比的更高级翻译性。
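
A distributional alignment loss between two branches can be sketched as a divergence between their output token distributions. The abstract does not say which divergence is used, so the symmetric KL divergence below is an assumption, as are the function names and the `eps` smoothing constant.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats, with a small eps to avoid division by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def alignment_loss(p_authentic: Sequence[float], p_reconstructed: Sequence[float]) -> float:
    """Symmetric KL between the two branches' output distributions (assumed form)."""
    return 0.5 * (
        kl_divergence(p_authentic, p_reconstructed)
        + kl_divergence(p_reconstructed, p_authentic)
    )


if __name__ == "__main__":
    authentic_branch = [0.7, 0.2, 0.1]
    reconstructed_branch = [0.6, 0.3, 0.1]
    print(alignment_loss(authentic_branch, reconstructed_branch))
```

The loss is zero when both branches predict the same distribution and grows as they diverge, which is the consistency property the abstract describes.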


Article 126

Title@2025-07-23 (3): GenSelect: A Generative Approach to Best-of-N

Title: GenSelect: A Generative Approach to Best-of-N GenSelect: Ein generativer Ansatz zum Best-of-N GenSect: 产生最佳N型的方法 2507.17797v1

Authors (5): Shubham Toshniwal, Ivan Sorokin, Aleksander Ficek, Ivan Moshkov, Igor Gitman

Generative reward models with parallel sampling have enabled effective test-time scaling for reasoning tasks. Current approaches employ pointwise scoring of individual solutions or pairwise comparisons. However, pointwise methods underutilize LLMs’ comparative abilities, while pairwise methods scale inefficiently with larger sampling budgets. We introduce GenSelect, where the LLM uses long reasoning to select the best solution among N candidates. This leverages LLMs’ comparative strengths while scaling efficiently across parallel sampling budgets. For math reasoning, we demonstrate that reasoning models, such as QwQ and DeepSeek-R1-0528, excel at GenSelect, outperforming existing scoring approaches with simple prompting.

具有平行抽样的创用奖励模式使得能够有效地测试推理任务的时间比例。目前的方法采用有分数的个别解决办法评分或对称比较。但是,有分数的方法没有充分利用LLMs的比较能力,而有分数的方法与较大的采样预算相比却没有效率。我们引入了GenSelect, LLM利用长期推理在N候选人中选择最佳解决办法。这在平行采样预算之间利用LLMs的相对优势,同时有效推广。关于数学推理,我们证明,QwQ和DeepSeek-R1-0528等推理模型优于GenSelect,以简单快速的方式优于现有的评分方法。
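
The scaling argument can be made concrete by counting judge calls for N candidates. The sketch below is illustrative only: it assumes pairwise methods compare all pairs, treats GenSelect as a single prompt containing every candidate, and uses a hypothetical prompt wording that is not taken from the paper.

```python
from typing import Dict, List


def judge_calls(n: int) -> Dict[str, int]:
    """Number of judge-model calls needed to rank n candidates under each scheme."""
    return {
        "pointwise": n,                          # one score per candidate
        "pairwise_all_pairs": n * (n - 1) // 2,  # every unordered pair compared once
        "single_selection_prompt": 1,            # all candidates in one context
    }


def build_selection_prompt(problem: str, candidates: List[str]) -> str:
    """Assemble one prompt that asks the model to pick the best candidate."""
    lines = [f"Problem: {problem}", ""]
    for index, candidate in enumerate(candidates, start=1):
        lines.append(f"Solution {index}: {candidate}")
    lines.append("")
    lines.append(
        f"Reason step by step, then answer with the index (1-{len(candidates)}) "
        "of the best solution."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(judge_calls(8))
    print(build_selection_prompt("What is 2 + 2?", ["4", "5"]))
```

The single-prompt form keeps the call count flat but makes the prompt grow with N, so context length becomes the limiting resource instead.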


Article 127

Title@2025-07-23 (3): Synthetic Voice Data for Automatic Speech Recognition in African Languages

Title: Synthetic Voice Data for Automatic Speech Recognition in African Languages Synthetische Sprachdaten zur automatischen Spracherkennung in afrikanischen Sprachen 非洲语言自动语音识别合成声音数据 2507.17578v1

Authors (4): Brian DeRenzi, Anna Dixon, Mohamed Aymane Farhi, Christian Resch

Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data yielded the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved by about 6.5% relative with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.

2300多种非洲语言中大多数2300多种语言都无法获得语音技术。我们为非洲ASR首次对大规模合成合成声音公司进行系统评估。我们采用三步程序:LLM驱动的文本制作、TTS语音合成和ASR微调。我们制作合成文本的10种语言中,有8种语言在7个语言中实现了5个以上的可读性分数。我们评估了三种语言(豪萨、多卢奥、奇切瓦)的ASR改进,创造了2 500多小时合成声音数据,低于真实数据成本的1%。微调的Wav2Vec-BERT-2.0模型在250小时实际和250小时合成Hausa上经过精细调培训,符合500小时只使用实际数据的基线,而579小时和450-993小时合成数据则创造了最佳的绩效。我们还对三种语言(豪萨、杜卢奥、奇切瓦、奇切瓦、WER)的改进了约6.5%的合成声音数据,低于实际合成数据与合成数据比率的1:2;D1.1%的比例表明Dhuluo的合成合成Hausa(Houluo)模型与一些可靠数据的精确数据进行了类似的改进,但对数据进行了更精确的更新。


Article 128

Title@2025-07-23 (3): Fairness Evaluation of Large Language Models in Academic Library Reference Services

Title: Fairness Evaluation of Large Language Models in Academic Library Reference Services Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten 学术图书馆参考资料服务大语言模型公平评价 2507.04224v2

Authors (8): Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian

As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

随着图书馆探索用于虚拟参考服务的大型语言模式(LLMs),产生了一个关键问题:LLMs能否公平地为所有用户服务,而不论其人口或社会地位如何?LLMs提供了巨大的扩展支持潜力?LMs还可能复制其培训数据中所包含的社会偏见,这有可能损害图书馆对公平服务的承诺的完整性。为了解决这一问题,我们评估LLMs是否通过促使六个最先进的LLMs协助不同性别、种族/族裔和机构作用的赞助者,从而在不同用户身份之间作出不同的反应。我们没有发现种族或族裔差异的证据,只有一个模式中存在对妇女的陈规定型偏见的微小证据。LLMs通过使用与形式、礼貌和特定领域的词汇有关的语言选择,表现出对机构角色的细微包容,反映了专业规范而不是歧视性待遇。这些调查结果表明,目前LLMs表现出支持学术图书馆参考资料服务中公平和符合背景的沟通的良好准备程度。


Article 129

Title@2025-07-23 (3): BoSS: Beyond-Semantic Speech

Title: BoSS: Beyond-Semantic Speech Boss: Jenseits semantischer Sprache BOSSS:超语语言 2507.17563v1

Authors (11): Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li

Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions and contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. We evaluate BoSS-related attributes across five different dimensions, revealing that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.

人类交流涉及的不仅仅是明确的语义,其隐含的信号和背景提示在塑造含义方面发挥着关键作用;然而,现代语言技术,如自动语音识别和文本到语音(TTS),往往无法捕捉这些超越语义的层面。为了更好地描述和衡量语音智能的进展,我们引入了一个分级框架,以说明口语对话系统从基本指令识别到人性社会互动的演进。为了支持这些先进的能力,我们提议“超语语言(BOSS)”,它指的是语音通信中包含但超越明确语义的一套信息。它传递情感、背景以及改变或扩展含义,通过多方面的特征,如感官提示、背景动态和隐含的语义表达,从而增进对交流意图和情景的理解。我们为博SS提供了一个正式的框架,利用认知相关性理论和机器学习模型来分析时间和背景语言动态。我们从五个不同层面对博语系相关属性进行评估,揭示当前语言流学模式的深度分析需要超越了人类的更深层次的思维。


Article 130

Title@2025-07-23 (3): Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline

Title: Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline Auswirkungen von Aufklebern auf multimodale Sentiment und Intent in sozialen Medien: Eine neue Aufgabe, Datensatz und Ausgangslage 贴标签者对社会媒体多式联运和意向的影响:新任务、数据集和基线 2405.08427v2

Authors (4): Yuanchen Shi, Biao Ma, Longyin Zhang, Fang Kong

Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other’s recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code are available on https://github.com/FakerBoom/MSAIRS-Dataset.

社交媒体越来越多地使用贴纸表达情绪和意图。尽管它们对情绪分析和意图识别有重大影响,但这一领域的研究很少。为了弥补这一差距,我们提出了一项新任务:涉及贴纸的多模态聊天情感分析与意图识别(Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers, MSAIRS)。此外,我们引入了一个新的多模态数据集,包含从多个主流社交媒体平台摘录的中文聊天记录和贴纸。我们的数据集包括文本相同但贴纸不同的配对数据、贴纸相同但上下文不同的配对数据,以及由相同图像配以不同文字组成的各种贴纸,使我们能够更好地了解贴纸对聊天情绪和意图的影响。我们还提出了一个有效的多模态联合模型MMSAIR,其特点是差分向量构建和级联注意力机制,以增强多模态融合。我们的实验表明了联合建模情感和意图的必要性和有效性,因为二者能够相互提升识别准确率。MMSAIR显著优于传统模型和先进的MLLM,表明了社交媒体中贴纸理解的挑战性和独特性。我们的数据集和代码见 https://github.com/FakerBoom/MSAIRS-Dataset。


Article 131

Title@2025-07-23 (3): From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment

Title: From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment Von Neuronen zur Semantik: Bewertung der Cross-Linguistic Alignment Fähigkeiten großer Sprachmodelle über Neuronen Alignment 从中世纪到语义学:通过中世纪对齐评估大语言模型的跨语言一致能力 2507.14900v2

Authors (5): Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi

Large language models (LLMs) have demonstrated remarkable multilingual capabilities; however, how to evaluate cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily focus on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impairs semantic alignment evaluation on low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose a novel Neuron State-Based Cross-Lingual Alignment (NeuronXA) to assess the cross-lingual alignment capabilities of LLMs, which offers a more semantically grounded approach to assessing cross-lingual alignment. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results demonstrate that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task performance and 0.8514 with transferability. These findings demonstrate NeuronXA’s effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset. This highlights its potential to advance cross-lingual alignment research and to improve the semantic understanding of multilingual LLMs.

大型语言模型(LLMS)显示了非凡的多语种能力,然而,如何评价跨语种的校准仍然没有得到充分利用。现有的校准基准主要侧重于判决嵌入,但先前的研究显示,神经模型往往产生一个非移动的表达空间,这种空间是语义校准评价对低资源语言的影响。在类似信息引发神经神经区域重叠的神经科学研究结果的启发下,我们提议一个新的中伦国家跨语言校准(NeuronXA)来评估LMS的跨语种的校准能力(NeronXA),这种能力为评估跨语种校准提供了更具有语义基础的方法。我们评估神经XA关于若干突出的多语种LMS(LLAMA、Quen、Mistral、GLM和OLMO)的NeronXA(LXA)在两个传输任务和三个多语种基准方面的影响。结果显示,NeuronXA(NeuronXA)在仅有100对平行的判刑配对,与下游任务表现为0.9556,Pearson与可转让性为0.8514。这些结论显示NexA在评估跨语言校准和跨语言校准方面对的可能性方面的有效性。


Article 132

Title@2025-07-23 (3): Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

Title: Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction Rede als multimodaler digitaler Phenotyp für Multi-Task LLM-basierte psychische Gesundheitsvorhersage 作为多任务LLM基于心理健康预测的多种模式数字哲学型演讲 2505.23822v3

Authors (8): Mai Ali, Christopher Lucasius, Tanmay P. Patel, Madison Aitken, Jacob Vorstman, Peter Szatmari, Marco Battaglia, Deepa Kundur

Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions’ progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than that of each of the unimodal, single-task, and non-longitudinal methods.

这是一种非侵入性的数字式话语,可以对心理健康状况提供有价值的洞察力,但往往被当作一种单一模式。相反,我们提议将患者言语数据作为治疗抑郁症的三角多媒体数据源。本研究探索了大型语言模型结构的潜力,用于在综合语音文本、声标和声标的生物标志的多式联运制度中进行言语抑郁症预测。青少年抑郁症是一个重大挑战,而且往往与自杀思想和睡眠干扰等多种疾病相交。这提供了将多任务学习(MTL)纳入我们研究的又一个机会,同时预测抑郁症、自杀性思维和睡眠干扰,同时使用多式联运公式进行预测。我们还提出了一个纵向分析战略,以模拟跨多种临床互动的时间变化,从而能够全面了解各种条件的演变。我们提出的方法以三模式、纵向MTL为主,在抑郁症预警数据集上进行了评估。它实现了70.8%的平衡精度,高于单式、单型、单型和非纵向方法。


Article 133

Title@2025-07-23 (3): URPO: A Unified Reward & Policy Optimization Framework for Large Language Models

Title: URPO: A Unified Reward & Policy Optimization Framework for Large Language Models URPO: Ein einheitliches Reward & Policy Optimization Framework für große Sprachmodelle URPO:大语言模式统一奖励和政策优化框架 2507.17515v1

Authors (4): Songshuo Lu, Hua Wang, Zhi Chen, Yaohua Tang

Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) within a single model and a single training phase. Our method recasts all alignment data (including preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO’s superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.

大规模对齐流程通常将策略模型与单独训练的奖励模型配对,后者的参数在强化学习(RL)期间保持冻结。这种分离造成了复杂且资源密集的流程,并因静态奖励信号而存在性能上限。我们提出一个新框架,即统一奖励与策略优化(URPO),在单一模型和单一训练阶段内统一指令遵循(“玩家”)和奖励建模(“裁判”)。我们的方法将所有对齐数据(包括偏好对、可验证推理和开放式指令)重构为统一的生成式格式,并由单一的组相对策略优化(GRPO)循环进行优化。这使模型能够从真实偏好和可验证逻辑中学习,同时为开放式任务生成自己的奖励。在Qwen2.5-7B模型上的实验显示了URPO的优越性。我们的统一模型显著优于使用单独生成式奖励模型的强基线,将AlpacaEval上的指令遵循得分从42.24提高到44.84,综合推理平均分从32.66提高到35.66。此外,URPO在训练中附带培养出更优的内部评估器,RewardBench得分达到85.15,超过了它所取代的专用奖励模型(83.55)。通过消除对单独奖励模型的需求,并促进生成与评估之间的协同进化,URPO为实现稳健对齐的语言模型提供了一条更简单、更高效、更有效的路径。
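As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes each reward against the mean and standard deviation of its own group of sampled responses. The function name, the epsilon value, and the sample rewards are assumptions made for illustration; they are not taken from the URPO paper, and the full GRPO objective (clipping, KL terms) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed for the baseline.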


Article 134

Title@2025-07-23 (3): DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD

Title: DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD DNT: ein tief normalisierter Transformer, der von Momentum SGD trainiert werden kann DNT:一种可接受 “ 动力 “ SPGD培训的 “ 高度正常化 “ 变异器 2507.17501v1

Authors (7): Xianbiao Qi, Marco Chen, Wenjie Xiao, Jiaquan Ye, Yelin He, Chun-Guang Li, Zhouchen Lin

Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with an adaptive learning rate, such as AdamW, rather than momentum SGDW (mSGDW). Previous works show that this is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding performance comparable to Transformers trained via AdamW. Specifically, in DNT, we strategically integrate normalization techniques at proper positions in the Transformers to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus concentrate the distributions of gradients. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.

改革者已成为现代深层学习的事实上的支柱,然而,他们的培训通常要求一种先进的优化,适应性学习率,如AdamW,而不是SGDW(MSDW)的动力。以前的工作表明,这主要是由于梯度的分布非常繁琐。在本文中,我们引入了一种高度正常化的变异器(DNT),它经过精心设计,以克服这一局限性,使香草混凝固的变异器能够进行无缝的培训,同时使通过亚当W培训的变异器产生类似的性能。具体地说,在DNT中,我们从战略上整合了变异器的适当位置上的正常化技术,以有效调节每个层的雅各基质矩阵,平衡重量、活化及其相互作用的影响,从而使得梯度的分布得以集中。我们为我们DNT使用的正常化技术提供了理论上的理由,并对两种流行的变异器结构进行了广泛的经验评价,以证实:(a) DNT比其对应方(\,VIT和GPT)和b)DNT可以有效地与Villa MSD进行训练。
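For readers unfamiliar with the optimizer named in the DNT abstract, here is a minimal sketch of one update step, assuming mSGDW denotes momentum SGD with decoupled weight decay (in the style of SGDW). The abstract does not spell out the update rule, so the formulation, the function name, and the default hyperparameters below are assumptions for illustration.

```python
def msgdw_step(weights, grads, velocity, lr=0.1, momentum=0.9, weight_decay=0.01):
    """One assumed mSGDW step: momentum buffer plus decoupled weight decay."""
    # Accumulate the gradient into the momentum buffer.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # Decay is applied directly to the weights, not folded into the gradient.
    new_weights = [
        w - lr * v - lr * weight_decay * w
        for w, v in zip(weights, new_velocity)
    ]
    return new_weights, new_velocity


# Hypothetical single-parameter example.
w, v = msgdw_step([1.0], [0.5], [0.0])
```

Unlike AdamW, this update uses a single global learning rate with no per-parameter scaling, which is why heavy-tailed gradients are a concern for it.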


Article 135

Title@2025-07-23 (3): Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Title: Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants Lost in Variation? Bewertung der NLI-Performance in baskischen und spanischen geografischen Varianten 评价巴斯克和西班牙地理变异性国家LI绩效 2506.15239v2

Authors (3): Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.

在本文中,我们评估了当前语言技术理解巴斯克语和西班牙语品种的能力。我们把自然语言推论(NLI)作为一个主轴任务,并引入了巴斯克语和西班牙语以及各自变体的新颖的、手工制作的平行数据集。我们对使用只使用编码器和以解码器为基础的大语言模型(LLMS)进行的跨语言和内通学习实验的经验分析显示,在处理语言变异时,特别是在巴斯克,其性能下降。错误分析表明,这一下降并非由于词汇重叠,而是由于语言变异本身。进一步的膨胀实验表明,只使用编码器的模型特别与西巴斯克人(Western Basque)争斗,后者与确定周边方言方(例如西方)离标准更远的语言理论是一致的。所有数据和代码都可以公开查阅。


Article 136

Title@2025-07-23 (3): Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Title: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Pre-Training 推进藏藏人大语言模式,提供 “ 扩展数据 “ 和 “ 持续培训前 “ 。 2507.09205v3

Authors (17): Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

大型语言模式在许多语言方面取得了显著进步,但是,西藏作为一个有代表性的低资源语言,由于缺少高质量的培训公司,在现有模式中的代表性特别不足。为了解决这一差距,我们编纂了迄今为止最大的西藏培训前材料,汇集了来自不同来源的数据,并采用了专门为藏族量身定制的数据清理和处理管道。我们利用整理的数据,继续预/后培训多语言基础模式,以提高藏族的遗传能力。为了评估西藏模式的藏族能力,我们建立了高质量的西藏基准,并以现有的公共基准作为补充。实验结果表明,我们的模型一贯且显著地超越了类似规模的开放源模式和西藏定制模式在一系列广泛任务中的兼容性模式。


Article 137

Title@2025-07-23 (3): MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Title: MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs MultiNRC: Ein anspruchsvolles und eingeborenes, mehrsprachiges Bewertungsmaßstab für LLMs 多伦多挪威研究中心:对LLMs的质疑和土著多语种理由评估基准 2507.17476v1

Authors (8): Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing

Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs’ multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in the English language and culture. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, manually translated by native speakers fluent in English. This set of English equivalents can provide a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better in math reasoning in English than in the original languages (+10%), indicating persistent challenges with culturally grounded knowledge.

尽管最近大语言模型(LLMS)在英语推理基准方面显示出了快速的改进,但多民族研究中心在评估这些LLMS在不同语言和文化背景的多语言推理能力方面仍然有限,现有的多语言推理基准通常是通过翻译现有的英语推理基准来构建的,这些基准将这些基准偏向于英语/文化背景的推理问题;在这项工作中,我们引入了多语言土著多语言推理挑战(MultiNRC),这个基准旨在根据法语、西班牙语和中文母语的1 000多个本地语言、语言和文化依据推理问题来评估LMS的推理能力。多民族研究中心涵盖四个核心推理类别:语言特定语言推理、文字游戏和谜、文化/传统推理、具有文化相关性的数学推理等理。对于文化/传统推理和数学推理,我们还采用英语流利10种语言的手译方式提供等同的多语言问题英语翻译。这种英文等同形式可以直接比较其他语言的LMRM的推理能力与英语的推理学能力(我们系统地评价了14个主要LMS,在多语言推理学上大多数LMMM家庭仍以多语言推理的推理能力,在英语的推理学上不甚甚高的推理学,在50的推理学上显示不甚甚高的推理学。


Article 138

Title@2025-07-23 (3): WAKENLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Title: WAKENLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking WAKENLLM: Bewertung des Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking WAKNLLM: 通过精细基准评估LLMLM公司的合理合理潜力和稳定性 2507.16199v2

Authors (10): Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Chen Huang, Zhichao Hou, Xuming Hu

Large Language Models (LLMs) frequently output the label Unknown, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception. We therefore introduce a framework that quantifies the proportion of Unknown responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct Known or correct Unknown with valid reasoning. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. Having obtained a theoretical accuracy on reasoning tasks for different LLMs, we apply different methods to test whether each model can reach that accuracy given a baseline framework. Our work is meaningful in exploring the potential reasoning ability of LLMs and providing a new perspective on solving the Vague Perception phenomenon.

大型语言模型(LLMs)经常输出标签为未知的,但目前的评价几乎完全集中于这类答案是否诚实,而不是为什么出现。这模糊了两个不同的情况:(一) 一种投入是真正不确定的,而(二) 是模型未能解决的一个可溶解的问题。我们称这种现象为模糊概念。因此,我们引入了一个框架,对因模型缺乏能力而出现的未知反应的比例进行量化,并测试引导性刺激是否能够将其转换为正确的已知或正确的未知。通过区分这些不确定性的来源,我们的方法更清楚地描绘了LLM推理限度及其改进潜力。随着我们对不同LLMs推理任务的理论精确性,我们运用了不同的方法来测试模型能否达到基线框架给出的准确度。我们的工作在探索LMs的潜在推理能力以及提供解决Vague Perception现象的新视角方面是有意义的。


Article 139

Title@2025-07-23 (3): Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Title: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis Pseudo-Autoregressive Neural Codec-Sprachenmodelle für effiziente Null-Shot-Text-to-Speech-Synthese 高效零热文本对语音合成的优多-自动递减神经规范语言模型 2504.10352v2

Authors (13): Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen

Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.

近期的零样本文本转语音(TTS)系统面临一个共同的困境:自回归(AR)模型生成速度慢且缺乏时长可控性,而非自回归(NAR)模型缺乏时间建模,通常需要复杂的设计。在本文中,我们提出一种新颖的伪自回归(PAR)编解码器语言建模方法,统一了AR和NAR建模。PAR结合了AR的显式时间建模与NAR的并行生成,在固定时间步上生成动态长度的片段。在PAR的基础上,我们提出PALLE,这是一个两阶段TTS系统,先用PAR进行初始生成,再用NAR进行细化。在第一阶段,PAR沿时间维度逐步生成语音词元,每一步并行预测所有位置,但只保留最左侧的片段。在第二阶段,利用全局上下文信息,对低置信度词元进行并行迭代细化。实验表明,在LibriTTS上训练的PALLE在LibriSpeech test-clean测试集上,在语音质量、说话人相似度和可懂度方面优于在大规模数据上训练的最先进系统,包括F5-TTS、E2-TTS和MaskGCT,同时推理速度最高提升十倍。音频样本见 https://microsoft.com/research/project/vall-e-x/palle。


Article 140

Title@2025-07-23 (3): Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

Title: Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration Miipher-2: Ein universelles Sprachrestaurationsmodell für die Millionen-Stunden-Skala-Datenrestauration Mipher-2:百万小时规模数据恢复普遍语音恢复模式 2505.04457v4

Authors (6): Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.

本文介绍了Mipher-2,这是一个为百万小时比例数据设计的SR模型,用于对大型变异模型如大型语言模型等大规模变异模型进行数据清理培训。主要的挑战包括:对不显眼语言的概括化,在没有明确调节(如文本、扬声器ID)的情况下操作,以及计算效率。Mipher-2使用一种冷冻、预先训练的通用语音模型(USM),支持300多种语言,作为强健、不附加条件的特征提取器。为了优化效率和最大限度地减少记忆,Mipher-2采用平行的调适器,用于预测来自噪音输入的清洁USM特征,并使用波形合成的波形电动电动电动电动电解调调器。这些组件在3000小时多语言、工作室质量的录音中接受了培训,其变形作用有所增强,而USM参数保持不变。实验结果表明,Mipher-2在单调速率、扩音器相似性以及所有测试语言的客观和主观声音质量评分数。Mipher-2在所有测试语言中,仅利用100-小时的节能、近位语音处理器有效操作,在100个节制的节能压器上,在100个实际处理器上,在100-小时的节能处理器中,仅能处理一个节压器中,仅能性能性能。
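The throughput claim in the Miipher-2 abstract can be checked with a few lines of arithmetic. The sketch below assumes the real-time factor of 0.0078 applies per accelerator and that the workload parallelizes perfectly across all 100 devices; both are simplifying assumptions, and the function name is hypothetical.

```python
def wall_clock_days(audio_hours, real_time_factor, num_accelerators):
    """Estimate wall-clock days needed to process a speech corpus."""
    # Total accelerator-hours = audio duration x real-time factor.
    compute_hours = audio_hours * real_time_factor
    # Assume ideal, evenly balanced parallelism across devices.
    return compute_hours / num_accelerators / 24


days = wall_clock_days(1_000_000, 0.0078, 100)
print(f"{days:.2f} days")  # prints "3.25 days"
```

One million hours at a real-time factor of 0.0078 is 7,800 accelerator-hours, or 78 hours on 100 accelerators, which matches the "approximately three days" stated in the abstract.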


Article 141

Title@2025-07-23 (3): A Diagrammatic Calculus for a Functional Model of Natural Language Semantics

Title: A Diagrammatic Calculus for a Functional Model of Natural Language Semantics Ein diagrammatischer Kalkulus für ein funktionelles Modell der natürlichen Sprachsemantik 自然语言语义学功能模型的图表计算 2507.00782v2

Authors (1): Matthieu Pierre Boyer

In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressiveness of a more traditional denotation style. We will formalize a category-based type and effect system to represent the semantic difference between syntactically equivalent expressions. We then construct a diagrammatic calculus to model parsing and handling of effects, providing a method to efficiently compute the denotations for sentences.

在本文中,我们研究自然语言语义学的功能性编程方法,从而使我们能够提高更传统的批注风格的表达性。我们将正式确定一种基于分类的类型和效果系统,以代表语义等同的表达方式之间的语义差异。然后我们构建一个图解计算法,以模拟对语义的解析和处理,从而提供一种有效计算判决批注的方法。


Article 142

Title@2025-07-23 (3): MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models

Title: MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架 2505.23404v4

Authors (6): Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao, Wenmin Li

Recent advancements in adversarial jailbreak attacks have exposed critical vulnerabilities in Large Language Models (LLMs), enabling the circumvention of alignment safeguards through increasingly sophisticated prompt manipulations. Based on our experiments, we found that the effectiveness of jailbreak strategies is influenced by the comprehension ability of the attacked LLM. Building on this insight, we propose a capability-aware Multi-Encryption Framework (MEF) for evaluating vulnerabilities in black-box LLMs. Specifically, MEF first categorizes the comprehension ability level of the LLM, then applies different strategies accordingly: For models with limited comprehension ability, MEF adopts the Fu+En1 strategy, which integrates layered semantic mutations with an encryption technique, more effectively contributing to evasion of the LLM’s defenses at the input and inference stages. For models with strong comprehension ability, MEF uses a more complex Fu+En1+En2 strategy, in which additional dual-ended encryption techniques are applied to the LLM’s responses, further contributing to evasion of the LLM’s defenses at the output stage. Experimental results demonstrate the effectiveness of our approach, achieving attack success rates of 98.9% on GPT-4o (29 May 2025 release) and 99.8% on GPT-4.1 (8 July 2025 release). Our work contributes to a deeper understanding of the vulnerabilities in current LLM alignment mechanisms.

根据我们的实验,我们发现破狱战略的效力受到被攻击的LLM的认知能力的影响。我们根据这一洞察力,提出一个能觉悟的多加密框架,用于评估黑箱LMS的弱点。具体地说,MEF首先对LLM的理解能力水平进行分类,然后相应应用不同的战略:对于理解能力有限的模型,MEF采用Fu+EN1战略,该战略将分层的语义突变与加密技术相结合,更有效地帮助在投入和推断阶段规避LM的防御。对于具有强大理解能力的模式,MEF使用更复杂的Fu+EN1+EN2战略,其中对LMM的反应采用更多双层加密技术,从而进一步帮助在输出阶段规避LM的防御。实验结果显示我们的方法的有效性,将分层的语义突变与加密技术相结合,从而更有效地帮助LM的防御在投入和推断阶段规避LM的防御。对于具有很强的理解能力的模式,MF+E1+E2战略,其中对LM的反应是:在产出阶段进一步规避LM的防御。


Article 143

Title@2025-07-23 (3): Each to Their Own: Exploring the Optimal Embedding in RAG

Title: Each to Their Own: Exploring the Optimal Embedding in RAG Jeder für sich: Die optimale Einbettung in die RAG erkunden 探索在RAG中以最佳方式嵌入 2507.17442v1

Authors (3): Shiting Chen, Zijian Zhao, Jinsong Chen

Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the various embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.

最近,由于大语言模型(LLMS)对各个领域产生了根本性影响,因此将最新信息纳入LLMS或增加外部知识以构建特定领域模型的方法引起了广泛的关注。作为推推时间缩放方法的Retval-Auged Game (RAG) 以低成本和最小努力来调整参数而引人注目。然而,由于培训数据和模型结构各异,在RAG中采用的不同嵌入模式在各个领域产生不同的好处,往往导致不同的类似计算结果,因此,LLMS的反应质量也不同。为解决这一问题,我们提出和研究两种方法,通过结合多种嵌入模型(称为Mixtur-Embeding RAG和Confident RAG)的好处,加强RAG。M(M) 简单从基于标准化相似性的多个嵌入模型中选择检索方法;然而,它并不超越vanilla RAG。与此形成对比的是,CAG多次使用不同的嵌入模型,然后用最高信任级别选择对策,然后选择了RAG(VAA)的答案。
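To make the "standardized similarity" idea behind Mixture-Embedding RAG concrete, the following sketch pools retrieval hits from several embedding models after z-scoring each model's similarity scores. The abstract does not specify the standardization or merging rules, so the z-score, the max-pooling of duplicates, the function names, and the toy scores are all assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Z-score one model's similarity scores so models become comparable."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def mixture_retrieve(results_by_model, k=2):
    """Pool hits from all models and keep the k best standardized scores."""
    pooled = {}
    for hits in results_by_model.values():
        z_scores = standardize([score for _, score in hits])
        for (doc_id, _), z in zip(hits, z_scores):
            # If several models return the same document, keep its best score.
            pooled[doc_id] = max(z, pooled.get(doc_id, float("-inf")))
    ranked = sorted(pooled.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical raw similarities; model_b scores on a much lower scale.
results = {
    "model_a": [("d1", 0.90), ("d2", 0.80), ("d3", 0.10)],
    "model_b": [("d2", 0.30), ("d4", 0.35), ("d5", 0.05)],
}
top_docs = mixture_retrieve(results, k=2)
```

Without standardization, model_b's hits would never outrank model_a's because its raw scores are lower; after z-scoring, its best hit can compete on equal footing.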


Article 144

Title@2025-07-23 (3): Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging

Title: Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging Untersuchte subjektive Faktoren der Streitkraft: Geschichtenerzählen, Emotionen und Hedging 争议力量的主观调查因素: 故事、情感和上下行 2507.17409v1

Authors (3): Carlotta Quensel, Neele Falk, Gabriella Lapesa

In assessing argument strength, the notions of what makes a good argument are manifold. With the broader trend towards treating subjectivity as an asset and not a problem in NLP, new dimensions of argument quality are studied. Although studies on individual subjective features like personal stories exist, there is a lack of large-scale analyses of the relation between these features and argument strength. To address this gap, we conduct regression analysis to quantify the impact of subjective factors (emotions, storytelling, and hedging) on two standard datasets annotated for objective argument quality and subjective persuasion. As such, our contribution is twofold: at the level of contributed resources, as there are no datasets annotated with all studied dimensions, this work compares and evaluates automated annotation methods for each subjective feature. At the level of novel insights, our regression analysis uncovers different patterns of impact of subjective features on the two facets of argument strength encoded in the datasets. Our results show that storytelling and hedging have contrasting effects on objective and subjective argument quality, while the influence of emotions depends on their rhetoric utilization rather than the domain.

在评估争论的力度时,提出好论点的概念是多方面的。随着将主观性视为一种资产而不是国家劳工局的一个问题这一更广泛的趋势,正在研究争论质量的新层面。虽然对个人故事等个别主观特征的研究已经存在,但没有对这些特征与争论强度之间的关系进行大规模分析。为了缩小这一差距,我们进行回归分析,量化主观因素(美元)情绪、讲述故事和在两个标准数据集上对美元-美元的影响,并附加客观争论质量和主观说服的注释。因此,我们的贡献具有双重性:在贡献的资源层面,因为没有附加所有研究层面的数据集,这项工作比较和评价了每种主观特征的自动说明方法。在新发现层面,我们的回归分析揭示了主观特征对数据集中编码的争论强度的两个方面的不同影响模式。我们的结果显示,讲述和套用的故事和套用对客观和主观争论质量产生对比效应,而情感的影响则取决于其言辞的利用,而不是领域。


Article 145

Title@2025-07-23 (3): Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents

Title: Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents Millionen von $\text{GeAR}$-s: Erweiterung von GraphRAG auf Millionen von Dokumenten 百万美元/美元/GeAR}- 美元:将图图扩大至百万份文件 2507.17399v1

Authors (5): Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, Jeff Z. Pan

Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information – such as entities and their relations extracted from documents – to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and therefore, there is limited evidence of their general applicability across broader datasets. In this paper, we aim to adapt a state-of-the-art graph-based RAG solution, $\text{GeAR}$, and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.

最近的研究探索了以图表为基础的方法来检索增强的生成,利用结构化或半结构化的信息 – – 例如实体及其从文件中提取的关系 – – 加强检索,然而,这些方法的设计通常是为了处理具体的任务,例如多答题和以询问为重点的总结,因此,关于这些方法在更广泛的数据集中的一般适用性,证据有限。 在本文件中,我们的目标是调整以图表为基础的最新RAG解决方案:$\text{GeAR},并探索其对SIGIR 2025 LiveRAG挑战的绩效和局限性。


Article 146

Title@2025-07-23 (3): Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

Title: Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis Spracherkennung mit Hilfe der Minkowski-Norm: Identifikation durch Charakter Bigrams und Frequenzanalyse 以Minkowski Norm 手段进行语言探测:通过字符比格和频率分析进行识别 2507.16284v2

Authors (2): Paul-Andrei Pogăcean, Sanda-Maria Avram

The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism by leveraging monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.

近年来,关于语言识别的辩论重新引起注意,特别是AI-动力语言模式的迅速演变,但是,非AI型语言识别方法被掩盖了。这一研究探索了通过利用既有语言研究得出的单数和大梁频率排名,从数学上实施语言确定式算法。使用的数据集包括长度、历史时期和类型各不相同的文本,包括短故事、童话故事和诗。尽管存在这些差异,但这种方法在短于150个字符的文本上实现了80多份的精确度,在较长的文本上达到了100份的精确度。这些结果表明,传统的基于频率的方法仍然有效且可扩展,替代AI驱动的语言检测模式。
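A toy version of frequency-based language detection with a Minkowski distance might look like the sketch below. It compares relative character-bigram frequencies rather than the frequency rankings the abstract describes, and the reference sentences, function names, and the choice of order p=2 are all assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def bigram_profile(text):
    """Relative frequency of each character bigram (letters and spaces only)."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values()) or 1
    return {bigram: n / total for bigram, n in counts.items()}


def minkowski(profile_a, profile_b, p=2):
    """Minkowski distance of order p between two sparse bigram profiles."""
    keys = set(profile_a) | set(profile_b)
    total = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) ** p for k in keys)
    return total ** (1 / p)


def detect(text, references, p=2):
    """Return the reference language whose profile is closest to the text."""
    profile = bigram_profile(text)
    return min(references, key=lambda lang: minkowski(profile, references[lang], p))


# Tiny hypothetical reference corpora; real profiles need far more text.
references = {
    "en": bigram_profile("the quick brown fox jumps over the lazy dog and then the cat sat on the mat"),
    "de": bigram_profile("der schnelle braune fuchs springt ueber den faulen hund und die katze sitzt auf der matte"),
}
guess = detect("the dog and the cat", references)
```

Setting p=1 gives the Manhattan distance and p=2 the Euclidean distance, so the order of the norm is a tunable parameter of the comparison.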


Article 147

Title@2025-07-23 (3): Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models

Title: Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models Visualisierung von Policy-Reward-Interplay, um Nullth-Order-Preference-Optimierung von großen Sprachmodellen zu informieren 可视化政策回报互动功能,为大语言模型提供零分优先优化信息 2503.03460v2

Authors (4): Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif

Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO

优化 Zeroth- Order (ZO) 优化使用功能评估而不是梯度,减少记忆使用,但高维模型的趋同缓慢。因此,LLMS的ZO研究主要侧重于分类,忽视了更为复杂的基因化任务。在本文中,我们引入了ZOPrO,这是为普惠优化而设计的新型ZO算法。我们首先分析传统(一级)优化期间政策和奖励模式之间的相互作用,在相对更新中发现模式。根据这些洞察力,我们调整了同步渗透性软体适应(SPSA),采用有针对性的抽样战略来加速趋同。我们通过对合成、机器翻译和谈话助理的实验,我们展示了我们的方法始终在提高奖励信号,同时实现与一级方法相似的趋同时间。虽然在传统(一级) 优化政策模式中缺少某些状态的优化,在相对更新中揭示了模式。根据这些洞察力,我们调整了同步渗透性软体适应方法,我们的工作主要是将常规/直观化法 。
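The zeroth-order idea can be illustrated with plain SPSA, which estimates a gradient from only two loss evaluations regardless of the number of parameters. The sketch below is standard SPSA with random sign perturbations on a toy quadratic; it does not implement ZOPrO's targeted sampling strategy, and the step sizes, names, and test function are assumptions for illustration.

```python
import random


def spsa_gradient(loss_fn, params, c=1e-3, rng=random):
    """Estimate the gradient from two loss evaluations (no backpropagation)."""
    # Perturb every coordinate simultaneously with a random sign.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    diff = loss_fn(plus) - loss_fn(minus)
    return [diff / (2.0 * c * d) for d in delta]


def quadratic(x):
    """Toy loss with its minimum at the origin."""
    return sum(v * v for v in x)


random.seed(0)
params = [1.0, -2.0, 0.5]
initial_loss = quadratic(params)
for _ in range(200):
    grad = spsa_gradient(quadratic, params)
    params = [p - 0.05 * g for p, g in zip(params, grad)]
final_loss = quadratic(params)
```

Only function evaluations are needed, so no activations have to be stored for a backward pass; the trade-off is a noisy estimate whose variance grows with dimensionality, which is the slow convergence the abstract refers to.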


Article 148

Title@2025-07-23 (3): TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

Title: TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition TransLPRNet: Lite Vision-Language Network für die Single/Dual-Line-Erkennung der chinesischen Lizenzschilde TransLPRNet:中国单线/双线许可证牌照识别利于视觉-语言网络 2507.17335v1

Authors (4): Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system’s recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.

开放环境中的车牌识别在多个领域有广泛应用;然而,车牌类型和成像条件的多样性带来了重大挑战。为了解决基于CNN和CRNN的方法在车牌识别中遇到的局限,本文提出一种统一的解决方案,在针对中国单行和双行车牌定制的预训练框架内,将轻量级视觉编码器与文本解码器相结合。为缓解双行车牌数据集稀缺的问题,我们通过合成图像、将纹理映射到真实场景并与真实车牌图像混合,构建了一个单行/双行车牌数据集。此外,为提高系统的识别准确率,我们引入了透视校正网络(PTN),它以车牌角点坐标回归作为隐变量,并由车牌视角分类信息进行监督。该网络具有更好的稳定性和可解释性,且标注成本低。所提算法在粗定位扰动下,于校正后的CCPD测试集上取得了99.34%的平均识别准确率;在精定位扰动下,准确率进一步提高到99.58%。在双行车牌测试集上,平均识别准确率达到98.70%,处理速度最高可达每秒167帧,显示出很强的实用性。


Article 149

Title@2025-07-23 (3): Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

Title: Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies Auf dem Weg zur Erkennbarkeit von Überzeugungen in sozialen Medien: Von der Modellentwicklung zu Erkenntnissen über Überzeugungsstrategien 探索社会媒体的观察:从示范发展到观察社会媒体的观察 2503.13844v2

Authors (6): Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model’s practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

政治广告在形成公众舆论和影响选举结果方面起着关键作用,通常是通过在更广泛的宣传战略中嵌入的微妙的说服技巧。发现这些有说服力的因素对于提高选民意识和确保民主进程的透明度至关重要。本文件介绍了一种综合方法,通过两个相互关联的研究将模式发展和现实世界的应用连接起来。首先,我们引入了一种轻量化的说服性文本检测模型,在SemEval 2023任务3的Subtask 3中实现最先进的表现,同时需要比现有方法少得多的计算资源和培训数据。第二,我们通过收集2022年澳大利亚联邦选举的Facebook广告数据集(APA22),部分说明说服因素,并微调模型,从主流新闻到社会媒体内容的适应。然后我们采用微调模型,将APA22数据集的其余部分贴上标签,揭示政治运动如何通过不同的筹资战略、文字选择、人口目标以及选举日方法的瞬间转变说服力等不同模式。我们的结论不仅突出了分析社会媒体说服力的域模型的必要性,而且还展示了如何提高透明度。


Article 150

Title@2025-07-23 (3): Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Title: Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation Lernen, rationale Beweise durch Verstärkungslernen für die Retrieval-Augmented Generation zu extrahieren 通过强化学习为检索增强生成学习提取合理证据 2507.15586v2

Authors (7): Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang

Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noise significantly impacts the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence directly, without explicit thinking, which risks filtering out key clues and generalizes poorly. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within the retrieved content first, and then (2) consciously extracting evidence so as not to omit any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction as one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions (answer, length, and format) to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, which provides compact and high-quality evidence, improves the accuracy of downstream tasks, and promotes effective application in online RAG systems.

(a) “检索-推荐一代”(RAG)有效地提高了大语言模型(LLMs)的准确性。然而,检索噪音对LLMs的生成质量有重大影响,需要开发脱落机制。以前的方法直截了当地提取证据,而没有明确的思考,有可能过滤关键线索和一般化斗争。为此,我们建议LEAR学习通过(1) 明确推理,首先在检索内容中找出潜在的线索,然后(2) 有意识地提取,以避免遗漏任何对回答问题有帮助的关键线索。具体地说,我们为终端到终端培训制定证据推理和证据提取为一种统一反应;应用知识符号面罩解析,以得出基于推理和提取的答案;设计三种可核查的奖励功能,包括答案、长度和格式,以便通过政策优化算法更新模型。关于三个基准数据集的广泛实验显示LEARE、提供压缩和高质量的证据、提高下游任务的准确性以及促进在线RAG系统的有效应用。
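The abstract names three verifiable reward types (answer, length, format) without giving their definitions. The sketch below shows one plausible shape such rewards could take. The `<reason>`/`<extract>` tag names, the length formula, and the weights are all assumptions made for illustration, not LEAR's actual design.

```python
import re

# Hypothetical response template; the tag names are illustrative, not from the paper.
PATTERN = re.compile(r"^<reason>(.+?)</reason>\s*<extract>(.+?)</extract>$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the reason-then-extract template, else 0.0."""
    return 1.0 if PATTERN.match(response.strip()) else 0.0

def length_reward(evidence: str, context: str) -> float:
    """Higher when the extracted evidence is short relative to the retrieved context."""
    if not context:
        return 0.0
    ratio = len(evidence.split()) / max(len(context.split()), 1)
    return max(0.0, 1.0 - ratio)

def answer_reward(predicted: str, gold: str) -> float:
    """Exact-match check on the downstream answer (case-insensitive)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def total_reward(response, context, predicted, gold, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three rewards; the weights are an assumed choice."""
    match = PATTERN.match(response.strip())
    evidence = match.group(2) if match else ""
    w_ans, w_len, w_fmt = weights
    length = length_reward(evidence, context) if match else 0.0
    return (w_ans * answer_reward(predicted, gold)
            + w_len * length
            + w_fmt * format_reward(response))
```

All three signals can be checked programmatically, without a learned reward model, which is what "verifiable" means in this setting.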


Article 151

Title@2025-07-23 (3): Cautious Next Token Prediction

Title: Cautious Next Token Prediction Vorsichtige Next-Token-Vorhersage 谨慎的下一个token预测 2507.03038v2

Authors (10): Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu

The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about the test questions. To this end, we propose a brand new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we independently sample multiple trials starting from that step and stop each when it encounters any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model’s capacity. The number of trials is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach consistently outperforms existing standard decoding strategies by a clear margin. Moreover, integrating CNTP with self-consistency further improves over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.

在LLM时代,下一个token预测范式一直是自回归模型的主流。目前流行LLM的默认采样选择是温度缩放配合核采样,以平衡多样性和连贯性。然而,当模型对测试问题没有把握时,这种方法会导致各种NLP任务的表现较差。为此,我们提出了一种全新的免训练解码策略,称为谨慎的下一个token预测(CNTP)。在解码过程中,如果模型在某一步的预测熵相对较高,我们就从该步开始独立采样多条试验路径,并在遇到任何标点符号时停止。然后,我们选择困惑度最低的试验路径,将其视为在模型能力范围内最可能、最可靠的路径。试验次数与预测置信度负相关,即模型越不自信,采样的试验就越多。这与人类的行为一致:当感到不确定或缺乏信心时,人们往往会更有创造性地思考,探索多条思路,再谨慎地选择自己最有把握的一条。在LLM和MLLM上的大量实验表明,我们提出的CNTP方法始终以明显优势优于现有的标准解码策略。此外,将CNTP与自洽性(self-consistency)结合,可以在原始自洽性的基础上进一步提升效果。我们相信CNTP有潜力成为LLM解码的默认选择之一。代码见 https://github.com/wyzjack/CNTP 。
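The decoding rule described in the abstract (branch when entropy is high, run trials until punctuation, keep the lowest-perplexity trial) can be sketched against a toy model as follows. The `next_dist` interface, the entropy threshold, and the trial-count schedule are assumptions for illustration; the authors' repository should be treated as the reference for the actual method.

```python
import math
import random

PUNCT = {".", ",", "!", "?", ";"}

def entropy(dist):
    """Shannon entropy (nats) of a token -> probability mapping."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def sample(dist, rng):
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def run_trial(next_dist, prefix, rng, max_len=20):
    """Sample tokens until punctuation (or max_len); return (tokens, perplexity)."""
    tokens, logp = [], 0.0
    while len(tokens) < max_len:
        dist = next_dist(prefix + tokens)
        tok = sample(dist, rng)
        logp += math.log(dist[tok])
        tokens.append(tok)
        if tok in PUNCT:
            break
    return tokens, math.exp(-logp / len(tokens))

def cautious_step(next_dist, prefix, rng, threshold=0.5, max_trials=8):
    """Emit one token when confident; otherwise emit the lowest-perplexity trial."""
    dist = next_dist(prefix)
    h = entropy(dist)
    if h <= threshold:
        return [sample(dist, rng)]
    # Assumed schedule: the number of trials grows with normalized entropy.
    n = max(2, round(max_trials * h / math.log(len(dist))))
    trials = [run_trial(next_dist, prefix, rng) for _ in range(n)]
    return min(trials, key=lambda t: t[1])[0]

def toy_model(prefix):
    # Hypothetical stand-in for an LLM: uncertain at the first step only.
    if not prefix:
        return {"the": 0.5, "a": 0.5}
    if prefix[-1] in ("the", "a"):
        return {"cat": 0.9, "dog": 0.1}
    return {".": 1.0}
```

Calling `cautious_step(toy_model, [], random.Random(0))` branches because the first-step entropy (ln 2) exceeds the threshold, while `cautious_step(toy_model, ["the"], ...)` returns a single sampled token.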


Article 152

Title@2025-07-23 (3): Is text normalization relevant for classifying medieval charters?

Title: Is text normalization relevant for classifying medieval charters? Ist die Textnormierung für die Klassifizierung mittelalterlicher Chartas relevant? 文本正常化是否与中世纪宪章的分类相关? 2408.16446v2

Authors (2): Florian Atzenhofer-Baumgartner, Tamás Kovács

This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.

本研究报告审查了历史文本正常化对中世纪宪章分类的影响,特别侧重于文件日期和定位。我们利用数字档案中的中高地德国宪章数据集,评估各种分类者,包括传统和变压器模型,不论是否具有正常化。我们的结果表明,这种正常化最低限度地改进了定位任务,但降低了约会的准确性,意味着原始文本含有正常化可能模糊的关键特征。我们发现支持矢量机器和梯度提升超过其他模型,质疑变压器在这一使用案例中的效率。结果显示,对历史文本正常化采取选择性办法,强调保留对文件分析中的分类任务至关重要的一些文字特征的重要性。


Article 153

Title@2025-07-23 (3): Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge

Title: Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge Triple X: Ein LLM-basiertes mehrsprachiges Spracherkennungssystem für die INTERSPEECH2025 MLC-SLM Challenge Triple X:面向INTERSPEECH2025 MLC-SLM挑战赛的基于LLM的多语言语音识别系统 2507.17288v1

Authors (3): Miaomiao Gao, Xiaoxiao Xiang, Yiwen Guo

This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.

本文介绍了我们提交给多语言对话语言模型(MLC-SLM-LLM)挑战任务1的三维X语音识别系统。我们的工作重点是通过创新的编码器-适应器-LLM架构,优化多语种对话情景中的语音识别准确性。这个框架利用基于文本的大语言模型的强大推理能力,同时纳入特定领域的适应。为了进一步提高多语种识别绩效,我们采取了精心设计的多阶段培训战略,利用广泛的多语言音频数据集。实验结果显示,我们的方法在Dev和测试组合中都实现了有竞争力的文字错误率,在挑战排名中获得了第二位。


Article 154

Title@2025-07-23 (3): Tiny language models

Title: Tiny language models Kleine Sprachmodelle 微小语言模式 2507.14871v2

Authors (5): Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter

A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer block architectures pre-trained on large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on FewRel, AGNews, and DBPedia classification tasks. Future research on TLM is expected to further illuminate the mechanisms underlying NLP, especially given that its biologically inspired models suggest that TLMs may be sufficient for children or adolescents to develop language. The data and code that support the findings of this study are openly available on https://github.com/Rg32601/Tiny-Language-Models .

自然语言处理(NLP)的一个突出成就是它能够理解和生成有意义的人类语言。这一能力依赖于在大型语言模型(LLMs)上预训练的复杂前馈Transformer块架构。然而,由于所需的计算资源巨大,LLM预训练目前只有少数占主导地位的公司才能进行,这限制了更广泛的研究参与,因此迫切需要更易获得的替代方案。在本研究中,我们探讨微型语言模型(TLMs)是否表现出与LLMs相同的关键定性特征。我们证明,在各项分类任务中,经过预训练和未经预训练的TLM之间存在明显的性能差距,这表明即使在极小规模下预训练也是有效的。性能差距随预训练数据集规模的增大而扩大,也随预训练数据集与分类数据集之间token重叠程度的提高而扩大。此外,预训练的深层TLM架构所达到的分类准确率,可以通过由多个独立预训练的浅层架构组成的软委员会(soft committee)来复现,从而在不影响分类准确率的情况下实现低延迟的TLM。我们的结果基于在Wikipedia数据集子集上预训练BERT-6和BERT-1的若干变体,并在FewRel、AGNews和DBPedia分类任务上评估其性能。未来关于TLM的研究有望进一步阐明NLP的内在机制,尤其是考虑到受生物学启发的模型表明,TLM可能足以让儿童或青少年发展语言能力。支持本研究结论的数据和代码已在 https://github.com/Rg32601/Tiny-Language-Models 公开。
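The abstract mentions replacing a deep model with a "soft committee" of shallow ones. A minimal sketch of one common reading of that term follows: average the members' class probabilities and take the argmax. Whether the paper combines members exactly this way is an assumption, and the logits in the usage note are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_committee(member_logits):
    """Average the members' class probabilities; return (label, averaged probabilities)."""
    probs = [softmax(logits) for logits in member_logits]
    num_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    label = max(range(num_classes), key=lambda c: avg[c])
    return label, avg
```

For example, `soft_committee([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0]])` selects class 0 even though one member prefers class 1. Because the members are independent, they can be evaluated in parallel, which is the source of the latency benefit the abstract describes.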


Article 155

Title@2025-07-23 (3): Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start Multimodale Reasoning durch verstärktes Lernen mit kaltem Start fördern 通过 “ 冷起 “ 的强化学习推进多模式理由 2505.22334v2

Authors (8): Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang

Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While “aha moment” patterns, in which models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.

大型语言模型(LLMs)的最新进展展示了令人印象深刻的思维链推理能力,其中强化学习(RL)发挥了关键作用。虽然“aha moment”模式(即模型通过反思进行自我纠正)常被归因于RL带来的涌现特性,但我们首先证明,这些模式在RL训练之前就已存在于多模态LLM(MLLMs)中,而且不一定与推理性能的提升相关。基于这些见解,我们对通过两阶段方法增强多模态推理进行了全面研究:(1)以结构化思维链推理模式进行监督微调(SFT)作为冷启动,随后(2)通过GRPO进行强化学习,进一步完善这些能力。大量实验表明,在具有挑战性的多模态推理基准上,这种组合方法始终优于仅用SFT和仅用RL的方法。所得模型在3B和7B两个规模上均达到开源MLLM中的最先进水平,其中7B模型相比基础模型有大幅提升(例如MathVista上 66.3% → 73.4%,We-Math上 62.9% → 70.4%),3B模型的性能可与多个7B模型相当。总体而言,这项工作为构建先进的多模态推理模型提供了实用指导。代码见 https://github.com/waltonfuture/RL-with-Cold-Start 。
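The second stage in the abstract uses GRPO, whose core idea is to score each sampled response relative to the other responses for the same prompt. A minimal sketch of that group-relative advantage follows. It uses the population standard deviation and an `eps` guard, both of which are implementation choices that vary between codebases, and it omits the clipped policy-gradient objective and any KL term.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]` the advantages are approximately `[1, -1, -1, 1]`. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.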


Article 156

Title@2025-07-23 (3): An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Title: An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning Ein effizientes und präzises Training Data Construction Framework für prozessbeaufsichtigtes Prämienmodell in mathematischer Reasoning 由进程监督的数学理由评分模型的高效率和精确的培训数据构建框架 2503.02382v2

Authors (4): Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang

Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models’ reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Epic50k is available at https://github.com/xiaolizh1/EpicPRM.

提高大语言模型的数学推理能力具有极大的科学和实践意义,研究人员通常使用由流程监督的奖励模型来指导推理过程,有效地提高模型推理能力,然而,现有的程序监督培训数据构建方法,如人工注解和每步蒙特卡洛估算,往往成本高昂或质量差,为了应对这些挑战,本文件提出了一个称为EpicPRM的框架,根据每个中间推理步骤的量化贡献进行注解,并使用适应性的二进制搜索算法来提高注解精确度和效率。我们采用这一方法,高效率地构建了一个名为Epic50k的高质量过程监督培训数据集,由50k个附加说明的中间步骤组成。与其他公开提供的数据集相比,关于Epic50k的PRM培训表现得非常出色。在https://github.com/xaolizh1/EpicPRM上获取Epic50k。


Article 157

Title@2025-07-23 (3): Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs

Title: Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs Tab-MIA: Ein Benchmark-Datensatz für Mitgliedschafts-Inferenzangriffe auf Tabellendaten in LLMs Tab-MIA:关于LLMM表列数据的成员推断攻击基准数据集 2507.17259v1

Authors (5): Eyal German, Sagiv Antebi, Daniel Samira, Asaf Shabtai, Yuval Elovici

Large language models (LLMs) are increasingly trained on tabular data, which, unlike unstructured text, often contains personally identifiable information (PII) in a highly structured and explicit format. As a result, privacy risks arise, since sensitive records can be inadvertently retained by the model and exposed through data extraction or membership inference attacks (MIAs). While existing MIA methods primarily target textual content, their efficacy and threat implications may differ when applied to structured data, due to its limited content, diverse data types, unique value distributions, and column-level semantics. In this paper, we present Tab-MIA, a benchmark dataset for evaluating MIAs on tabular data in LLMs and demonstrate how it can be used. Tab-MIA comprises five data collections, each represented in six different encoding formats. Using our Tab-MIA benchmark, we conduct the first evaluation of state-of-the-art MIA methods on LLMs finetuned with tabular data across multiple encoding formats. In the evaluation, we analyze the memorization behavior of pretrained LLMs on structured data derived from Wikipedia tables. Our findings show that LLMs memorize tabular data in ways that vary across encoding formats, making them susceptible to extraction via MIAs. Even when fine-tuned for as few as three epochs, models exhibit high vulnerability, with AUROC scores approaching 90% in most cases. Tab-MIA enables systematic evaluation of these risks and provides a foundation for developing privacy-preserving methods for tabular data in LLMs.

大型语言模型(LLMS)在表格数据上日益受到培训,这些数据与结构化文本不同,通常包含个人识别信息(PII),其格式结构化和清晰,因此,隐私风险出现,因为敏感记录可能无意中被模型保留,并通过数据提取或成员推断攻击(MIAs)暴露出来。虽然现有的MIA方法主要针对文字内容,但其效力和威胁影响在应用结构化数据时可能有所不同,因为其内容有限、数据类型不同、独特的价值分布和列级语义。在本文中,我们介绍了Tab-MIA,这是用于评价LMS中表格中表格中的个人识别信息(PII)的基准数据集,并展示了如何使用该数据库。Tab-MIA包含五种数据收集,每种编码格式都有六种不同的编码格式。我们使用Tab-MI基准对LMA中的最新MI方法进行了第一次评估,该方法与表格中的表格相比,我们分析了预先培训的LMS-MI在从维基百科表格中得出的结构化数据中的缩缩缩缩缩缩缩图。
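The abstract reports attack strength as AUROC. A minimal sketch of how that number is computed for a simple loss-based membership inference attack is shown below. The loss-threshold attack is a standard baseline and is not claimed to be one of the specific methods the benchmark evaluates; the loss values in the usage note are made up.

```python
def loss_attack_scores(losses):
    """Lower loss suggests the record was seen in training, so negate it to get a score."""
    return [-loss for loss in losses]

def auroc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

For instance, with per-record losses `[0.2, 0.4, 0.9]` on training rows and `[0.5, 1.1, 1.3]` on held-out rows, the attack ranks 8 of the 9 member/non-member pairs correctly, giving an AUROC of 8/9. A value of 0.5 would mean the attack does no better than chance.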


Article 158

Title@2025-07-23 (3): Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models

Title: Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models Modality-Aware Neuron Pruning für das Lernen in multimodalen großen Sprachmodellen 多式联运大语言模型中不学习模式-Aware中度中枢 2502.15910v3

Authors (6): Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, Meng Jiang

Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) are trained on massive datasets, which can lead them to memorize and inadvertently reveal sensitive information, raising ethical and privacy concerns. While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. Specifically, MANU consists of two stages: important neuron selection and selective pruning. The first stage identifies and collects the most influential neurons across modalities relative to the targeted forget knowledge, while the second stage is dedicated to pruning those selected neurons. MANU effectively isolates and removes the neurons that contribute most to the forget data within each modality, while preserving the integrity of retained knowledge. Our experiments conducted across various MLLM architectures illustrate that MANU can achieve more balanced and comprehensive unlearning in each modality without substantially affecting the overall model utility.

大型语言模型(LLMS)和多式大型语言模型(MLLM)等在大规模数据集方面受过培训的大型语言模型(MLLM)等生成模型可以导致它们记忆和无意中透露敏感信息,从而引起道德和隐私方面的关注。虽然以前的一些著作在LLM中探讨了这一问题,但由于各种模式的知识相互交织,使得综合的不学习更加困难,因此对MLLMS提出了独特的挑战。为了应对这一挑战,我们提议为MLLMS提供一个新的不学习框架,根据对特定遗忘数据的相对重要性,为选择性地剪辑神经元设计一个全新的不学习框架。具体地说,MANU由两个阶段组成:重要的神经选择和选择性剪辑。第一阶段确定并收集了与目标遗忘知识相关的最有影响力的神经元,而第二阶段则专门处理选定的神经元。MANU实际上孤立并消除了最有助于在每种模式中遗忘数据的神经元,同时保持所保留的知识的完整性。我们在不同模式下进行的实验,而没有影响MALM结构的不全面学习模式,说明MAU能否。
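The two stages named in the abstract (important neuron selection, then selective pruning) can be illustrated with a deliberately simple scoring rule. The forget-to-retain activation ratio used here is an assumption chosen for clarity; the abstract does not specify MANU's importance functions, and this sketch ignores the per-modality treatment entirely.

```python
def mean_abs(activations):
    """activations: list of per-example lists, one value per neuron."""
    n = len(activations)
    width = len(activations[0])
    return [sum(abs(row[j]) for row in activations) / n for j in range(width)]

def select_neurons(forget_acts, retain_acts, k, eps=1e-8):
    """Stage 1 (assumed rule): rank neurons by forget/retain mean activation ratio."""
    forget = mean_abs(forget_acts)
    retain = mean_abs(retain_acts)
    scores = [f / (r + eps) for f, r in zip(forget, retain)]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def prune(weights, neurons):
    """Stage 2: zero the weight rows of the selected neurons; returns a new matrix."""
    drop = set(neurons)
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(weights)]
```

Scoring relative to retain data, rather than by forget activation alone, is what keeps neurons that matter for both sets from being removed, in line with the abstract's goal of preserving retained knowledge.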


Article 159

Title@2025-07-23 (3): Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

Title: Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent Test-Time-Matching: Entkoppelung von Persönlichkeit, Speicher und sprachlichem Stil im LLM-basierten Rollenspiel-Sprachenagenten 测试时间 – – 匹配:以LLM为基础的角色扮演语言代理的分解个性、记忆和语言风格 2507.16799v2

Authors (6): Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo

The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character’s features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves outstanding performance in generating expressive and stylistically consistent character dialogues.

大型语言模型(LLMS)的迅速发展使角色扮演语言代理能够在各种应用中展示出巨大的潜力,然而,仅仅依靠提示和背景投入往往不足以在具体角色中实现深刻的渗透,特别是众所周知的虚构或公众人物。另一方面,微调方法由于数据收集和训练所需的计算资源方面的挑战而面临局限性,从而限制了其更广泛的适用性。为了解决这些问题,我们提议通过测试时间缩放和背景工程,将测试时间匹配(TTM)这个不培训的角色扮演框架作为测试时间缩放和背景工程。TTM利用LM代理自动将一个字符特征分解为个性、记忆和语言风格。我们的框架包含一个结构化的、三阶段生成管道,利用这些特征来控制角色扮演。它实现高度不共性的角色发挥功能,还能够使多种语言风格、甚至个性和记忆的变异。我们通过人类评估来评估我们的框架,结果表明我们的方法在生成直观和自相一致的性格对话方面达到了杰出的性能。


Article 160

Title@2025-07-23 (3): CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings

Title: CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings CLARIFID: Verbesserung der Radiologie-Berichtsgenerierung durch Verstärkung klinisch exakter Impressionen und Verstärkung detaillierter Befunde CLARIFID:通过加强临床准确压抑和执行详细调查结果,改进放射学报告的编制工作 2507.17234v1

Authors (3): Kyeongkyu Lee, Seonghwan Yoon, Hongki Lim

Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.

自动生成放射学报告有望减轻放射科医生的繁重工作量,但现有方法难以给出临床上可靠的结论。具体而言,以往的大多数方法侧重于生成流畅的文本,却未能有效保证报告的事实正确性,而且往往依赖单视图图像,限制了诊断的全面性。我们提出CLARIFID,这是一个通过模仿专家两步工作流程来直接优化诊断正确性的新框架。具体来说,CLARIFID(1)通过章节感知的预训练学习从Findings到Impression的逻辑流程,(2)使用近端策略优化(Proximal Policy Optimization)进行微调,以Impression部分的CheXbert F1分数作为奖励,(3)采用推理感知的解码,在综合“Impression”之前先完成“Findings”,(4)通过基于视觉Transformer的多视图编码器融合多个胸部X光视图。在推理阶段,我们采用推理感知的下一个token强制策略,再进行报告级重排序,确保模型先生成完整的Findings部分,再综合Impression,从而保持连贯的临床推理。在MIMIC-CXR数据集上的实验结果表明,我们的方法取得了更优的临床效果,并在标准NLG指标和临床感知评分上均优于现有基线。


Article 161

Title@2025-07-23 (3): GTA: Grouped-head latenT Attention

Title: GTA: Grouped-head latenT Attention GTA: Grouped-head latenT Achtung GTA: 分组组长晚间会议 2506.17286v2

Authors (8): Luoyang Sun, Cheng Deng, Jiwen Jiang, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang

Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy, since the KV cache can be significantly compressed and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose Grouped-Head LatenT Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint.

注意力机制是大型语言模型(LLMs)成功的基础,但其巨大的计算和内存开销给效率与性能的优化带来了挑战。关键瓶颈在于KV缓存和注意力计算随文本长度迅速增长,使其难以部署在计算和内存资源有限的硬件上。我们观察到注意力机制存在大量冗余:KV缓存可以被大幅压缩,不同注意力头的注意力图高度相似,这说明大部分计算和存储是不必要的。基于这些观察,我们提出Grouped-Head LatenT Attention(GTA),这是一种在保持性能的同时降低内存占用和计算复杂度的新型注意力机制。GTA包含两个组件:(1)共享注意力图机制,在多个头之间复用注意力分数,从而减小key缓存;(2)带有可学习投影的非线性value解码器,将value缓存压缩到潜在空间,进一步降低内存需求。与Grouped-Query Attention相比,GTA最多可将注意力计算的FLOPs减少62.5%,并将KV缓存缩小最多70%,同时避免了Multi-Head Latent Attention的额外开销,从而提高LLM的部署效率。因此,GTA模型的端到端推理速度提升2倍,其中prefill阶段受益于计算成本的降低,解码阶段受益于更小的缓存占用。


Article 162

Title@2025-07-23 (3): A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task

Title: A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task Ein hochreines Rezept Datensatz mit Inhaltsstoff-Staaten Annotation für staatliche Probing-Aufgabe 国家检验任务说明 2507.17232v1

Authors (3): Mashiro Toyooka, Kiyoharu Aizawa, Yoko Yamakata

Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model’s understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs.

大语言模型(LLMS)是针对大量程序文本进行培训的,但并不直接观察现实世界现象。在烹饪食谱方面,这构成一个挑战,因为中间成分状态往往被忽略,使得模型难以跟踪成分状态并准确理解配方。在本文中,我们将评估一种语言模型对世界的理解的一种方法,即状态测试应用于烹饪领域。我们提出一个新的任务和数据集,用于评价LLMS在烹饪过程中对中间成分状态的认识程度。我们首先从结构完善和受控制的配方文本中收集了一套清晰和准确的成分状态变化说明的新的日本配方数据集。我们设计了三项新任务来评估LMS是否能够跟踪成分状态过渡和确定中间步骤的成分。我们与广泛使用的LLlama3.1-70B和Qwen2.5-72B等LMs进行的实验表明,学习成分状态知识提高了他们对烹饪过程的理解,取得了与商业LMs相似的性能。


Article 163

Title@2025-07-23 (3): The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models

Title: The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models Die pluralistische Morallücke: Urteil und Wertunterschiede zwischen Menschen und großen Sprachmodellen verstehen 多元道德差距:了解人类与大语言模式之间的判断和价值差异 2507.17216v1

Authors (4): Giuseppe Russo, Debora Nozza, Paul Röttger, Dirk Hovy

People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans’ decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap: a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance from LLMs.

人们越来越多地依赖大语言模型(LLMs)获取道德建议,这可能影响人类的决策。然而,人们对LLMs与人类道德判断的一致程度知之甚少。为此,我们引入道德困境数据集(Moral Dilemma Dataset),这是一个包含1,618个现实世界道德困境的基准,每个困境都配有人类道德判断的分布,判断由二元评价和自由文本理由组成。我们将该问题视为多元分布对齐任务,比较LLM与人类在各困境上的判断分布。我们发现,模型只有在人类高度一致时才能复现人类判断;当人类分歧增大时,对齐程度急剧下降。同时,我们利用从理由中提取的3,783个价值表达构建了包含60种价值的分类体系,并表明LLMs依赖的道德价值范围比人类更窄。这些发现揭示了多元道德差距:所表达价值在分布和多样性上均不匹配。为缩小这一差距,我们提出动态道德画像(Dynamic Moral Profiling,DMP),这是一种基于Dirichlet分布的采样方法,使模型输出以源自人类的价值画像为条件。DMP将对齐程度提高了64.3%,并增强了价值多样性,为LLMs提供更多元、更符合人类的道德指导迈出了一步。
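
The Dirichlet-based conditioning described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the taxonomy entries, the concentration parameters, or the prompt format, so the value names, `alpha`, and `build_prompt` below are hypothetical placeholders, not the authors' implementation.

```python
import random


def sample_value_profile(alpha, rng):
    """Draw one value profile (a point on the simplex) from Dirichlet(alpha).

    Uses the standard construction: normalise independent Gamma(alpha_i, 1) draws.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def build_prompt(dilemma, values, profile, top_k=2):
    """Hypothetical prompt template: name the top-k values of the sampled profile."""
    ranked = sorted(zip(values, profile), key=lambda vp: vp[1], reverse=True)
    chosen = ", ".join(name for name, _ in ranked[:top_k])
    return f"Judge the dilemma while prioritizing these values: {chosen}.\n{dilemma}"


# Illustrative values and concentrations (assumptions, not the paper's 60-value taxonomy).
values = ["honesty", "loyalty", "fairness", "autonomy"]
alpha = [4.0, 1.0, 2.0, 1.0]

rng = random.Random(0)
profile = sample_value_profile(alpha, rng)
prompt = build_prompt("Should I report a friend's mistake?", values, profile)
```

Sampling a fresh profile per query, rather than fixing one, is what lets repeated generations spread over different value emphases; how the concentrations would be estimated from human rationales is left open here.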


Article 164

Title@2025-07-23 (3): LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants

Title: LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants LEGO Co-Builder: Exploring Fine-Grained Vision-Language Modeling für multimodale LEGO Assembly Assistants LEGO 共同建筑者:为多式LEGO大会助理探索精美的愿景-语言建模 2507.05515v2

Authors (9): Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan, Chuang Yu, Zhaochun Ren, Pablo Cesar, Junxiao Wang

Vision-language models (VLMs) struggle to understand and follow multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction-following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54% on state detection, highlighting gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.

视觉语言模型(VLMs)在理解和遵循多模态装配指令方面面临挑战,尤其是在需要细粒度空间推理和精确物体状态检测时。在这项工作中,我们探索了LEGO Co-builder,这是一个混合基准,将现实世界的LEGO装配逻辑与程序生成的多模态场景相结合。数据集记录了逐步的视觉状态和过程指令,可对指令遵循、物体检测和状态检测进行受控评估。我们引入了一个统一框架,并在零样本和微调设置下评估GPT-4o、Gemini和Qwen-VL等领先的VLMs。结果显示,即使是GPT-4o等先进模型也难以完成细粒度装配任务,在状态检测上的最高F1分数仅为40.54%,凸显了细粒度视觉理解方面的差距。我们发布了基准、代码库和生成流水线,以支持未来面向现实工作流程的多模态装配助手研究。


Article 165

Title@2025-07-23 (3): AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation

Title: AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation AlignDistil: Token-Level-Sprachmodell Alignment als Adaptive Policy Destillation Aligndistil: 作为适应性政策蒸馏的调整级语言模式模型对齐 2503.02832v3

Authors (6): Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu

In modern large language models (LLMs), alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. Ignoring token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token-adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.

在现代大语言模型(LLMs)中,对齐至关重要,通常通过基于人类反馈的强化学习(RLHF)和直接偏好优化(DPO)等方法实现。然而,在现有的大多数LLM对齐方法中,回复中的所有词元都使用稀疏的、回复级别的奖励或偏好标注进行优化。忽视词元级奖励可能会错误地惩罚高质量词元或鼓励低质量词元,导致性能欠佳和收敛缓慢。为解决这一问题,我们提出AlignDistil,一种与RLHF等价的蒸馏方法,用于词元级奖励优化。具体而言,我们将DPO学到的奖励引入RLHF目标,并从理论上证明该目标与词元级蒸馏过程等价,其中教师分布由DPO模型和参考模型的logits线性组合而成。在此基础上,我们用一个正向DPO模型和一个反向DPO模型构建对比式DPO奖励,进一步缩小DPO模型奖励与纯奖励模型之间的准确率差距。此外,为避免不同词元上的优化不足和过度优化,我们设计了词元自适应的logit外推机制,为每个词元构建合适的教师分布。实验结果表明,AlignDistil优于现有方法,并且得益于词元级分布式奖励优化而收敛更快。
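
As a rough illustration of the token-level distillation idea in the abstract above, the sketch below builds a teacher distribution from a linear combination of DPO-model and reference-model logits and measures a student's divergence from it. The abstract does not state the combination coefficients or the divergence used, so the `weight` parameterisation and the forward KL loss are assumptions, and all logits are made-up numbers for a single token position.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_logits(dpo_logits, ref_logits, weight):
    """Assumed linear combination: ref + weight * (dpo - ref).

    weight = 0 recovers the reference model, weight = 1 the DPO model,
    and weight > 1 extrapolates beyond the DPO model.
    """
    return [r + weight * (d - r) for d, r in zip(dpo_logits, ref_logits)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy three-token vocabulary (illustrative values only).
dpo = [2.0, 0.5, -1.0]
ref = [1.0, 1.0, 0.0]
student = [1.2, 0.9, -0.2]

teacher = softmax(teacher_logits(dpo, ref, weight=1.5))
loss = kl_divergence(teacher, softmax(student))
```

In a token-adaptive variant, `weight` would differ per position; how it is chosen is not specified in the abstract and is left out here.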


Article 166

Title@2025-07-23 (3): FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance

Title: FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance FinGAIA: Ein End-to-End-Benchmark für die Bewertung von KI-Agenten in der Finanzierung FinGAIA: 对AI公司金融代理机构进行评价的端至端基准 2507.17186v1

Authors (21): Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang

The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags behind financial experts by over 35 percentage points. Error analysis revealed five recurring failure patterns, including Cross-modal Alignment Deficiency, Financial Terminological Bias, and Operational Process Awareness Barrier. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.

AI智能体的蓬勃发展为各领域复杂任务的自动化带来了前所未有的机遇。然而,它们在金融领域的多步骤、多工具协作能力仍未得到充分探索。本文介绍FinGAIA,一个旨在评估AI智能体在金融领域实际能力的端到端基准。FinGAIA包含407项精心设计的任务,涵盖七个主要金融子领域:证券、基金、银行、保险、期货、信托和资产管理。这些任务按场景深度分为三个层级:基础业务分析、资产决策支持和战略风险管理。我们在零样本设置下评估了10个主流AI智能体。表现最好的智能体ChatGPT的总体准确率为48.9%,虽然优于非专业人员,但仍落后金融专家超过35个百分点。错误分析揭示了五种反复出现的失败模式,包括跨模态对齐缺陷、金融术语偏差和业务流程认知障碍等。这些模式为未来研究指明了关键方向。我们的工作提供了首个与金融领域紧密相关的智能体基准,旨在客观评估并推动智能体在这一关键领域的发展。部分数据可在 https://github.com/SUFE-AIFLM-Lab/FinGAIA 获取。


Article 167

Title@2025-07-23 (3): SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs

Title: SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs SKA-Bench: Ein feinkörniger Benchmark zur Bewertung des strukturierten Wissensverständnisses von LLMs SKA-Bunch:评估对LLMS的结构性知识了解的精细基准 2507.17178v1

Authors (7): Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang

Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and that their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination. Our dataset and code are available at https://github.com/Lza12a/SKA-Bench.

尽管大语言模型(LLMs)在理解KG和Table等结构化知识(SK)方面取得了显著进展,但现有的SK理解评估不够严谨(即缺乏对具体能力的评估),且只关注单一类型的SK。因此,我们旨在提出一个更全面、更严谨的结构化知识理解基准,以诊断LLMs的不足。本文介绍SKA-Bench,一个结构化知识增强的问答基准,涵盖四种广泛使用的结构化知识形式:KG、Table、KG+Text和Table+Text。我们采用三阶段流水线构建SKA-Bench实例,每个实例包含一个问题、一个答案、正向知识单元和噪声知识单元。为了细粒度地评估LLMs的SK理解能力,我们将这些实例扩展为四个基本能力测试台:噪声鲁棒性、顺序不敏感性、信息整合和负例拒答。对包括先进的DeepSeek-R1在内的8个代表性LLMs的实证评估表明,现有LLMs在理解结构化知识方面仍面临重大挑战,其性能受噪声数量、知识单元顺序和幻觉等因素影响。我们的数据集和代码可在 https://github.com/Lza12a/SKA-Bench 获取。


Article 168

Title@2025-07-23 (3): Adaptive Graph Pruning for Multi-Agent Communication

Title: Adaptive Graph Pruning for Multi-Agent Communication Adaptives Graph Pruning für Multi-Agent Kommunikation 多机构通信调节图 2506.02951v3

Authors (4): Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang

Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: first, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and generalizing consistently across multiple mainstream LLM architectures, with a performance increase of 2.58% to 9.84%; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, requiring fewer training steps while also reducing token consumption by more than 90%; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. Performance surpasses the existing baselines after about ten training steps on the six benchmarks.

基于大语言模型(LLM)的多智能体系统在各类任务中表现出色,尤其是在通过协作通信得到增强时。然而,现有方法往往依赖固定数量的智能体和静态通信结构,限制了其适应不同任务复杂度的能力。本文提出自适应图剪枝(Adaptive Graph Pruning,AGP),一种新颖的任务自适应多智能体协作框架,可联合优化智能体数量(硬剪枝)和通信拓扑(软剪枝)。具体而言,我们的方法采用两阶段训练策略:首先,针对不同的智能体数量分别独立训练软剪枝网络,以确定各具体任务下针对特定智能体数量的最优完全图和位置掩码;然后,在最大完全图内联合优化硬剪枝和软剪枝,为每个任务动态配置智能体数量及其通信拓扑。大量实验表明,我们的方法具有以下特点:(1)高性能,在六个基准上取得最先进的结果,并能在多种主流LLM架构上稳定泛化,性能提升2.58%至9.84%;(2)任务自适应,能针对具体任务动态构建优化的通信拓扑,在全部三类任务(通用推理、数学推理和代码生成)中均表现极佳;(3)节省词元,在减少训练步数的同时降低词元消耗,词元消耗减少90%以上;(4)训练高效,与其他方法相比只需很少的训练步数即可达到高性能,在六个基准上训练约十步后性能即超过现有基线。


Article 169

Title@2025-07-23 (3): SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script

Title: SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script SHARE: Shared Memory-Aware Open-Domain Langzeitdialogdatensatz aus Movie Script SHARE: 从电影脚本建构的内存- 内存- 内存- 公用 Open- Domain 长期对话数据集 2410.20682v3

Authors (3): Eunwon Kim, Chanho Park, Buru Chang

Shared memories between two individuals strengthen their bond and are crucial for facilitating their ongoing conversations. This study aims to make long-term dialogue more engaging by leveraging these shared memories. To this end, we introduce a new long-term dialogue dataset named SHARE, constructed from movie scripts, which are a rich source of shared memories among various relationships. Our dialogue dataset contains the summaries of persona information and events of two individuals, as explicitly revealed in their conversation, along with implicitly extractable shared memories. We also introduce EPISODE, a long-term dialogue framework based on SHARE that utilizes shared experiences between individuals. Through experiments using SHARE, we demonstrate that shared memories between two individuals make long-term dialogues more engaging and sustainable, and that EPISODE effectively manages shared memories during dialogue. Our dataset and code are available at https://github.com/e1kim/SHARE.

两个人之间的共同记忆能够加深彼此的联系,对促进他们持续的对话至关重要。本研究旨在利用这些共同记忆,使长期对话更具吸引力。为此,我们推出了一个名为SHARE的新长期对话数据集,它由电影剧本构建而成,而电影剧本是各种关系中共同记忆的丰富来源。我们的对话数据集包含对话中明确透露的两个人的人设信息和事件摘要,以及可隐式提取的共同记忆。我们还提出了EPISODE,一个基于SHARE、利用个体间共同经历的长期对话框架。通过使用SHARE的实验,我们证明两个人之间的共同记忆使长期对话更具吸引力和可持续性,并且EPISODE能在对话中有效管理共同记忆。我们的数据集和代码可在 https://github.com/e1kim/SHARE 获取。


Article 170

Title@2025-07-23 (3): CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

Title: CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards CogDual: Verbesserung der Dual Cognition von LLMs durch Stärkung des Lernens mit impliziten regelbasierten Belohnungen 认知:通过强化学习,加强LLMs的双重认知,以不隐含规则的奖励加强学习 2507.17147v1

Authors (8): Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen, Feiliang Ren, Zhaopeng Tu, Xiaolong Li

Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.

角色扮演语言智能体(RPLAs)已成为大语言模型(LLMs)的一个重要应用方向。现有方法通常依靠提示工程或监督微调,使模型在特定场景中模仿角色行为,但往往忽视驱动这些行为的底层认知机制。受认知心理学启发,我们提出CogDual,一种采用“先认知、后回应”(cognize-then-respond)推理范式的新型RPLA。通过联合建模外部情境感知和内部自我感知,CogDual生成的回复在角色一致性和上下文契合度上均有提升。为进一步优化性能,我们采用强化学习,并使用两种为开放域文本生成设计的通用奖励方案。在CoSER基准以及Cross-MR和LifeChoice上的大量实验表明,CogDual始终优于现有基线,并能有效泛化到多种角色扮演任务。


Article 171

Title@2025-07-23 (3): Resona: Improving Context Copying in Linear Recurrence Models with Retrieval

Title: Resona: Improving Context Copying in Linear Recurrence Models with Retrieval Resona: Verbesserung der Kontextkopie in linearen Wiederholungsmodellen mit Retrieval Resona: 改进有检索的线性重复模型中环境复制 2503.22913v3

Authors (8): Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui

Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.

最近大型语言模型(LLM)研究空间的变化表明,人们越来越重视新建筑,以便与长期以来主导这一空间的原型变异器模型竞争。线性重复式模型已证明是可行的竞争对手,因为其计算效率高。然而,这些模型仍然表明,在需要回顾背景信息的其他任务方面,与变异体相比,在内文学习方面存在着巨大的差距。在这项工作中,我们引入了Resona,这是一个简单和可扩展的框架,用以通过检索来扩大线性重复式模型。Resona将模型扩大,能够整合从所提供的投入环境中检索的信息,使适应不同任务要求的适应行为。对各种线性重复式模型的实验表明,Resona-推荐模式在各种合成和现实世界自然语言任务方面观察到了显著的绩效收益,突出了它作为提高线性经常性LMS的文性学习和语言建模能力的一般目的方法的能力。


Article 172

Title@2025-07-22 (2): Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings

Title: Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings Evolutionäre Feature-weise Thresholding für Binäre Darstellung von NLP-Embeddings NLP 嵌入器二进制代表制的进化特点 2507.17025v1

Authors (3): Soumen Sinha, Shahryar Rahnamayan, Azam Asilian Bidgoli

Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how binary representations (barcodes) can replace real-valued features in NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using the optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just NLP embeddings, making it useful for a wide range of domains in machine learning applications.

高效的文本嵌入对于大规模自然语言处理(NLP)应用至关重要,其中存储和计算效率是主要关注点。本文探讨如何用二进制表示(条形码)代替实值特征,来表示由BERT等机器学习模型得到的NLP嵌入。阈值化是将连续嵌入转换为二进制表示的常用方法,通常对所有特征使用同一个固定阈值。我们提出一个基于坐标搜索(Coordinate Search)的优化框架,为每个特征分别确定最优阈值,并证明特征专属阈值能够提升二进制编码的性能。这确保了二进制表示既准确又高效,并提升了各特征上的性能。我们的最优条形码表示在多种NLP应用中取得了有希望的结果,展示了其改变文本表示方式的潜力。我们在不同的NLP任务和数据集上进行了大量实验和统计检验,以评估我们的方法并与其他阈值化方法进行比较。使用我们方法找到的最优阈值生成的二进制嵌入,在准确率上优于传统的二值化方法。这种生成二进制表示的技术用途广泛,可应用于任何特征,而不限于NLP嵌入,因此适用于机器学习应用中的众多领域。
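
To make the contrast between a fixed threshold and feature-wise thresholds concrete, here is a minimal coordinate-search sketch. The abstract does not specify the objective or the candidate thresholds, so the leave-one-out nearest-neighbour accuracy under Hamming distance, the candidate grid, and the toy vectors are all assumptions chosen for illustration.

```python
def binarize(vectors, thresholds):
    """Turn real-valued vectors into 0/1 codes using one threshold per feature."""
    return [[1 if x > t else 0 for x, t in zip(v, thresholds)] for v in vectors]


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def loo_accuracy(codes, labels):
    """Leave-one-out 1-nearest-neighbour accuracy under Hamming distance (assumed objective)."""
    correct = 0
    for i, code in enumerate(codes):
        nearest = min((j for j in range(len(codes)) if j != i),
                      key=lambda j: hamming(code, codes[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(codes)


def coordinate_search(vectors, labels, candidates, sweeps=2):
    """Optimise one feature's threshold at a time, keeping only strict improvements."""
    dim = len(vectors[0])
    thresholds = [0.0] * dim  # start from a single fixed threshold for all features
    best = loo_accuracy(binarize(vectors, thresholds), labels)
    for _ in range(sweeps):
        for d in range(dim):
            for c in candidates:
                trial = thresholds[:d] + [c] + thresholds[d + 1:]
                score = loo_accuracy(binarize(vectors, trial), labels)
                if score > best:
                    best, thresholds = score, trial
    return thresholds, best


# Toy data: feature 0 separates the classes near 0.5, feature 1 is label-independent noise.
vectors = [[0.2, -1.0], [0.3, 1.0], [0.8, -1.0], [0.9, 1.0]]
labels = [0, 0, 1, 1]

baseline = loo_accuracy(binarize(vectors, [0.0, 0.0]), labels)
thresholds, best = coordinate_search(vectors, labels, candidates=[-0.5, 0.5, 2.0])
```

On this toy set the fixed threshold of 0 collapses feature 0 to a constant bit, while the per-feature search both recovers the informative split and silences the noisy feature.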


Article 173

Title@2025-07-22 (2): OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Title: OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles OpenVLThinker: Komplexe Vision-Sprachen-Reasoning über iterative SFT-RL-Zyklen OpenVLTHinker:通过循环 SFT-RL循环的复杂愿景-语言理由 2503.17352v2

Authors (6): Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model’s reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.

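我们推出OpenVLThinker,这是最早展现复杂思维链推理能力的开源大型视觉语言模型(LVLMs)之一,在具有挑战性的视觉推理任务上取得了显著的性能提升。虽然基于文本的推理模型(例如Deepseek R1)在纯文本任务中表现出色,但通过监督微调(SFT)将其推理能力蒸馏到LVLMs中,往往会因视觉定位不精确而导致性能下降。相反,纯粹基于强化学习(RL)的方法面临巨大的搜索空间,阻碍了较小模型(例如7B LVLMs)中反思行为的出现。令人惊讶的是,在SFT和RL之间交替进行,经过几轮迭代后最终带来了显著的性能提升。我们的分析表明,基础模型最初很少表现出推理行为,但SFT能有效激发这些潜在行为并缩小RL的搜索空间,从而加速推理能力的发展。随后的每个RL阶段进一步提升模型的推理能力,并为持续的自我改进生成更高质量的SFT数据。OpenVLThinker-7B在六个需要数学和通用推理的基准上持续提升性能,其中MathVista提高3.8%,EMMA提高2.4%,HallusionBench提高1.6%。除了展示SFT与RL在复杂推理任务中的协同作用之外,我们的发现还为在多模态场景中实现R1式推理提供了早期证据。代码、模型和数据可在 https://github.com/yihedeng9/OpenVLThinker 获取。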


Article 174

Title@2025-07-22 (2): Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Title: Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? Können externe Validierungstools die Annotationsqualität für LLM-as-a-Judge verbessern? 外部验证工具能否提高LLM-as-a-Judge的批注质量? 2507.17015v1

Authors (6): Arduin Findeis, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter

Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the “better” response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM’s internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at https://github.com/apple/ml-agent-evaluator.

针对模型回复的成对偏好被广泛收集,用于评估大语言模型(LLMs)并向其提供反馈。给定对同一输入的两个候选模型回复,由人类或AI标注者选出“更好”的回复。这种方法可以为难以获得其他硬编码指标的领域(例如聊天回复质量)提供反馈,从而帮助模型评估或训练。然而,在某些领域,无论是AI还是人类,都很难给出高质量的成对比较。例如,对于包含许多事实性陈述的回复,标注者可能过分看重写作质量而非背后的事实。在这项工作中,我们探索为标准AI标注系统增加额外工具,以提升其在三个具有挑战性的回复领域的表现:长篇事实性、数学和代码任务。我们提出一个使用工具的智能体系统,为这些领域提供更高质量的反馈。我们的系统利用网络搜索和代码执行,以外部验证为依据,不依赖LLM的内部知识和偏见。我们使用RewardBench(包括AlpacaEval和LLMBar)、RewardMath以及针对已有数据集趋于饱和的领域新建的三个数据集,在三个目标回复领域以及通用标注任务上对我们的方法进行了大量实验评估。结果表明,外部工具确实可以在许多(但并非全部)情况下提升性能。更一般地说,我们的实验凸显了性能对简单参数(例如提示词)的敏感性,以及对更好的(未饱和的)标注基准的需求。我们的代码发布于 https://github.com/apple/ml-agent-evaluator。


Article 175

Title@2025-07-22 (2): Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors

Title: Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors Multi-Label-Klassifikation mit generativen KI-Modellen im Gesundheitswesen: Eine Fallstudie über Suizidalität und Risikofaktoren 多标签分类,具有产生AI 保健模式的模式:关于自杀性和风险因素的个案研究 2507.17009v1

Authors (12): Ming Huang, Zehan Li, Yan Hu, Wanjing Wang, Andrew Wen, Scott Lane, Salih Selek, Lokesh Shahani, Rodrigo Machado-Vieira, Jair Soares, Hua Xu, Hongfang Liu

Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of co-occurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end-to-end generative MLC pipeline and introduce advanced evaluation methods, including label-set-level metrics and a multi-label confusion matrix for error analysis. Fine-tuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models' tendency toward cautious over-labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large-scale clinical research and evidence-based medicine.

自杀仍然是一个紧迫的全球健康危机,每年有超过72万人死亡,还有数百万人受到自杀意念(SI)和自杀未遂(SA)的影响。及早识别自杀相关因素(SrFs),包括SI、SA、自杀暴露(ES)和非自杀性自伤(NSSI),对于及时干预至关重要。虽然先前的研究已应用AI在临床记录中检测SrFs,但大多将自杀倾向视为二分类任务,忽视了风险因素同时出现的复杂性。本研究探索使用生成式大语言模型(LLMs),具体为GPT-3.5和GPT-4.5,对精神科电子健康记录(EHRs)中的SrFs进行多标签分类(MLC)。我们提出了一种新颖的端到端生成式MLC流水线,并引入了先进的评估方法,包括标签集层面的指标和用于错误分析的多标签混淆矩阵。微调后的GPT-3.5取得了最佳表现,部分匹配准确率为0.94,F1分数为0.91;而采用引导式提示的GPT-4.5在各标签集(包括罕见或少数标签集)上表现更优,表明其性能更均衡、更稳健。我们的发现揭示了系统性的错误模式,例如混淆SI和SA,并凸显了模型倾向于谨慎地过度标注。这项工作不仅证明了将生成式AI用于复杂临床分类任务的可行性,还为将非结构化EHR数据结构化、以支持大规模临床研究和循证医学提供了蓝图。
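
The label-set-level evaluation mentioned in the abstract above can be sketched with a few small metric functions. The abstract does not define "partial match accuracy", so the definition below (prediction and gold share at least one label, or both are empty) is an assumption, as are the micro-averaged F1 and the toy gold and predicted label sets.

```python
def exact_match(gold, pred):
    """Fraction of notes whose predicted label set equals the gold label set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def partial_match(gold, pred):
    """Assumed definition: a note counts if the sets overlap, or both are empty."""
    hits = sum(bool(g & p) or (not g and not p) for g, p in zip(gold, pred))
    return hits / len(gold)


def micro_f1(gold, pred):
    """Micro-averaged F1 over all (note, label) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0


# Toy annotations using the abstract's label names (SI, SA, ES, NSSI); data is made up.
gold = [{"SI"}, {"SI", "SA"}, set(), {"NSSI"}]
pred = [{"SI"}, {"SA"}, set(), {"ES"}]

em = exact_match(gold, pred)
pm = partial_match(gold, pred)
f1 = micro_f1(gold, pred)
```

Reporting exact and partial match side by side shows how often a model gets part of a co-occurring label set right while missing the rest, which a single binary score would hide.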


Article 176

Title@2025-07-22 (2): ORANSight-2.0: Foundational LLMs for O-RAN

Title: ORANSight-2.0: Foundational LLMs for O-RAN ORANSight-2.0: LLM-Grundlagen für O-RAN ORANSight-2.0.0:O-RAN基础项目 2503.05200v2

Authors (2): Pranshav Gajjar, Vijay K. Shah

Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O-RAN) remains limited. This gap is primarily due to the absence of domain-specific foundational models, with existing solutions often relying on general-purpose LLMs that fail to address the unique challenges and technical intricacies of O-RAN. To bridge this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative to develop specialized foundational LLMs tailored for O-RAN. Built on 18 models spanning five open-source LLM frameworks – Mistral, Qwen, Llama, Phi, and Gemma – ORANSight-2.0 fine-tunes models ranging from 1B to 70B parameters, significantly reducing reliance on proprietary, closed-source models while enhancing performance in O-RAN-specific tasks. At the core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation (RAG)-based instruction-tuning framework that employs two LLM agents – a Mistral-based Question Generator and a Qwen-based Answer Generator – to create high-quality instruction-tuning datasets. The generated dataset is then used to fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O-RAN stack.

尽管大语言模型(LLMs)在医疗保健、客户服务和商业营销等关键领域产生了变革性影响,但其在开放无线接入网(O-RAN)中的应用仍然有限。造成这一差距的主要原因是缺乏领域专用的基础模型,现有方案往往依赖通用LLMs,无法应对O-RAN的独特挑战和技术复杂性。为弥补这一差距,我们推出ORANSight-2.0(O-RAN Insights),这是一项为O-RAN开发专用基础LLMs的开创性举措。ORANSight-2.0基于横跨五个开源LLM框架(Mistral、Qwen、Llama、Phi和Gemma)的18个模型,对参数规模从1B到70B的模型进行微调,在提升O-RAN专项任务性能的同时,大幅降低对专有闭源模型的依赖。ORANSight-2.0的核心是RANSTRUCT,一个新颖的基于检索增强生成(RAG)的指令微调框架,它使用两个LLM智能体(基于Mistral的问题生成器和基于Qwen的答案生成器)来创建高质量的指令微调数据集。生成的数据集随后用于通过QLoRA微调这18个预训练的开源LLMs。为评估ORANSight-2.0,我们引入srsRANBench,这是一个针对srsRAN(一个广泛使用的5G O-RAN协议栈)场景下代码生成和代码库理解而设计的新基准。


Article 177

Title@2025-07-22 (2): Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks

Title: Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks Obscured, aber nicht gelöscht: Bewertung von Nationalitäts-Bias in LLMs über namensbasierte Bias-Benchmarks 以名称为依据的Bias基准在LLMs中评估国籍偏见 2507.16989v1

Authors (7): Giulio Pelosio, Devesh Batra, Noémie Bovey, Robert Hankache, Cristovao Iglesias, Greig Cowan, Raad Khraishi

Large Language Models (LLMs) can exhibit latent biases towards specific nationalities even when explicit demographic markers are not present. In this work, we introduce a novel name-based benchmarking approach derived from the Bias Benchmark for QA (BBQ) dataset to investigate the impact of substituting explicit nationality labels with culturally indicative names, a scenario more reflective of real-world LLM applications. Our novel approach examines how this substitution affects both bias magnitude and accuracy across a spectrum of LLMs from industry leaders such as OpenAI, Google, and Anthropic. Our experiments show that small models are less accurate and exhibit more bias compared to their larger counterparts. For instance, on our name-based dataset and in the ambiguous context (where the correct choice is not revealed), Claude Haiku exhibited the worst stereotypical bias scores of 9%, compared to only 3.5% for its larger counterpart, Claude Sonnet, where the latter also outperformed it by 117.7% in accuracy. Additionally, we find that small models retain a larger portion of existing errors in these ambiguous contexts. For example, after substituting names for explicit nationality references, GPT-4o retains 68% of the error rate versus 76% for GPT-4o-mini, with similar findings for other model providers, in the ambiguous context. Our research highlights the stubborn resilience of biases in LLMs, underscoring their profound implications for the development and deployment of AI systems in diverse, global contexts.

即使不存在明确的人口统计标记,大语言模型(LLMs)也可能对特定国籍表现出潜在偏见。在这项工作中,我们提出一种新颖的基于姓名的基准方法,该方法源自问答偏见基准(BBQ)数据集,用于研究以具有文化指示性的姓名替代明确国籍标签的影响,这一场景更贴近LLM的实际应用。我们考察了这种替换如何影响来自OpenAI、Google和Anthropic等行业领先者的一系列LLMs的偏见程度和准确率。实验表明,与较大的模型相比,小模型准确率更低、偏见更多。例如,在我们基于姓名的数据集上,在模糊语境(未给出正确选项的信息)中,Claude Haiku的刻板印象偏见得分最差,为9%,而其更大的同系列模型Claude Sonnet仅为3.5%,后者的准确率也比前者高出117.7%。此外,我们发现小模型在这些模糊语境中保留了更大比例的原有错误。例如,在用姓名替换明确的国籍指称后,在模糊语境中GPT-4o保留了68%的错误率,而GPT-4o-mini为76%,其他模型提供商也有类似发现。我们的研究凸显了LLMs中偏见的顽固性,强调了其对在多元的全球环境中开发和部署AI系统的深远影响。


Article 178

Title@2025-07-22 (2): Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

Title: Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain Nutzung synthetischer Daten zur Beantwortung von Fragen mit mehrsprachigen LLMs im landwirtschaftlichen Bereich 利用合成数据在农业领域利用多种语言LLM 解答问题 2507.16974v1

Authors (9): Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar

Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically yields generic advisories that lack precision in local and multilingual contexts, owing to insufficient domain-specific training and the scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.

虽然大型语言模型(LLMS)可用于实施问答系统,但仅使用公开的农业通用LMS通常提供通用咨询,由于具体领域培训不足以及缺乏高质量的区域数据集,当地和多语种背景缺乏准确性;我们的研究通过从农业专有文件和微调特定语言的LLMs生成多语种合成农业数据集(英文、印地语、旁遮普语),解决这些局限性。 我们的多语种数据集评价表明,与基线对应方相比,微调模型的实际准确性、相关性和农业共识大有改进。这些结果突出表明,合成数据驱动的、针对具体语言的微调,是提高农业特别是多语种和低资源环境中的LMS绩效的有效战略,通过提供更准确和本地化的农业咨询服务,为弥合不同语言社区AI驱动农业解决方案的知识差距迈出了有意义的一步。


Article 179

Title@2025-07-22 (2): Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning

Title: Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning Text-zu-SPARQL geht über das Englische hinaus: Mehrsprachige Fragen beantworten über Wissen Graphen durch von Menschen inspirierte Vernunft 文字到SPARQL 超越英语:通过人类激发的理由解答多语种问题 2507.16971v1

Authors (2): Aleksandr Perevalov, Andreas Both

Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.

通过多语种的自然语言界面获取知识是信息检索和相关信息领域新出现的挑战之一。知识图中储存的结构化知识可以通过特定的查询语言(例如SPARQL)查询。因此,需要将自然语言投入转化为查询,以满足信息需要。以前的做法主要侧重于将解决下游任务并在最后找到答案的组成部分(例如基于规则的或基于神经的)结合起来。我们引入了mKGQAgent,这是一个人为激励的框架,它打破了将自然语言问题转换成SPARQL查询的任务,形成了模块化的、可解释的子任务。通过利用一个协调的LLM代理工作流程进行规划、实体连接和查询改进――由内文学习的经验库指导――MKQQAgent高效率地处理多种语言的KGQA。在Text2STARQL挑战2025中评估了DBBB和基于公司的KGQA基准,我们的方法在其他参与者中居于首位。这项工作开启了在多语种语言结构中开发类似人的推理系统的新途径。


Article 180

Title@2025-07-22 (2): Functionals in the Clouds: An abstract architecture of serverless Cloud-Native Apps

Title: Functionals in the Clouds: An abstract architecture of serverless Cloud-Native Apps Funktionen in den Clouds: Eine abstrakte Architektur serverloser Cloud-Native Apps 云中的功能:无云软件的抽象结构 2105.10362v6

Authors (3): Stanislaw Ambroszkiewicz, Waldemar Bartyna, Stanislaw Bylka

A Cloud Native Application (CNApp), viewed as a distributed system, is a collection of independent components (microservices) interacting via communication protocols. This motivates an abstract architecture of a CNApp as a dynamically reconfigurable acyclic directed multigraph in which vertices are microservices and edges are the protocols. Generic mechanisms for such reconfigurations evidently correspond to higher-level functions (functionals). This also implies an internal abstract architecture of a microservice as a collection of event-triggered serverless functions (including functions implementing the protocols) that are composed into event-dependent data-flow graphs and dynamically reconfigured at runtime. Again, generic mechanisms for such compositions and reconfigurations correspond to functionals and to higher-order type theory such as that of Coq (https://coq.inria.fr/about-coq). Our contribution is strictly theoretical and relies on the abstract architecture of CNApp, which is closely related to the calculus of functionals and relations. The proposed theoretical approach is an attempt to implement the original idea of programming at the function level postulated by John Backus (1978), an idea that is still waiting to be implemented as a non-von Neumann programming language.

云原生应用CNApp(作为分布式系统)是通过通信协议进行交互的独立组件(微服务)的集合。由此可以给出CNApp的抽象架构:一个可动态重新配置的无环有向多重图,其中顶点是微服务,边是协议。这种重新配置的通用机制显然对应于更高层次的函数(泛函)。这也意味着微服务的内部抽象架构是事件触发的无服务器函数(包括实现协议的函数)的集合,这些函数被组合成依赖事件的数据流图,并在运行时动态地重新配置。同样,这种组合和重新配置的通用机制对应于泛函以及Coq(https://coq.inria.fr/about-coq)这类高阶类型论。我们的贡献完全是理论性的,并依赖于CNApp的抽象架构,该架构与泛函和关系的演算密切相关。所提出的理论方法试图实现John Backus于1978年提出的函数级编程的最初构想,这一构想仍有待作为一种非冯·诺依曼编程语言来实现。


Article 181

Title@2025-07-22 (2): Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs

Title: Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs Nutzung von RLHF für robuste Unannehmbarkeitserkennung und vertrauenswürdige Reaktionsgenerierung in LLMs 利用RLHF在LLM中利用RLHF促进强有力的无法回答的承认和可信赖的应对生成 2507.16951v1

Authors (4): Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng

Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the LLM’s generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning with human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU’s superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly “know when to say ‘I don’t know’.”

对话式信息检索(CIR)系统在提供直观的信息获取方式的同时,面临一项重大挑战:可靠地处理无法回答的问题,以防止产生误导性或幻觉内容。传统方法往往依赖外部分类器,这可能导致与核心生成式大型语言模型(LLM)不一致。本文介绍了面向不可回答性的自我感知LLM(SALU),这是一种新颖的方法,将不可回答性检测直接深度整合到LLM的生成过程中。SALU使用多任务学习框架进行训练,既针对标准问答(QA),也针对无法回答的查询显式生成拒答。关键是,它包含一个由置信度分数引导的基于人类反馈的强化学习(RLHF)阶段,明确惩罚幻觉回答并奖励恰当的拒答,从而培养模型对知识边界的内在自我认知。通过在我们自建的C-IR_Answerability数据集上进行的大量实验,SALU在正确回答或拒答问题的总体准确率上始终优于包括混合LLM-分类器系统在内的强基线。人工评估进一步证实了SALU更高的可靠性:它在事实性和恰当拒答方面得分很高,最重要的是显著减少了幻觉,表明它能够稳健地“知道何时说‘我不知道’”。


Article 182

Title@2025-07-22 (2): 3LM: Bridging Arabic, STEM, and Code through Benchmarking

Title: 3LM: Bridging Arabic, STEM, and Code through Benchmarking 3LM: Arabisch, MINT und Code durch Benchmarking überbrücken 3LM:通过基准确定连接阿拉伯语、STEM和代码 2507.15850v2

Authors (8): Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

阿拉伯语是世界上最广泛使用的语言之一,然而,为阿拉伯语开发和评价大语言模型(LLMs)的努力仍然相对有限,大多数现有的阿拉伯语基准侧重于语言、文化或宗教内容,在STEM和代码等与现实世界LM应用程序越来越相关的领域留下了巨大差距。为了帮助弥合这一差距,我们提出了3LM,这是一套专为阿拉伯语设计的三套基准。第一套是一套与STEM有关的问答组合,天然地来自阿拉伯教科书和教育工作单。第二套是合成产生的STEM问题,利用同样的来源创建。第三套基准侧重于代码生成,通过仔细翻译两种广泛使用的代码基准而建立起来,包括几轮审查,以确保高质量和忠诚的翻译。我们公开发布所有三种基准,支持阿拉伯语LM在这些基本但代表性不足的领域扩大研究。


Article 183

Title@2025-07-22 (2): AI-based Clinical Decision Support for Primary Care: A Real-World Study

Title: AI-based Clinical Decision Support for Primary Care: A Real-World Study KI-basierte klinische Entscheidungsunterstützung für die Primärversorgung: Eine Real-World-Studie 基于AI的初级保健临床决定支持:现实世界研究 2507.16947v1

Authors (18): Robert Korom, Sarah Kiptinness, Najib Adan, Kassim Said, Catherine Ithuli, Oliver Rotich, Boniface Kimani, Irene King’ori, Stellah Kamau, Elizabeth Atemba, Muna Aden, Preston Bowman, Michael Sharman, Rebecca Soskin Hicks, Rebecca Distler, Johannes Heidecke, Rahul K. Arora, Karan Singhal

We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was “substantial”. These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.

我们与肯尼亚内罗毕初级诊所网络Penda Health合作,研究了AI咨询公司这一工具,该工具通过查明潜在的文件和临床决策错误,成为临床医师的安全网。AI咨询公司将仅在需要时才启动并维护临床自主性。我们进行了质量改进研究,比较了在15个诊所有或没有获得AI咨询的临床医师进行的39 849次门诊检查的结果。访问被独立医生评为诊断错误。获得AI咨询的临床医师的错误相对较少:诊断错误减少16%,治疗错误减少13%。绝对而言,AI咨询公司将避免每年在Penda的29 000次访问中出现诊断错误和治疗错误。在AI咨询公司对临床医生进行的调查中,所有临床医生都说,AI咨询公司提高了他们提供护理的质量,75%的疗效是“实质性的”。这些结果要求临床工作流程一致的AI咨询实施和积极部署支持,以鼓励诊所采用实际的LM工具。我们希望这一研究能够减少临床诊断工具。


Article 184

Title@2025-07-22 (2): SiLQ: Simple Large Language Model Quantization-Aware Training

Title: SiLQ: Simple Large Language Model Quantization-Aware Training SiLQ: Einfaches großsprachiges Modell Quantization-Aware Training SiLQ: 简单大语言模型量化软件培训 2507.16933v1

Authors (5): Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
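
The abstract does not spell out the quantizer itself, so the following is only a generic sketch of the quantize-then-dequantize ("fake quantization") step that quantization-aware training typically inserts into the forward pass. The function name `fake_quantize`, the symmetric per-tensor scale, and the 4-bit default are illustrative assumptions, not SiLQ's actual configuration.

```python
def fake_quantize(values, bits=4):
    """Quantize-dequantize floats on a symmetric uniform grid (illustrative only).

    Returns the dequantized values and the scale that was used.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit integers
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values), 1.0  # nothing to scale
    scale = max_abs / qmax  # assumed per-tensor scale from the largest magnitude
    dequantized = []
    for v in values:
        q = round(v / scale)            # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clip to the representable range
        dequantized.append(q * scale)   # map back to floating point
    return dequantized, scale


weights = [0.0, 0.26, -1.0, 0.73]
deq, scale = fake_quantize(weights, bits=4)
print(deq, scale)
```

During training, the rounding step is usually paired with a straight-through gradient estimator so that gradients can flow through it; that part is omitted here.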

大型语言模型可以量化,以减少推论时间延迟、模型大小和能源消耗,从而以较低的成本提供更好的用户经验。 存在一项挑战,即提供量化模型,在合理时间内尽可能降低准确性,特别是这样做时不要求采用与专门推论加速器不兼容的机制。 在这里,我们展示了一种简单、端到端的量化-认知培训方法,随着总模型培训预算增长不到0.1%,在几个现代基准上,以基础和指示模型变异方式,以大利润率优于已公布的主要量化方法。 这种方法很容易在不同的模型结构中普遍采用,可以适用于激活、缓存和重量,并且除了四分法本身之外,不需要在模型中引入额外的操作。


Article 185

Title@2025-07-22 (2): Modeling Public Perceptions of Science in Media

Title: Modeling Public Perceptions of Science in Media Modellierung öffentlicher Wahrnehmungen von Wissenschaft in Medien 模拟公众对媒体科学的看法 2506.16622v2

Authors (4): Jiaxin Pei, Dustin Wright, Isabelle Augenstein, David Jurgens

Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals’ frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and a carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, a pattern that holds both across different scientific information and for the same science framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.

有效地让公众了解科学对于增进我们科学界的信任和理解至关重要。然而,随着信息量的不断增加,科学传播者难以预测受众如何看待科学新闻以及如何与科学新闻互动。在本文中,我们引入了一个计算框架,从12个维度(例如新闻价值、重要性和意外性)对公众感知进行建模。我们利用这个框架创建了一个大规模科学新闻感知数据集,包含来自美国和英国不同人群的2,101名参与者提供的10,489条标注,为了解公众对各领域科学信息的反应提供了宝贵的见解。我们进一步开发了NLP模型,能够以较强的性能预测公众感知得分。利用数据集和模型,我们从两个角度审视公众对科学的感知:(1) 感知作为结果:哪些因素影响公众对科学信息的感知?(2) 感知作为预测因子:我们能否利用估计的感知来预测公众对科学的参与?我们发现,个人阅读科学新闻的频率是感知的驱动因素,而人口统计因素的影响很小。更重要的是,通过大规模分析和在Reddit上精心设计的自然实验,我们证明对科学信息的公众感知估计值与最终的参与模式有直接联系。感知得分更为正面的帖子获得的评论和点赞明显更多,这一点在不同的科学信息之间以及在同一科学内容的不同表述之间都是一致的。总体而言,这项研究强调了细致的感知建模在科学传播中的重要性,为预测公众对科学内容的兴趣和参与提供了新途径。


Article 186

Title@2025-07-22 (2): A Unifying Scheme for Extractive Content Selection Tasks

Title: A Unifying Scheme for Extractive Content Selection Tasks Ein einheitliches Schema für die Auswahl von extraktiven Inhalten 开采内容选择任务统一办法 2507.16922v1

Authors (4): Shmuel Amar, Ori Shapira, Aviv Slobodkin, Ido Dagan

A broad range of NLP tasks involve selecting relevant text spans from given source texts. Despite this shared objective, such content selection tasks have traditionally been studied in isolation, each with its own modeling approaches, datasets, and evaluation metrics. In this work, we propose instruction-guided content selection (IGCS) as a beneficial unified framework for such settings, where the task definition and any instance-specific request are encapsulated as instructions to a language model. To promote this framework, we introduce IGCSBench, the first unified benchmark covering diverse content selection tasks. Further, we create a large generic synthetic dataset that can be leveraged for diverse content selection tasks, and show that transfer learning with these datasets often boosts performance, whether dedicated training for the targeted task is available or not. Finally, we address generic inference time issues that arise in LLM-based modeling of content selection, assess a generic evaluation metric, and overall propose the utility of our resources and methods for future content selection models. Models and datasets available at https://github.com/shmuelamar/igcs.

在这项工作中,我们提议将指令引导的内容选择(IGCS)作为这种环境的一个有益的统一框架,在这种环境中,任务定义和任何针对具体实例的请求都被封装为对语言模型的指令。为了推广这一框架,我们引入了IGCSBench,这是涵盖不同内容选择任务的第一个统一基准。此外,我们创建了一个大型的通用合成数据集,可以用于多种内容选择任务,并表明利用这些数据集进行迁移学习往往能够提高性能,无论是否有针对目标任务的专门训练。最后,我们处理基于LLM的内容选择建模中出现的通用推理时间问题,评估一种通用评价指标,并总体上说明我们的资源和方法对未来内容选择模型的效用。模型和数据集可在 https://github.com/shmuelamar/igcs 获取。


Article 187

Title@2025-07-22 (2): MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Title: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning MegaScience: Die Grenzen von Post-Training-Datensätzen für wissenschaftliche Vernunft sprengen 超科学:推进培训后数据集的前沿,促进科学理性 2507.16812v1

Authors (3): Run-Ze Fan, Zengzhi Wang, Pengfei Liu

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.

科学推理对于发展AI科学家和支持人类研究人员推进自然科学发现前沿至关重要,然而,开放源码社区主要侧重于数学和编码,而忽视了科学领域,这主要是因为缺乏开放、大规模、高质量、可核实的科学推理数据集。为了缩小这一差距,我们首先推出一个开放的数据集,其中包含从12k所大学科学教科书中提取的真诚的参考答案,其中包括涵盖7个科学学科的650k个推理问题。我们进一步引入了MegaScience,这是一个大型的高质量开放源数据集组合,共125万个案例,通过系统化的模拟研究开发,评估各种数据选择方法,以确定每个公开提供的科学数据集的最佳分类。与此同时,我们建立了一个涵盖15个基准的不同主题和问题类型的综合评价系统,其中包括全面的答复提取战略,以确保准确的评价指标。我们的实验表明,我们的数据集取得了优异的绩效和培训效率,与现有的公开源科学数据集相比,我们培训Lperlam3.1、Qwen2.5和QwenScial3系统, 向更强有力的科学评估模型提供更强的升级的模型。


Article 188

Title@2025-07-22 (2): Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Title: Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty Über Binäre Belohnungen hinaus: LMs zur Vernunft über ihre Ungewissheit ausbilden 超越二元奖励:训练语言模型对其不确定性进行推理 2507.16806v1

Authors (7): Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score – a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations – outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
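
To make the reward described above concrete, here is a minimal sketch that combines a binary correctness score with a Brier penalty on the stated confidence. The exact combination (correctness minus squared error, unweighted) and the function names are assumptions for illustration; the abstract only says that the correctness score is augmented with a Brier score. The second function checks numerically why such a reward encourages calibration: if an answer is correct with probability `p`, the expected reward is highest when the reported confidence equals `p`.

```python
def calibration_reward(confidence: float, correct: bool) -> float:
    """Binary correctness minus the Brier penalty (assumed form, for illustration)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(confidence: float, p_correct: float) -> float:
    """Expected reward when the answer is correct with probability p_correct."""
    return (p_correct * calibration_reward(confidence, True)
            + (1.0 - p_correct) * calibration_reward(confidence, False))


# Grid search: which reported confidence maximizes the expected reward?
p = 0.7
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda q: expected_reward(q, p))
print(best)
```

Under this assumed form, a confident wrong answer is penalized more heavily than an uncertain wrong answer, which is exactly what a purely binary reward cannot express.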

当语言模型(LM)通过强化学习(RL)训练来生成自然语言“推理链”时,它们在各种困难的问答任务上的表现会有所提高。如今,几乎所有成功用于推理的RL应用都使用评估LM输出正确性的二元奖励函数。由于这种奖励函数不惩罚猜测或低置信度输出,它们往往会带来意外的副作用:校准变差,并且LM在其他问题领域产生错误回答(或“幻觉”)的比率上升。本文介绍了RLCR(带校准奖励的强化学习),这是一种训练推理模型的方法,可同时提高准确性和校准的置信度估计。在RLCR中,LM在推理之后同时生成预测和数值置信度估计。训练目标是优化一个奖励函数,该函数在二元正确性得分的基础上加入Brier分数,这是一种针对置信度估计、激励校准预测的评分规则。我们首先证明,这种奖励函数(或任何使用有界的恰当评分规则的类似奖励函数)所得到的模型,其预测既准确又校准良好。接下来我们表明,在不同的数据集上,RLCR在域内和域外评估中都能在不损失准确性的情况下大幅改善校准,优于普通RL训练和为给出事后置信度分数而训练的分类器。普通RL训练会损害校准,而RLCR则改善校准。最后,我们证明可以在测试时利用语言化的置信度,通过置信度加权的扩展方法提高准确性和校准。我们的结果表明,显式地针对校准进行优化可以产生总体上更可靠的推理模型。


Article 189

Title@2025-07-22 (2): Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Title: Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning Steuerung der Out-of-Distribution-Verallgemeinerung mit Konzeptablation Fine-Tuning 通过概念消融微调引导分布外泛化 2507.16795v1

Authors (6): Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda

Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM’s latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
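
The core linear-algebra step described above, removing the component of a vector that lies along a set of concept directions, can be sketched as follows. This shows only the projection on plain Python lists; applying it to a model's activations during fine-tuning, and how the directions are found, are outside the sketch. The helper names `orthonormalize` and `ablate` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions):
    """Gram-Schmidt: turn possibly correlated directions into an orthonormal basis."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip directions already spanned by the basis
            basis.append([vi / norm for vi in v])
    return basis


def ablate(activation, directions):
    """Project the activation onto the orthogonal complement of the directions."""
    out = list(activation)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


h = [3.0, 4.0, 5.0]
concepts = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(ablate(h, concepts))
```

Orthonormalizing first matters when the concept directions overlap: subtracting each raw direction in turn would otherwise leave a residual component inside their span.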

大型语言模型(LLMS)的微调可导致意外的分配外一般化。 这一问题的标准方法取决于修改培训数据,例如增加更精确地说明预期的概括化的数据。然而,这并非始终是实用的。我们引入了“Ablation Final-Turning”(CAFT)这一技术,利用可解释性工具来控制LMS如何从微调中普遍化,而无需修改培训数据,或使用目标分布的数据。鉴于LMS潜在空间中一套与未预期的概念相对应的方向,CAFT在微调期间将这些概念与线性预测相融合,引导模型远离意外的概括化。我们成功地应用CAFT(CAFT)来调整三项微调任务,包括突发的偏差,LMS微调了一种现象,即微调LMS对一般问题作出极为错误的反应。在对数据进行微调的情况下,CAFT在不减损培训分布上的表现的情况下,将偏差反应减少10x。总的来说,CFTAFT是一种在不修改培训数据的情况下指导LM一般化的新做法。


Article 190

Title@2025-07-22 (2): Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

Title: Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning Beyond Context Limits: Unterbewusste Threads für die Long-Horizon Reasoning 超越上下文限制: 长霍氏理由的潜意识线条 2507.16784v1

Authors (10): Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O’Brien, James Glass

To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.

为了打破大型语言模型(LLMS)的背景限制(LLMS),这些语言模型可以抑制推理准确性和效率,我们建议采用Tread Inference 模型(TIM),这是一个受过反复和分解问题解决和分解问题的训练的LLMS系列,另一个是TIMUN,这是一个在环境限度以外进行长期和连续结构推理的推理过程。TIM在TIMURUN上主持,共同支持一个单一语言模型推理中几乎无限的工作记忆和多霍工具呼唤,克服产出限制、定位粘合限制和GPU-MUMER瓶颈。通过以长度和深度衡量的推理树来模拟自然语言,实现绩效。这些推理树包括思考、循环子、基于我们在Schroeder et et 等人( 2025年) 中提出的概念而得出的结论。我们保持的工作记忆只保留最相关的语系的关键值状态,它通过基于规则的子任务处理机制选择,使得定位嵌入和GPU记忆页面的回溯度页,在逻辑上整个逻辑推理学期间,还显示我们的实验结果。


Article 191

Title@2025-07-22 (2): SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Title: SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods SenWiCh: Sense-Annotation von Low-Resource-Sprachen für WiC mit Hybrid-Methoden SenWiCH: 使用混合方法为无线电通信中心提供低资源语言的高级说明 2505.23714v2

Authors (13): Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.

虽然跨语文转让是利用多语文预先培训的关键战略,将语言技术推广到研究不足和类型多样的语言,但其有效性取决于质量和适当基准。我们发布了包含多种语言、跨越不同语言家庭和文字的10种低资源语言的带感标记的新的句子数据集。为了便利数据集的创建,本文件提出了一个明显有益的半自动说明方法。通过Wordin-Context(WicC)格式化的实验展示了数据集的效用,这些实验评价了这些低资源语言的转让。结果突出表明了在低资源环境中创建和评估有效的多语言脱钩组合的重要性。发布数据集和代码的目的是支持对公平、稳健和真正多语言的NLP的进一步研究。


Article 192

Title@2025-07-22 (2): GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Title: GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding GUI-G$^2$:用于GUI定位的高斯奖励建模 2507.15846v2

Authors (12): Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
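
As a rough illustration of the Gaussian point reward and the adaptive variance idea, the sketch below scores a predicted click against a target bounding box with an axis-aligned Gaussian centered on the box centroid, whose standard deviations scale with the box width and height. The scaling factor `alpha`, the function name, and the exact functional form are assumptions; the coverage reward is not modeled here.

```python
import math


def gaussian_point_reward(click, box, alpha=0.5):
    """Reward in (0, 1] that decays smoothly with distance from the box centroid.

    click: (x, y) predicted position
    box:   (x1, y1, x2, y2) target element bounds
    alpha: assumed factor tying the standard deviation to the element size
    """
    x, y = click
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma_x, sigma_y = alpha * width, alpha * height  # adaptive variance
    z = ((x - cx) / sigma_x) ** 2 + ((y - cy) / sigma_y) ** 2
    return math.exp(-0.5 * z)


small_button = (100, 100, 200, 140)
print(gaussian_point_reward((150, 120), small_button))  # click on the centroid
print(gaussian_point_reward((160, 120), small_button))  # click 10 px to the right
```

Unlike a hit-or-miss reward, a near miss still receives a graded signal, and the same pixel offset costs less on a large element than on a small one.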

图形用户界面(GUI)定位将自然语言指令映射到精确的界面位置,以实现自主交互。当前的强化学习方法使用二元奖励,把元素视为非中即失的目标,产生稀疏信号,忽略了空间交互的连续性。人类的点击行为自然形成以目标元素为中心的高斯分布,受此启发,我们提出GUI高斯定位奖励(GUI-G$^2$),这是一个有原则的奖励框架,将GUI元素建模为界面平面上的连续高斯分布。GUI-G$^2$包含两个协同机制:高斯点奖励通过以元素质心为中心的指数衰减分布来刻画精确定位,覆盖奖励则通过测量预测的高斯分布与目标区域之间的重叠来评估空间对齐。为了处理不同的元素尺度,我们开发了一种自适应方差机制,根据元素尺寸校准奖励分布。该框架将GUI定位从稀疏的二元分类转变为密集的连续优化,高斯分布产生丰富的梯度信号,引导模型走向最优交互位置。在ScreenSpot、ScreenSpot-v2和ScreenSpot-Pro基准上的大量实验表明,GUI-G$^2$显著优于最先进的方法UI-TARS-72B,其中在ScreenSpot-Pro上的提升最为显著,达24.7%。我们的分析表明,连续建模对界面变化具有更强的鲁棒性,并能更好地泛化到未见过的布局,为GUI交互任务中的空间推理建立了新范式。
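
The contrast between a binary hit-or-miss reward and a Gaussian point reward can be illustrated with a minimal sketch. The function names, the `alpha` scale factor, and the choice of tying each standard deviation to the element's width and height are assumptions made for illustration; the abstract does not give the exact formula, and the coverage reward is not modeled here.

```python
import math

def binary_reward(pred, box):
    """Hit-or-miss reward: 1.0 inside the target box, 0.0 anywhere else."""
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= pred[0] <= x2 and y1 <= pred[1] <= y2 else 0.0

def gaussian_point_reward(pred, box, alpha=0.5):
    """Dense reward in (0, 1] that decays with distance from the box centre.

    pred:  (x, y) predicted click point.
    box:   (x1, y1, x2, y2) target bounds, assumed to have positive size.
    alpha: hypothetical factor that scales sigma with the element size,
           standing in for the adaptive variance mechanism.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma_x = alpha * (x2 - x1)
    sigma_y = alpha * (y2 - y1)
    dx = (pred[0] - cx) / sigma_x
    dy = (pred[1] - cy) / sigma_y
    return math.exp(-0.5 * (dx * dx + dy * dy))
```

Under these assumptions, a click just outside the box earns nothing from `binary_reward` but still earns a positive `gaussian_point_reward`, so near misses remain distinguishable from far misses.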


Article 193

Title@2025-07-22 (2): Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals

Title: Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals Unpacking Ambiguity: Die Wechselwirkung von Polysem-Diskursmarkern und Nicht-DM-Signalen 拆包装模糊性:多相相片标记器和非DM信号的相互作用 2507.16748v1

Authors (2): Jingni Wu, Amir Zeldes

Discourse markers (DMs) like ‘but’ or ‘then’ are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs (‘in the morning’ can mean the same as ‘then’), and both can be ambiguous (‘since’ can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.

类似“ 但” 或“ 当” 这样的分解符( DMs) , 是创造对话一致性的关键, 但是它们往往被非DMs所取代或被非DMs所共生( “ 上午” 指“ 当” ) 取代或被非DMs所取代, 两者可能含混( “ 因为” 指时间或原因 ) 。 这些信号之间的相互作用机制仍然不清楚, 但对于其模糊不清至关重要 。 在本文中, 我们调查了 DM 聚变和英文非DM 信号的共同发生之间的关系, 以及基因对这些模式的影响。 此外, 我们建议对 DM 聚苯乙烯 进行分级定义, 并进行相关和回归分析, 以检查多聚体型DMs 是否配有数量更多和种类不同的非DM 信号。 我们的研究结果表明, 虽然多聚体DMSDs 与更多不同的非DMs共生, 共发信号的总数不一定增加。 此外, 基因在决定DMDM 的相互作用方面起着重要作用 。


Article 194

Title@2025-07-22 (2): Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Title: Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Zebra-CoT: Ein Datensatz für interleaved Vision Language Reasoning Zebra-CoT:关于不同视力语言理由的数据集 2507.16746v1

Authors (14): Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT’s effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.

人类在解决复杂问题时常常使用视觉辅助工具,例如图表或草图,解决复杂问题时往往使用视觉辅助工具,例如图表或草图。培训多式联运模型来做同样的事情,称为视觉思维链(视觉COT),具有挑战性,原因是:(1) 视觉视觉COT表现差,阻碍强化学习;(2) 缺乏高质量的视觉COT培训数据。 我们引入了具有182,384个样本的多种大型数据集,包含逻辑上一致的相互脱节文本图像推理痕迹。 我们侧重于四种任务,其中的素描或视觉推理特别自然,涵盖科学问题,如几何学、物理和算法; 2D 视觉COT 的视觉推理工作,如视觉搜索和拼图拼图; 3D 推理工作,包括3D多点推理、装饰和机器人规划; 视觉逻辑问题和象棋等战略游戏。 精细调整Zebra-CoT 开放式训练教材模型的Anole-7B模型模型,改进了我们测试的精确度或视觉推理模型的增至+13%。


Article 195

Title: RAVine: Reality-Aligned Evaluation for Agentic Search RAVine: Realitätsorientierte Bewertung für die Agentische Suche RAVine: 化学搜索的现实统一评价 2507.16725v1

Authors (4): Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao

Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine – a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model’s interaction with search tools throughout the iterative process, and accounts for efficiency factors. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

作为一种更自主和适应性的增强检索模式,机械搜索正在推动智能搜索系统的演进,但是,现有的评价框架未能与代理搜索的目标相一致。首先,当前基准中常用的复杂查询往往与现实用户搜索情景不同。其次,先前的方法往往在为端到端评价提取地面真相时引入噪音,导致微小评估的扭曲。第三,大多数当前框架仅侧重于最终答案的质量,忽视了对代理搜索所固有的迭接过程的评价。为克服这些限制,我们提议了RAVine – – 一个用于搜索的代理LLMS的真实性-统一电子估价框架。RAVine针对更好地反映用户意图的多点查询和长式答案,并提出了可归属的地面真相构建战略,以提高精细评估的准确性。此外,RAVine还检查了模型在整个迭接过程中与搜索工具的相互作用,忽略了对效率因素的核算。我们用RAVine为一系列模型设定基准,并提出了若干洞察力,我们希望这将推动代理搜索系统的发展。


Article 196

Title@2025-07-22 (2): Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

Title: Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory Erfahrung ist der beste Lehrer: Erdung von VLMs für Robotik durch selbsterzeugtes Gedächtnis 经验是最好的教师:通过自创记忆,为机器人创造VLMs 2507.16713v1

Authors (7): Guowei Lan, Kaixian Qu, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter

Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.

在机器人中广泛采用愿景模型(VLM),以便能够进行自主规划。然而,最初在互联网数据上培训的VLM(VLM),将VLM(VLM)定位为多种真实世界机器人,这仍然是一个挑战。本文展示了Explete,这是将VLM(VLM)定位为物理机器人的框架,通过建立自生成的现实世界经验记忆。在Explettele中,VLM(VLM)自主计划行动,核查结果,对失败进行反省,在封闭循环中调整机器人行为。在此过程中,自生成的经验被总结为长期记忆,从而能够检索学到的知识,通过检索启动的一代(RAG)来指导未来的任务。此外,Explete(Exptech)用点名图像注释模块来提升VLMM(VLM)的空间理解。在实验中,我们显示反思提高了四项挑战性机器人任务的成功率,从36%提高到84%,并观察智能物体互动,包括创造性工具的出现。在12个现实世界情景(包括8个隐形情景)的大规模测试中,我们发现,以长期记忆定位为基础,展示了从22 %到80的探索成功率的基点的地面提升,展示。


Article 197

Title@2025-07-22 (2): Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance

Title: Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance Advancing Risk and Quality Assurance: Ein RAG Chatbot für verbesserte regulatorische Compliance 提高风险和质量保证:改进监管合规的RAG Chadbot 2507.16711v1

Authors (10): Lars Hillebrand, Armin Berger, Daniel Uedelhoven, David Berghaus, Ulrich Warning, Tim Dilmaghani, Bernd Kliem, Thomas Schmid, Rüdiger Loitz, Rafet Sifa

Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.

高度监管行业的风险和质量(R)保证要求不断对复杂的监管框架进行导航,员工处理许多日常询问时需要准确的政策解释。依赖专家的传统方法造成了操作瓶颈和限制可扩缩性。我们提出了一个利用大语言模型(LLMS)、混合搜索和相关性的新型回收增殖(RAG)系统,以加强R查询处理。我们积极部署的系统根据124个专家附加说明的现实世界查询进行了评估,表明与传统的RAG方法相比有了重大改进。此外,我们进行了广泛的超光谱分析,以比较和评价多种配置设置,向从业者提供宝贵的洞察力。


Article 198

Title@2025-07-22 (2): Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM

Title: Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM Interpretierbare Themenextraktion und Wort-Embedding Lernen mit zeilenstochastischem DEDICOM 利用行可查的DEDICOM进行可解释专题抽取和单词嵌入学习 2507.16695v1

Authors (4): Lars Hillebrand, David Biesner, Christian Bauckhage, Rafet Sifa

The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and a qualitative evaluation of its topic modeling and word embedding performance.

DEDICOM算法为对称和不对称平方矩阵提供了一种独特的可解释矩阵乘数化方法,我们采用了DEDICOM在文本公司点对点相互信息矩阵上的新行随机变式,以查明词汇中的潜在主题组,同时学习可解释的词嵌入。我们引入了一种方法,对DEDICOM的有限算法进行有效的培训,并对主题建模和词嵌入性表现进行定性评估。
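
For orientation, the factorization described above can be written out as follows. The symbols are assumptions chosen for illustration: $S$ is the $n \times n$ pointwise mutual information matrix over a vocabulary of $n$ words, $A$ is the $n \times k$ loading matrix whose rows serve as word embeddings over $k$ topics, and $R$ is the $k \times k$ affinity matrix between topics. How the paper enforces the row-stochastic constraint during training is not stated in the abstract.

$$
S_{ij} = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \qquad
S \approx A R A^{\top}, \qquad
A_{ij} \ge 0, \quad \sum_{j=1}^{k} A_{ij} = 1 \ \text{ for every row } i .
$$

Because each row of $A$ is non-negative and sums to one, it can be read as a distribution over topics for that word, which is what makes the embedding directly interpretable.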


Article 199

Title@2025-07-22 (2): Universal Model Routing for Efficient LLM Inference

Title: Universal Model Routing for Efficient LLM Inference Universelle Modellführung für effiziente LLM-Inferenz 高效LLM 推导法通用通用模型规则 2502.08773v2

Authors (12): Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.

模型路由是减少大型语文模型(LLMs)的推论成本的简单技术,其中一个人拥有一批候选LMs,并学习如何将每个选择的LLMs迅速运到最小可行的LLM。现有工作的重点是为固定的LLMs群学习路由器。在本文中,我们考虑了动态路由问题,在试验时有新的、以前没有观测到的LLMs可供使用。我们建议UniRoute,这是一个解决这一问题的新办法,它依赖将每个LLMm作为特性矢量来代表。我们根据一套具有代表性的提示所作的预测,详细介绍了UniRoute的两个有效的即时说明,分别依靠基于集束路由的路由和一项学习的集群图。我们表明,这些是理论上最佳路由规则的估计,并通过一个超风险约束来量化其错误。关于一系列公共基准的实验显示UniRoute在超过30个未见的LLMs之间路由有效。
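
A minimal sketch of the cluster-based idea follows, assuming each LLM is represented by its per-cluster error rate on a set of representative prompts and that a query has already been assigned to a prompt cluster (for example by nearest centroid in an embedding space). The function names, the cost values, and the trade-off weight `lam` are hypothetical, not taken from the paper.

```python
def llm_feature_vector(correct_by_cluster):
    """Per-cluster error rates of one LLM on the representative prompts.

    correct_by_cluster: one list of 0/1 correctness flags per prompt cluster.
    """
    return [1.0 - sum(flags) / len(flags) for flags in correct_by_cluster]

def route(cluster_id, llm_features, llm_costs, lam=0.1):
    """Pick the LLM with the lowest (cluster error + lam * cost).

    llm_features: dict of name -> feature vector from llm_feature_vector.
    llm_costs:    dict of name -> relative inference cost (hypothetical units).
    """
    def score(name):
        return llm_features[name][cluster_id] + lam * llm_costs[name]
    return min(llm_features, key=score)

# A previously unseen LLM only needs to be scored once on the representative
# prompts; the router itself is not retrained.
features = {"small": [0.1, 0.6], "large": [0.05, 0.2]}
costs = {"small": 1.0, "large": 4.0}
easy_choice = route(0, features, costs, lam=0.05)
hard_choice = route(1, features, costs, lam=0.05)
```

In this toy setup the cheaper model wins on the cluster where both models are accurate, and the larger model wins where the smaller one fails often.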


Article 200

Title@2025-07-22 (2): PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Title: PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization PICACO: Pluralistische Im-Kontext-Wert-Ausrichtung von LLMs über Gesamtkorrelationsoptimierung PICACO: 通过总关联性优化使LLMs的多元内流价值一致 2507.16679v1

Authors (6): Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, an approach known as In-Context Alignment (ICA). However, LLMs’ comprehension of input prompts remains agnostic, limiting ICA’s ability to address value tensions: human values are inherently pluralistic and often impose conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs’ understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

内容学习显示,将大语言模型(LLMs)与人类价值观相协调,帮助减少有害产出,满足各种偏好,而没有成本高昂的培训后(称为In-Context对齐(ICA)),大语言模型(LLMs)与人类价值观相协调的潜力巨大,但LLMs对投入提示的理解仍然不可想象,限制了ICA处理价值紧张因素-人类价值观的能力,这在本质上是多元的,常常造成相互冲突的需求,例如刺激与传统。因此,当前的ICA方法面临 “ 指示瓶颈 “ 的挑战,LMs在单一的及时调和多种预期值之间挣扎,导致不完全或偏颇的调和。为了解决这个问题,我们提议PICACO是一种创新的多元ICACO方法。在不作微调的情况下,PICACO优化了一种可引导多种价值观的元教程,以更好地引导LMs了解这些差异,并改进它们之间的调和调和调和调和调和。这是通过最大限度地实现特定价值与减少分散噪音的理论关系,从而产生有效的价值指示实现有效的价值。
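
As background for the objective named above, total correlation has a standard information-theoretic definition. For random variables $X_1, \dots, X_n$ with entropies $H(\cdot)$:

$$
\mathrm{TC}(X_1, \dots, X_n) \;=\; \sum_{i=1}^{n} H(X_i) \;-\; H(X_1, \dots, X_n)
\;=\; D_{\mathrm{KL}}\!\Big( p(x_1, \dots, x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \Big).
$$

It is zero exactly when the variables are mutually independent and grows as they share more information. Which variables PICACO plugs in (for example, one variable per specified value plus the LLM response) and how the quantity is estimated are not given in the abstract, so any concrete instantiation should be treated as an assumption.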


Article 201

Title@2025-07-22 (2): InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification

Title: InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification Internagent: Wenn Agent zum Wissenschaftler wird – Gebäude-Closed-Loop-System von der Hypothese bis zur Verifikation 实习生:当探员成为科学家时 – – 建立从假说到核查的闭线系统 2505.16938v3

Authors (26): InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai

Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

人工智能(AI)正在加速改变科学研究模式,不仅提高研究效率,而且推动创新。我们引入了实习生,这是一个统一的封闭环多试剂框架,在各种科学研究领域进行自主科学研究,使研究人员能够以前所未有的速度和精确度处理这些领域的复杂问题。实习生强调三个主要优势:1)可缩放性:实习生在12项科学研究任务中表现出其多功能性,能够产生创新想法,提高基线代码的性能。 2 互动:实习生在自动化端对端过程中为人类专家反馈和多剂互动提供一个界面,使域专家知识能够无缝地融合。 3 效率:实习生在一些科学领域取得了良好的绩效收益,与人类努力相比,时间成本大大降低。例如,在反应收益预测方面,在12小时内从27.6%增加到35.4%;在强化活动预测方面,准确性从0.65增加到0.79,只有4小时的处理时间;在2D语系分割方面,精确度从78.8%提高到了30 %。


Article 202

Title@2025-07-22 (2): Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

Title: Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs Selbstverachtung als Selbstverbesserung: Der Generationsverständigen-Gap in MLLMs entgegenwirken 自相矛盾即自我改进:缩小MLLM中的生成与理解鸿沟 2507.16663v1

Authors (8): Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Lin, Yingya Zhang, Shiwei Zhang, Difan Zou

Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model’s own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.

尽管在单一模型中努力统一多式联运和理解任务,但我们展示了这些MLLMM公司,尽管努力在单一模型中统一多式联运和理解任务,但我们展示了这些MLLM公司,这些MLM公司在生成图像时表现出自相矛盾,因为其生成的图像被认为与基于模型自身理解的输入速度不符。我们定义了非统一得分,从而量化了这种自相矛盾。我们的经验结果表明,自相矛盾主要来自薄弱的生成,而这种自相矛盾的生成和理解在单一模型中并不协调,而不是误解。这种能力不对称表明利用自我自相矛盾的潜力,在自我改进方面,利用更强有力的模型理解来引导较弱的生成质量,以缩小代间差距。运用标准的培训后培训方法(例如SFT、DPOPO),在这种内部监管中成功地改进了生成和统一。我们发现双重生成和理解的结果效果的效果是,在对生成部门进行微调时,在培训前发现一种已知现象时,在培训后,我们发现一种更精确的内变现,在不断校正的模型中发现,我们不精确的代人之间对结果进行更精确的校正的校正的校正的校正的校正的校正的校正风险。


Article 203

Title@2025-07-22 (2): P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs

Title: P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs P-CoT: Eine pädagogisch motivierte partizipative Kette von Denkanstößen für phonologische Vernunft in LLMs P-Cot:以教育为动机的、旨在激励LLM中声学原因的参与性研究链 2507.16656v1

Authors (3): Dongjun Jang, Youngchae Ahn, Hyopil Shin

This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.

这项研究探索了基于文本的大型语言模型(LLMs)中的声学推理潜力。 利用声学堡垒基准,我们评估了诸如押韵单词生成、g2p转换和音调计算等任务。 我们对12 LLMs的评估显示,虽然少见的学习带来不一致的收益,但引入了创新的、以教育为动机的参与性研究链(P-Cot)快速(它以脚架和发现学习等教育理论为基础),持续提高绩效。 这种方法利用结构化指导来激活潜在的声学能力,实现高达52%的改善,甚至在某些任务中超过人类基线。 未来工作可以优化P-Cot对具体模型的提示或探索其在不同语言领域的应用。


Article 204

Title@2025-07-22 (2): Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

Title: Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models Auf dem Weg zu einer automatisierten Überprüfung der regulatorischen Compliance bei der Finanzprüfung mit großen Sprachmodellen 采用大语言模式进行财务审计自动监管合规核查 2507.16642v1

Authors (11): Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI’s GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70-billion-parameter model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.

对金融文件的审计历来是一个劳动密集型过程,它属于转型的边缘。AI驱动的解决方案通过建议财务报告中的有关文本段落与会计准则的法律要求保持一致,从而简化了这一进程。然而,一个明显的局限性仍然存在:这些系统通常在核实所推荐的节录是否确实符合具体法定任务方面不足。因此,在本文件中,我们探究了在不同模式配置的监管合规领域公开存在的大语言模型(LLLMs)的效率。我们特别强调将Llama-2等尖端开放源LMS与OpenAI的GPT模型等专有对应方比较。这一比较分析利用了我们的伙伴PricewaterhouseCoopers(PwC)德国提供的两套定制数据集。我们发现,开放源Llama-270亿模式在发现不遵守或真正负面事件方面表现出色,殴打了所有专有对应方。然而,诸如GPT-4等专有模型在广泛的情景中表现最佳,特别是在非英语环境中。


Article 205

Title@2025-07-22 (2): A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Title: A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1 Eine Methode für die Architektur eines medizinischen vertikalen Großsprachmodells auf Basis von Deepseek R1 基于Deepseek R1的医学垂直大语言模型的架构方法 2505.00025v2

Authors (2): Mingda Zhang, Jianglong Qin

Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.

尽管在DeepSeek-R1和ChattGPT等基础模型方面取得重大进展,但在医疗环境中的部署面临重大挑战,包括计算要求和专业知识障碍。本文件展示了高效的轻量医疗大型语言模型结构,通过三维优化(知识获取、模型压缩和计算强化)系统地应对这些挑战。我们设计了从DeepSeek-R1-Distilling-70B到DeepSeek-R1-Stilling-7B的知识传输管道,使用低Rank适应(LORA)精确的医疗知识保留。通过4位分级和混合精度战略,我们在保持医疗推理能力的同时实现了实质性的模型压缩。推论框架包括快速注意加速和连续分批,辅之以各种医疗问询的专用快速模板。对医疗基准的实验评估表明,我们的方法保持了美国MLE考试的92.1%的准确率,同时将记忆消耗量减少64.7%,与基线模型相比,误差为12.4%。这项工作为在资源受限制的医疗环境中部署先进语言模型提供了切实可行的解决办法,使AI辅助保健更加普及。
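
To make the effect of 4-bit quantization concrete, the weight-only arithmetic can be sketched as below. The 7-billion parameter count and the 16-bit baseline are assumptions for illustration, and the helper name is hypothetical. The figure computed here covers weights only; it does not reproduce the 64.7% overall memory reduction reported in the abstract, which also depends on activations, caches, and any layers kept at higher precision.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(7e9, 16)   # 16-bit baseline for a 7B model
int4_gb = weight_memory_gb(7e9, 4)    # the same weights stored in 4 bits
weight_saving = 1 - int4_gb / fp16_gb # fraction of weight memory saved
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, a 75% reduction for the weight tensors alone.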


Article 206

Title@2025-07-22 (2): A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis

Title: A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis Multi-Granularität Konzept Sparse Aktivierung und Hierarchisches Wissen Graph Fusion Framework für Seltene Krankheiten Diagnose 罕见疾病诊断多发性概念分散活动和等级知识图集融合框架 2507.08529v2

Authors (5): Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang

Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning. We propose a framework combining multi-granularity sparse activation with hierarchical knowledge graphs. Our approach employs four complementary matching algorithms with diversity control and a five-level fallback strategy for precise concept activation. A three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare disease dataset demonstrate significant improvements: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, and diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy, surpassing the 0.90 clinical threshold. Expert evaluation confirms enhancements in information quality, reasoning, and professional expression. Our framework shows promise in reducing the diagnostic odyssey for rare disease patients.

由于知识代表性不足、概念理解有限和临床推理有限,对医学大型语言模型而言,罕见疾病诊断仍然具有挑战性。我们提出了一个框架,将多种族稀疏活性与等级知识图相结合。我们的方法是使用四种与多样性控制的互补匹配算法和五级后退战略来精确概念激活。一个三层知识图(分类、临床特征、实例)提供了结构化的最新背景。BioASQ稀有疾病数据集实验显示有显著改进:BLEU得分增加0.13,ROUGE得分增加0.10,诊断准确度增加0.25,最佳模型达到0.92精确度,超越0.90临床阈值。专家评估确认信息质量、推理和专业表达的提高。我们的框架在减少对罕见疾病患者的诊断多变性方面有希望。


Article 207

Title@2025-07-22 (2): Mangosteen: An Open Thai Corpus for Language Model Pretraining

Title: Mangosteen: An Open Thai Corpus for Language Model Pretraining Mangosteen: Ein offener thailändischer Corpus für Sprachmodellvorschulungen Mangosteen: 开放的泰语语言模型泰国公司 2507.14664v2

Authors (7): Wannaphong Phatthiyaphaibun, Can Udomcharoenchaikit, Pakpoom Singkorapoom, Kunat Pipatanakul, Ekapol Chuangsuwanich, Peerat Limkonchotiwat, Sarana Nutanong

Pre-training data shapes a language model’s quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.

预训练数据决定语言模型的质量,但原始网页文本噪声很大,需要仔细清洗。现有的大规模语料库依赖以英语为中心或与语言无关的流水线,其启发式规则无法捕捉泰文文字或文化上的细微差别,致使赌博内容等风险材料未被处理。以往针对泰语的工作或定制流水线,或新建流水线,但很少公开其数据或记录设计选择,这妨碍了可复现性,也提出了如何构建透明、高质量泰语语料库的问题。我们推出Mangosteen:一个470亿词元的泰语语料库,通过适配泰语的Dolma流水线构建,其中包括定制的基于规则的语言识别、修订后的C4/Gopher质量过滤器和用泰语训练的内容过滤器,并加入经过整理的非网页来源,如维基百科、《皇家公报》文本、OCR提取的书籍和CC许可的YouTube字幕。使用GPT-2进行的系统消融实验表明,该流水线将CommonCrawl从2.02亿份文档精简到2500万份,同时将SEA-HELM NLG从3提高到11;在Mangosteen上持续预训练的80亿参数SEA-LION模型在泰语基准上超过SEA-LION-v3和Llama-3.1约四分。我们发布完整的流水线代码、清洗清单、语料库快照和所有检查点,为未来的泰语和区域LLM研究提供完全可复现的基础。


Article 208

Title@2025-07-22 (2): Hear Your Code Fail, Voice-Assisted Debugging for Python

Title: Hear Your Code Fail, Voice-Assisted Debugging for Python Hören Sie Ihren Code fehlschlagen, Voice-Assisted Debugging für Python 听到您的代码失效, 语音协助调试 Python 的调试 2507.15007v2

Authors (7): Sayed Mahbub Hasan Amiri, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Mohammad Shawkat Ali Mamun, Sk. Humaun Kabir, Naznin Akter

This research introduces an innovative voice-assisted debugging plugin for Python that transforms silent runtime errors into actionable audible diagnostics. By implementing a global exception hook architecture with pyttsx3 text-to-speech conversion and Tkinter-based GUI visualization, the solution delivers multimodal error feedback through parallel auditory and visual channels. Empirical evaluation demonstrates 37% reduced cognitive load (p<0.01, n=50) compared to traditional stack-trace debugging, while enabling 78% faster error identification through vocalized exception classification and contextualization. The system achieves sub-1.2 second voice latency with under 18% CPU overhead during exception handling, vocalizing error types and consequences while displaying interactive tracebacks with documentation deep links. Compatibility is validated across Python 3.7+ environments on Windows, macOS, and Linux platforms. Requiring only two lines of integration code, the plugin significantly improves accessibility for visually impaired developers and supports multitasking workflows through hands-free error diagnosis. Educational applications show particular promise, with pilot studies indicating 45% faster debugging skill acquisition among novice programmers. Future development will incorporate GPT-based repair suggestions and real-time multilingual translation to further advance auditory debugging paradigms. The solution represents a fundamental shift toward human-centric error diagnostics, bridging critical gaps in programming accessibility while establishing new standards for cognitive efficiency in software development workflows.

本研究为 Python 引入了一个创新的语音辅助调试插件,将无声的运行时错误转化为可操作的语音诊断。通过采用全局异常钩子架构,并结合 pyttsx3 文本转语音和基于 Tkinter 的图形界面可视化,该方案通过并行的听觉和视觉通道提供多模态错误反馈。实证评估显示,与传统的堆栈跟踪调试相比,认知负荷降低了 37%(p<0.01,n=50),并通过对异常进行语音分类和情境化说明,使错误识别速度提高了 78%。该系统在异常处理期间的语音延迟低于 1.2 秒,CPU 开销低于 18%,在朗读错误类型及其后果的同时显示带有文档深层链接的交互式回溯。其兼容性已在 Windows、macOS 和 Linux 平台的 Python 3.7+ 环境中得到验证。该插件只需两行集成代码,即可显著提升视障开发者的可访问性,并通过免提错误诊断支持多任务工作流程。教育应用前景尤为可观,试点研究表明新手程序员的调试技能习得速度提高了 45%。未来的开发将引入基于 GPT 的修复建议和实时多语言翻译,以进一步推进听觉调试范式。该方案代表了向以人为本的错误诊断的根本转变,弥补了编程可访问性方面的关键缺口,并为软件开发工作流程中的认知效率确立了新标准。
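
The global exception hook described in this abstract can be sketched as follows, assuming the plugin works by replacing `sys.excepthook`. The function names (`describe_exception`, `default_speak`, `install_voice_hook`) are hypothetical and not taken from the paper. The Tkinter window is omitted, and pyttsx3 is treated as an optional dependency with a plain-text fallback.

```python
import sys
import traceback


def describe_exception(exc_type, exc_value, exc_tb):
    """Build a short, speakable summary of an uncaught exception."""
    frames = traceback.extract_tb(exc_tb)
    location = ""
    if frames:
        last = frames[-1]
        location = f" at line {last.lineno} in {last.name}"
    return f"{exc_type.__name__}{location}: {exc_value}"


def default_speak(message):
    """Speak via pyttsx3 when it is installed; otherwise print to stderr."""
    try:
        import pyttsx3  # optional third-party dependency
    except ImportError:
        print(f"[voice] {message}", file=sys.stderr)
        return
    engine = pyttsx3.init()
    engine.say(message)
    engine.runAndWait()


def install_voice_hook(speak=default_speak):
    """Replace sys.excepthook so uncaught errors are spoken, then shown."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        try:
            speak(describe_exception(exc_type, exc_value, exc_tb))
        finally:
            # Always fall through to the previous hook so the traceback still prints.
            previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
    return hook
```

Under these assumptions, integration would amount to importing the module and calling `install_voice_hook()` once at program start. Passing a custom `speak` callable keeps the hook testable without audio hardware.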


Article 209

Title@2025-07-22 (2): Self-Correcting Code Generation Using Small Language Models

Title: Self-Correcting Code Generation Using Small Language Models Selbstkorrekte Code-Generierung mit kleinen Sprachmodellen 使用小型语言模式自行校正代码生成 2505.23060v2

Authors (4): Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.

自我纠正允许语言模型通过逐步改进来修订和完善其输出,已在代码生成中展现出潜力。最近的研究探索了基于提示的策略(利用专有模型引入验证或反馈循环),以及利用其强大推理能力的基于训练的方法。然而,较小的模型是否具备通过自我反思有效引导其输出的能力,尚未得到探索。我们的研究结果表明,小型模型在这两种自我纠正范式中都难以表现出反思性的修订行为。为此,我们提出了 CoCoS,一种旨在增强小型语言模型多轮代码纠正能力的方法。具体而言,我们提出了一个在线强化学习目标,训练模型在保持正确输出的同时,随着轮次推进逐步纠正错误输出。我们的方法采用累积奖励函数,在整个轨迹上汇总奖励,并采用更适合多轮纠正场景的细粒度奖励。这有助于模型在提高初始回答质量的同时,通过自我纠正取得大幅改进。在 1B 规模的模型上,与基线相比,CoCoS 在 MBPP 上提升了 35.8%,在 HumanEval 上提升了 27.7%。


Article 210

Title@2025-07-22 (2): Scaling Linear Attention with Sparse State Expansion

Title: Scaling Linear Attention with Sparse State Expansion Skalierung linearer Aufmerksamkeit mit Sparse State Expansion 利用稀疏状态扩展来扩展线性注意力 2507.16577v1

Authors (9): Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang, Guoqi Li

The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.

Transformer 架构尽管取得了广泛成功,但由于二次方的计算量和线性的内存增长,在长上下文场景中表现吃力。虽然各种线性注意力变体通过将上下文压缩为固定大小的状态来缓解这些效率限制,但它们往往会降低上下文内检索和推理等任务的性能。为了解决这一局限并实现更有效的上下文压缩,我们提出了两项关键创新。首先,我们将状态更新概念化为信息分类,为线性注意力引入了行稀疏更新形式。这使得可以通过基于 softmax 的 top-$k$ 硬分类进行稀疏状态更新,从而扩大感受野并减少类间干扰。其次,我们在稀疏框架内提出了稀疏状态扩展(SSE),它将上下文状态扩展为多个分区,在保持稀疏分类范式的同时,有效地将参数规模与状态容量解耦。我们的设计辅以高效的并行化实现,能够产生有效的分类和具有区分性的状态表示。我们在纯线性和混合(SSE-H)架构上,针对语言建模、上下文内检索和数学推理基准对 SSE 进行了广泛验证。SSE 表现出强大的检索性能,并且随状态规模良好扩展。此外,经过强化学习(RL)训练后,我们的 2B SSE-H 模型在小型推理模型中取得了最先进的数学推理性能,在 AIME24 上得分 64.7,在 AIME25 上得分 51.3,显著优于规模相近的开源 Transformer。这些结果表明,SSE 是一种有前景且高效的长上下文建模架构。
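
The row-sparse update via softmax-based top-$k$ hard classification can be illustrated with a small pure-Python sketch. The abstract does not give the exact formulation, so everything here is an assumption: one scoring vector per state row (`class_weights`), a softmax restricted to the selected rows, and an additive write of the value vector. It shows only the general idea that each token touches `top_k` rows of the state and leaves the rest unchanged.

```python
import math


def sparse_state_update(state, class_weights, k_vec, v_vec, top_k):
    """Update only the top_k rows of `state` chosen by a softmax classifier.

    state:         list of n_rows lists (each of length d_v), modified in place
    class_weights: list of n_rows lists (each of length d_k), one scorer per row
    k_vec, v_vec:  key and value vectors of the current token
    """
    # Score every state row against the current key (assumed dot-product scorer).
    logits = [sum(w * k for w, k in zip(row_w, k_vec)) for row_w in class_weights]
    # Hard top-k selection: indices of the k largest logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax restricted to the chosen rows (shifted by the max for stability).
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    for i in chosen:
        weight = exps[i] / total
        # Assumed additive write; unselected rows are never touched.
        state[i] = [s + weight * v for s, v in zip(state[i], v_vec)]
    return sorted(chosen)
```

In this sketch the number of rows can grow without changing the per-token write cost, which is one plausible reading of how state capacity is decoupled from the update.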


Article 211

Title@2025-07-22 (2): Supernova: Achieving More with Less in Transformer Architectures

Title: Supernova: Achieving More with Less in Transformer Architectures Supernova: Mit weniger Transformer-Architekturen mehr erreichen 超新星:在变形结构结构中以更少的变形结构实现更大的成就 2507.15773v2

Authors (2): Andrei-Valentin Tanase, Elena Pelican

We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens–an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.

我们提出了 Supernova,一个 650M 参数的仅解码器 Transformer,它展示了精心的架构设计和分词创新如何在保持计算效率的同时达到更大模型的性能。我们的架构结合了旋转位置嵌入(RoPE)、压缩比为 3:1 的分组查询注意力(GQA)、用于提高计算效率的 RMSNorm 以及 SwiGLU 激活函数。一项关键创新是我们定制的 128,000 词表字节级 BPE 分词器,它实现了最先进的压缩性能。通过详细分析,我们表明 Supernova 达到了 1B 参数模型 90% 的性能,同时参数减少了 35%,并且只需要 100B 训练词元,比同类模型少一个数量级。我们的发现对当前主流的规模扩展范式提出了挑战,表明架构效率和分词质量可以弥补参数量的减少。
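
Two figures in this abstract can be checked with a few lines of arithmetic. The sketch assumes that the "3:1 compression ratio" means three query heads share one key/value head; the head count of 12 in the usage note is a hypothetical value, not a figure from the paper.

```python
def kv_head_for_query_head(query_head, group_size=3):
    """Index of the shared key/value head used by a given query head."""
    return query_head // group_size


def kv_cache_fraction(num_query_heads, group_size=3):
    """Fraction of K/V heads kept relative to full multi-head attention."""
    if num_query_heads % group_size:
        raise ValueError("num_query_heads must be divisible by group_size")
    return (num_query_heads // group_size) / num_query_heads


def parameter_reduction(small, large):
    """Relative parameter saving of `small` versus `large`."""
    return 1 - small / large
```

For example, with 12 query heads the key/value cache would shrink to one third of its multi-head size, and `parameter_reduction(650e6, 1e9)` reproduces the stated 35% reduction relative to a 1B-parameter model.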


Article 212

Title@2025-07-22 (2): Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models

Title: Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models Pixel zu Prinzipien: Intuitive Physik in multimodalen Sprachmodellen verstehen 原则的像素:在多模式语言模型中探明直觉物理理解 2507.16572v1

Authors (3): Mohamad Ballout, Serwan Jassim, Elia Bruni

This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.

本文利用GRASP和IntPhys 2数据集,对最新多式大型语言模型(MLLMM)进行系统评估,利用GRASP和IntPhys 2数据集对直观物理学任务进行系统评估。我们评估了InternVL 2.5、Qwen 2.5 VL、Lalava-OneVision和专有的Gemini 2.0 Flash Thinking等开放源模型,发现即使是最新的模型也难以可靠地区分物理上与不可信的情景。为了超越性能衡量标准,我们进行了模型嵌入的检验分析,在关键处理阶段提取中间演示,以检查与任务有关的信息的保存情况。我们的结果显示,根据任务难度,关键的视觉语言错配可能会出现:视觉编码器成功捕捉到了物理上的光亮点,但这一信息没有被语言模型有效地利用,导致推理上的失败。这种误点表明,在直观物理任务中MLLLMS的主要限制不是视觉组成部分,而是视觉和语言信息的无效整合。我们的调查结果突出表明了视觉-语言调整是未来发展的关键领域。


Article 213

Title@2025-07-22 (2): Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

Title: Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language Gender Bias in großen Sprachmodellen erforschen: Ein tiefer Einblick in die deutsche Sprache 在大语言模式中探索性别偏见:深入跳入德语 2507.16557v1

Authors (4): Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schlüter

In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.

近年来,为评价大型语言模式中的性别偏见提出了各种方法,其中一项关键挑战在于,最初为英语开发的偏见计量方法在应用于其他语言时能否转移,这项工作旨在通过在LLMs中提供五套德国性别偏见评价数据集,促进这一研究领域。数据集以公认的性别偏见概念为基础,可通过多种方法获取。我们关于8种多语言LLM模式的报告发现,与德语性别偏见有关的独特挑战,包括对男性职业术语的模糊解释和看似中立的名词对性别观念的影响。这项工作有助于理解LLMs中跨语言的性别偏见,并强调有必要制定有针对性的评价框架。


Article 214

Title@2025-07-22 (2): Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Title: Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen LLM 能生成可靠的测试用例生成器吗?关于竞赛级编程问题的研究 2506.06821v3

Authors (21): Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) problems and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.

大型语言模型(LLMS)在代码生成方面表现出了非凡的能力,能够在推断过程中处理复杂的任务,然而,LLMS在通过测试案例生成过程中可用于代码检查或调试的功能仍然在很大程度上没有得到探索。我们从竞争级别的编程(CP)方案的角度来调查这一问题,并提出TCGBench,即(LLM生成)测试案例生成器的基准。这一基准包括两项任务,目的是研究LLMS在(1)为特定CP问题生成有效测试案例生成器的能力,以及进一步(2)生成有针对性的测试案例生成器,暴露人造代码中的错误。实验结果表明,尽管最先进的LMS能够产生有效的测试案例生成器,但大多数LLMS都在努力生成能够有效揭示人类代码缺陷的定向测试案例。特别是,甚至先进的推理模型(如o3-mini)在生成目标型发电机的任务中也远远低于人类的性能。此外,我们为生成目标型发电机设计了一个高质量的手工整理数据集。分析结果表明,LMS的性能通过这一数据组合的迅速得到改进。
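
The difference between the benchmark's two tasks can be illustrated with a toy example. The problem (largest element of a non-empty list), its constraints, the buggy solution, and all function names below are hypothetical and do not come from TCGBench. The sketch only contrasts a valid generator, which produces any legal input, with a targeted generator, which aims at one specific bug.

```python
import random


def reference_max(values):
    """Trusted solution: largest element of a non-empty list."""
    return max(values)


def buggy_max(values):
    """Human-style bug: starting from 0 breaks on all-negative input."""
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def valid_generator(rng, n_max=10, lo=-100, hi=100):
    """Generic generator: any input satisfying the assumed constraints."""
    n = rng.randint(1, n_max)
    return [rng.randint(lo, hi) for _ in range(n)]


def targeted_generator(rng, n_max=10, lo=-100):
    """Targeted generator: only all-negative inputs, aimed at the bug above."""
    n = rng.randint(1, n_max)
    return [rng.randint(lo, -1) for _ in range(n)]


def exposes_bug(generator, trials=20, seed=0):
    """Return the first generated input on which the two solutions disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = generator(rng)
        if reference_max(case) != buggy_max(case):
            return case
    return None
```

A random valid generator may or may not hit the failing region within a fixed budget, whereas the targeted generator fails the buggy solution on its first input. That gap is the capability the second task measures.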


Article 215

Title@2025-07-22 (2): Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Title: Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen 种子-X:利用7B参数建立强有力的多语种翻译LLM 2507.13618v2

Authors (26): Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices from our optimization process, and make the parameters publicly available for advancing translation research and applications.

多语文翻译是大型语言模型(LLMS)处理复杂语言模式和自动化翻译中出现的精细翻译的艰巨任务。本文介绍种子-X,这是一个开放源码LMS系列,由指导和推理模型组成,推动7B参数大小的翻译能力极限。基础模型在包括28种语言的单语和双语内容的多种高质量数据集方面进行了预先培训,充分利用多语种数据的潜力。然后,指导模型经过微调,通过Thought链(Cot)推理翻译,并通过强化学习(RL)进一步强化。种子-X的成绩与主要的封闭源码模型(包括28种语言的Gemini-2.5和GPT-4o)相当,大大超越了在自动计量和人类评价方面更大的开放源码模型。我们通过优化程序分享最佳做法,并公布参数,以推进翻译研究和应用。


Article 216

Title@2025-07-22 (2): Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Title: Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse 《国际边界风险管理框架实际操作:风险分析技术报告》 2507.16534v1

Authors (37): Shanghai AI Lab, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-$45^\circ$ Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.

为了理解和识别快速发展的人工智能(AI)模型所带来的前所未有的风险,本报告对其前沿风险进行了全面评估。基于《前沿 AI 风险管理框架》(v1.0)(SafeWork-F1-Framework)中的 E-T-C 分析(部署环境、威胁来源、使能能力),我们识别了七个领域的关键风险:网络攻击、生物和化学风险、说服与操纵、不受控制的自主 AI 研发、战略性欺骗与谋划、自我复制以及合谋。在“AI-$45^\circ$ 定律”的指导下,我们使用“红线”(不可容忍的阈值)和“黄线”(早期预警指标)来评估这些风险,并划定风险区域:绿色(风险可控,可进行常规部署和持续监测)、黄色(需要加强缓解措施并受控部署)和红色(需要暂停开发和/或部署)。实验结果表明,近期所有前沿 AI 模型都处于绿色和黄色区域,没有越过红线。具体而言,在网络攻击和不受控制的 AI 研发风险方面,没有任何被评估的模型越过黄线。在自我复制以及战略性欺骗与谋划方面,除某些推理模型处于黄色区域外,大多数模型仍处于绿色区域。在说服与操纵方面,由于对人类具有有效的影响力,大多数模型处于黄色区域。在生物和化学风险方面,我们无法排除大多数模型处于黄色区域的可能性,不过还需要详细的威胁建模和深入评估才能做出进一步的判断。这项工作反映了我们目前对 AI 前沿风险的理解,并呼吁采取集体行动来应对这些挑战。
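
The zone logic in this abstract (yellow lines as early warnings, red lines as intolerable thresholds) can be sketched as a simple threshold function. The numeric score, the threshold values, and the choice that reaching a line counts as crossing it are all assumptions for illustration. The report evaluates each risk area with its own criteria, which are not reproduced here.

```python
def risk_zone(score, yellow_line, red_line):
    """Map a risk score to a zone given two thresholds (yellow_line < red_line)."""
    if not yellow_line < red_line:
        raise ValueError("yellow_line must be below red_line")
    if score >= red_line:
        return "red"
    if score >= yellow_line:
        return "yellow"
    return "green"


# Responses per zone, paraphrased from the abstract.
ZONE_ACTIONS = {
    "green": "routine deployment with continuous monitoring",
    "yellow": "strengthened mitigations and controlled deployment",
    "red": "suspend development and/or deployment",
}
```

With hypothetical thresholds of 0.4 and 0.8, a score of 0.5 falls in the yellow zone and maps to strengthened mitigations and controlled deployment.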


Article 217

Title@2025-07-22 (2): Learning Text Styles: A Study on Transfer, Attribution, and Verification

Title: Learning Text Styles: A Study on Transfer, Attribution, and Verification Lerntextstile: Eine Studie über Transfer, Attribution und Verifizierung 学习教科书样式:关于转让、归属和核查的研究 2507.16530v1

Authors (1): Zhiqiang Hu

This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), identifying the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), determining whether two texts share the same authorship. We address critical challenges in these areas by leveraging parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.

本论文通过三个相互关联的支柱推进了对文本风格的计算理解和操控:(1)文本风格迁移(TST),在保留内容的同时改变风格属性(如情感、正式程度);(2)作者归属(AA),通过风格指纹识别文本的作者;(3)作者身份验证(AV),判断两篇文本是否出自同一作者。我们利用大型语言模型(LLM)的参数高效适配、风格特征的对比解耦以及面向可解释验证的基于指令的微调,来应对这些领域的关键挑战。


Article 218

Title@2025-07-22 (2): C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Title: C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung C2-Evo:共同演进的多模式数据和自我改进理由模型 2507.16518v1

Authors (12): Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang

Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.

多模态大型语言模型(MLLM)的最新进展展现了令人印象深刻的推理能力。然而,进一步增强现有 MLLM 需要高质量的视觉-语言数据集,且任务复杂度须经过精心设计,这既昂贵又难以规模化。尽管近期通过迭代自我完善的自我改进模型提供了可行的解决方案,但它们仍面临两个核心挑战:(i)大多数现有方法分别增强视觉数据或文本数据,导致数据复杂度不一致(例如,过于简化的图表搭配冗余的文字描述);(ii)数据与模型的演化也是分离的,导致模型面对难度不匹配的任务。为了解决这些问题,我们提出了 C2-Evo,一个自动的闭环自我改进框架,使训练数据和模型能力共同演化。具体而言,给定一个基础数据集和一个基础模型,C2-Evo 通过跨模态数据演化循环和数据-模型演化循环来增强它们。前一个循环通过生成复杂的多模态问题来扩充基础数据集,这些问题将结构化的文本子问题与迭代指定的几何图形相结合;后一个循环则根据基础模型的表现自适应地选择生成的问题,交替进行监督微调和强化学习。因此,我们的方法不断完善其模型和训练数据,并在多个数学推理基准上持续取得可观的性能提升。我们的代码、模型和数据集将会发布。


Article 219

Title@2025-07-22 (2): Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness

Title: Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness Einführung der Qualitätsschätzung in die maschinelle Übersetzung Nachbearbeitung des Workflows: Eine empirische Studie über seine Nützlichkeit 对机器翻译质量进行质量估算,编辑后工作流程:关于其使用经验研究 2507.16515v1

Authors (3): Siqi Liu, Guangrong Dai, Dechao Li

This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators’ perceptions. It also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The examined interaction effects were not significant, suggesting that QE consistently improves MTPE efficiency across medium- and high-quality MT outputs and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators’ evaluations of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators’ productivity.

这份初步研究报告调查了中英机器翻译后编辑(MTPE)中判决一级质量估计(QE)的有用性,重点是其对编辑后速度和学生翻译的看法的影响,还探讨了质量评价与质量评价之间以及质量评价与翻译专门知识之间的相互作用效应。研究结果显示,质量评价大大减少了编辑后的时间。所审查的互动效果并不显著,表明质量评价不断提高中、高质量的MTE产出和具有不同水平专门知识的学生翻译的效率。除了指出可能存在问题的部分外,质量评价在MTPE中服务于多种功能,例如验证笔译员对质量的评价,使他们能够重复核对翻译产出。但访谈数据表明,不准确的质量评价可能妨碍编辑后的进程。这项研究对质量评价的长处和局限性提供了新的见解,有助于更有效地将其纳入MTPE工作流程,以提高翻译的生产率。


Article 220

Title@2025-07-22 (2): The Ever-Evolving Science Exam

Title: The Ever-Evolving Science Exam Die sich ständig weiterentwickelnde Wissenschaftsprüfung 不断演变的科学考试 2507.16514v1

Authors (12): Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.

随着基础模型在能力和部署方面的迅速增长,评估其科学理解能力变得日益重要。现有的科学基准在广泛的范围(Range)、覆盖面(Reach)和严谨性(Rigor)方面取得了进展,但它们往往面临两大挑战:损害基准有效性的数据泄露风险,以及大规模测试导致的评估效率低下。为了解决这些问题,我们提出了 Ever-Evolving Science Exam(EESE),一个旨在可靠评估基础模型科学能力的动态基准。我们的方法由两部分组成:1)非公开的 EESE-Pool,包含超过 10 万个由专家构建的科学实例(问答对),涵盖 5 个学科和 500 多个子领域,通过确保 Range、Reach 和 Rigor 的多阶段流程构建;2)定期更新的 500 个实例的子集 EESE,经过抽样和验证,可实现抗泄露、低开销的评估。在 32 个开源和闭源模型上的实验表明,EESE 能够有效区分模型在各科学领域和认知维度上的优势与不足。总体而言,EESE 为科学基准设计提供了一个稳健、可扩展且前向兼容的解决方案,能够真实衡量基础模型处理科学问题的能力。项目页面见 https://github.com/aiben-ch/EESE 。
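
The idea of periodically drawing a small public subset from a private pool can be sketched as discipline-balanced sampling. The record schema (a `discipline` key), the equal per-discipline quota, and the use of a per-release seed are assumptions for illustration. The abstract does not describe the actual sampling and validation procedure.

```python
import random
from collections import defaultdict


def sample_subset(pool, subset_size, seed):
    """Draw a discipline-balanced subset from a private pool of instances.

    pool: list of dicts with at least a "discipline" key (hypothetical schema).
    """
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)
    rng = random.Random(seed)  # a new seed per release gives a fresh subset
    disciplines = sorted(by_discipline)
    base, extra = divmod(subset_size, len(disciplines))
    subset = []
    for index, name in enumerate(disciplines):
        quota = base + (1 if index < extra else 0)
        # rng.sample raises ValueError if a discipline has fewer items than its quota.
        subset.extend(rng.sample(by_discipline[name], quota))
    return subset
```

Because only the sampled subset is ever published, a model that has memorised one release gains little on the next. The small subset size also keeps each evaluation run cheap.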


Article 221

Title@2025-07-22 (2): Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

Title: Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation Sparrow: Dateneffizientes Video-LLM mit Text-zu-Bild-Erweiterung 麻雀:数据有效视频LLM,带有文本到图像放大功能 2411.19951v5

Authors (10): Shukang Yin, Chaoyou Fu, Sirui Zhao, Chunjiang Ge, Yan Yang, Yuhan Dai, Yongdong Luo, Tong Xu, Caifeng Shan, Enhong Chen

Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.

近年来,多模态大型语言模型(MLLM)在视觉理解领域取得了成功。这些模型的成功在很大程度上可归因于占主导地位的缩放定律,即更大的参数规模和数据量有助于提升性能。值得注意的是,数据规模的扩展主要由自动数据管线驱动,这些管线以 LLM 的自我指令生成为核心。这一范式长期以来被视为理所当然,但利用这类数据进行扩展的有效性研究却长期被忽视。在此背景下,本工作重新审视合成数据的扩展问题,并从以数据为中心的视角着力开发视频 LLM。我们的主要研究方法是用视频数据微调预训练的图像 LLM,并通过数据扩展考察学习效率。初步实验结果显示,单纯扩大视频数据样本时会出现学习效率低下的现象,经过我们的探究,这可归因于指令多样性的不足。针对这一问题,我们提出了一种名为 Sparrow 的数据增强方法,它从纯文本指令数据中合成类视频样本。将这些合成样本与视频数据混合,可以实现更高效的训练方案。通过全面的实验,我们证明所提出的方法取得了与使用明显更多样本训练的基线相当甚至更优的性能。同时,我们发现加入这些合成样本可以提升长视频理解的性能,而无需在长视频数据上进行训练。代码和数据示例见 https://github.com/VITA-MLLM/Sparrow 。


Article 222

Title@2025-07-22 (2): Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

Title: Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics Bewertung der Intermediate Reasoning von Code-Assisted Large Language Models für Mathematik 评价代号协助的数学大语言模型的中间推理 2504.17665v2

Authors (3): Zena Al-Khalili, Nick Howell, Dietrich Klakow

Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of programs generated by code-assisted LLMs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs, on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs’ limits in the math domain.

协助LLMS的代码生成提高了其在数学推理任务方面的绩效。然而,对代码辅助LMS的评估一般限于执行正确性,缺乏对其生成的程序的严格评估。在这项工作中,我们通过深入分析代码辅助LLMS针对数学推理任务而生成的程序来弥补这一差距,重点是评估基本推理过程的健全性。为此,我们评估了五代LMS的数学数据集,包括手动和自动的数学数据集,并提议基于其逻辑正确性对生成的程序进行分类。我们的研究结果表明,模型的能力严重影响了为解决问题而实施的逻辑。闭源LMS将其方案植根于数学概念中,而开放源码模型往往采用不健全的推理方法,依靠记忆化的信息和详尽的搜索。此外,问题难度的增加使得所有模型的健全代数减少,表明LLMMS在复杂的数学方面有重大缺陷,而这种缺陷与精确度指标所显示的相反。我们的工作强调需要对代码辅助LMSMs进行更全面的评价,而不是执行精确度度度度度指标,以便更好地了解数学领域的LMSMs的极限。


Article 223

Title@2025-07-22 (2): Combining Language and Topic Models for Hierarchical Text Classification

Title: Combining Language and Topic Models for Hierarchical Text Classification Kombination von Sprach- und Themenmodellen für die Hierarchische Textklassifikation 将等级文字分类的语言和专题模式相结合 2507.16490v1

Authors (2): Jaco du Toit, Marcel Dunaiski

Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use an HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.

分级文本分类(HTC)是一项自然语言处理任务,其目的在于将文本文件从预先界定的层次层次结构分类成一组类别。最近HTC采用各种技术,将等级级结构信息与经过训练的语文模型(PLMs)的自然语言理解能力结合起来,以提高分类性能。此外,使用专题模型和PLMs从文本文档中提取特征,这已证明是多标签文本分类任务的一种有效方法。这些特征提取模型的组合原理是,PLM从一个预先界定的层次层次层次层次结构中获取精细背景和语义信息,而主题模型则获得高层次代表,将文件全套内容考虑在内。在本文件中,我们使用一个主题模型和专题模型从文本文件中提取特征,用于从文本分类模型中提取特征。我们的目标是确定从两个模型中提取的特征组合是否有利于高标签分类一般的绩效。在我们的方法中,提取的特征通过分级结构层层,其产出是合并的,而不是通过高层次的表达方式,我们使用每个标签和最精确的分类模型,我们用不同的模型来评估每个分类特征。
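
The label-wise attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a common dot-product formulation in which each class has its own query vector, and it takes the already combined convolutional outputs as given. The names `label_wise_attention`, `features`, and `label_queries` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def label_wise_attention(features, label_queries):
    """Return one document vector per label.

    features:      n positions x d (assumed: combined convolutional outputs)
    label_queries: L labels x d (assumed: one learned query per class)
    """
    dim = len(features[0])
    label_docs = []
    for query in label_queries:
        # Each label scores every position with its own query vector.
        weights = softmax([dot(query, h) for h in features])
        # The label-specific representation is the weighted sum of positions.
        label_docs.append(
            [sum(w * h[j] for w, h in zip(weights, features)) for j in range(dim)]
        )
    return label_docs


# Toy input: three positions with two-dimensional features, two labels.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
label_queries = [[5.0, 0.0], [0.0, 5.0]]
label_docs = label_wise_attention(features, label_queries)
```

In this toy run the first label attends mostly to positions with a large first feature and the second label to positions with a large second feature, so the two labels obtain different representations of the same document.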


Article 224

Title@2025-07-22 (2): ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs

Title: ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs ICR-Probe: Verborgene Zustandsdynamiken für zuverlässige Halluzinationserkennung in LLMs verfolgen ICR Probe:跟踪隐藏状态动态,以便用LLMs进行可靠的幻觉探测 2507.16488v1

Authors (5): Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan

Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states’ update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.

大型语言模型(LLMS)擅长各种自然语言处理任务,但其产生幻觉的倾向会破坏其可靠性。 现有的幻觉检测方法利用隐藏国家,主要侧重于静态和孤立的表达方式,忽视其跨层的动态演变,从而限制效力。 为解决这一局限性,我们将重点转向隐藏状态更新过程,并引入新的指标,即ICR分数(遗留流的信息贡献),该分数量化了模块对隐藏状态更新的贡献。我们从经验上证实ICR分数在区分幻觉方面是有效和可靠的。基于这些洞察,我们建议一种幻觉检测方法,ICR Probe,它捕捉隐藏状态的跨层演化。实验结果显示ICR Probe以少得多的参数取得了优异的绩效。 此外,通缩研究和案例分析更深入地揭示了该方法的基本机制,提高了其可解释性。


Article 225

Title@2025-07-22 (2): Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation

Title: Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation Typed-RAG: Type-Aware Zersetzung von nicht-Faktoiden Fragen für retrieval-Augmented Generation 型式RAG: 用于回收-提款一代的非实物问题类型软件分解 2503.15879v3

Authors (5): DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng

Addressing non-factoid question answering (NFQA) remains challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning. These characteristics often reveal the limitations of conventional retrieval-augmented generation (RAG) approaches. To overcome these challenges, we propose Typed-RAG, a framework for type-aware decomposition of non-factoid questions (NFQs) within the RAG paradigm. Specifically, Typed-RAG first classifies an NFQ into a predefined type (e.g., Debate, Experience, Comparison). It then decomposes the question into focused sub-queries, each targeting a single aspect. This decomposition enhances both retrieval relevance and answer quality. By combining the results of these sub-queries, Typed-RAG produces more informative and contextually aligned responses. Additionally, we construct Wiki-NFQA, a benchmark dataset for NFQA covering a wide range of NFQ types. Experiments show that Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, validating the effectiveness of type-aware decomposition for improving both retrieval quality and answer generation in NFQA. Our code and dataset are available at https://github.com/TeamNLP/Typed-RAG.

解决非活性问题解答(NFQA)仍具有挑战性,因为其性质是开放的,用户意图不同,需要多层次的推理。这些特点往往揭示了常规检索增强的一代(RAG)方法的局限性。为了克服这些挑战,我们建议在RAG范式内,为非活性问题类型分解(NFQ)的框架,即Styd-RAG(NFQ),具体来说,类型RAG首先将NFQ分类为预先定义的类型(例如,辩论、经验、比较)。然后,将问题分解为重点的子问题,每个问题都侧重于一个单一的方面。这种分解可增强检索相关性和回答质量。通过合并这些子查询的结果,Styd-RAG(NFQ)产生更丰富和背景一致的答复。此外,我们建造了Wiki-NFQA,这是NFQA的基准数据集,涵盖广泛的NFQ 类型。实验显示,类型RAG(CG)持续超越现有基于LAMS或RA型号的QA的QUA方法的Qreal-Realition。
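
The classify, decompose, and retrieve flow described in the abstract can be sketched with toy components. Everything below is an illustrative assumption rather than the Typed-RAG code: the keyword classifier, the per-type templates in `ASPECT_TEMPLATES`, and the word-overlap retriever are hypothetical stand-ins, and the final generation step, which would need a language model, is left out.

```python
# Hypothetical per-type decomposition templates; only the type names
# (Debate, Experience, Comparison) come from the abstract.
ASPECT_TEMPLATES = {
    "Debate": ["arguments for: {q}", "arguments against: {q}"],
    "Experience": ["first-hand accounts: {q}"],
    "Comparison": ["strengths of each option: {q}", "weaknesses of each option: {q}"],
}


def classify_type(question):
    """Toy keyword classifier standing in for a real question-type classifier."""
    q = question.lower()
    if " or " in q or " vs " in q or "versus" in q:
        return "Comparison"
    if q.startswith("should"):
        return "Debate"
    return "Experience"


def retrieve(sub_query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the sub-query."""
    terms = set(sub_query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def typed_rag(question, corpus):
    """Classify the question, decompose it by type, and retrieve per sub-query."""
    qtype = classify_type(question)
    sub_queries = [template.format(q=question) for template in ASPECT_TEMPLATES[qtype]]
    evidence = {sq: retrieve(sq, corpus) for sq in sub_queries}
    return qtype, evidence


corpus = [
    "Tea has antioxidants and less caffeine",
    "Coffee may improve alertness but can disturb sleep",
    "Cycling to work saves money",
]
qtype, evidence = typed_rag("Is tea or coffee healthier?", corpus)
```

A full system would pass the evidence gathered for each sub-query to a generator and merge the partial answers into one response; the sketch stops at retrieval so that it stays self-contained.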


Article 226

Title@2025-07-22 (2): ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

Title: ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension ReMeREC: Beziehungsbewusste und Multi-Entity-Bezug auf Expression-Verständnis ReMEREC: 关系意识和多实体参考表达式理解 2507.16877v1

Authors (9): Yizhi Hu, Zezhao Tian, Xingqun Qi, Chen Su, Bingkun Yang, Junhui Yin, Muyi Sun, Man Zhang, Zhenan Sun

Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.

表达理解(REC) 旨在将特定实体或区域以自然语言描述的图像定位为特定实体或区域; 现有方法处理单一实体本地化,但往往忽视多实体场景中复杂的实体间关系,限制其准确性和可靠性; 此外,缺乏精细区分、配对图像-文本关系说明的高质量数据集阻碍了进一步的进展; 为了应对这一挑战,我们首先构建了一个名为ReMeX的关联意识、多实体REC数据集,其中包括详细的关系和文字说明; 我们随后提议了ReMEREC,这是一个新颖的框架,在建模时,利用视觉和文字上的大提示将多个实体本地化,同时模拟其内部关系; 为了解决语言中隐含实体界限造成的语义模糊问题,我们引入了文本适应性多功能-文字关系说明(TMP) , 动态地将实体的数量和范围从精细的文本提示中推断出来, 产生独特的演示。 此外,我们实体间关联性理性理性关系(EIR) 联合利用视觉和文字推导法将多个关系提升关系关系, 构建一个快速的实地数据模型。


Article 227

Title@2025-07-22 (2): Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Title: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Zwillinge 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿 2507.06261v4

Authors (3309): Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Sharon Silver, Ayzaan Wahid, Sergey Brin, Yves Raimond, Klemen Kloboves, Cindy Wang, Nitesh Bharadwaj Gundavarapu, Ilia Shumailov, Bo Wang, Mantas Pajarskas, Joe Heyward, Martin Nikoltchev, Maciej Kula, Hao Zhou, Zachary Garrett, Sushant Kafle, Sercan Arik, Ankita Goel, Mingyao Yang, Jiho Park, Koji Kojima, Parsa Mahmoudieh, Koray Kavukcuoglu, Grace Chen, Doug Fritz, Anton Bulyenov, Sudeshna Roy, Dimitris Paparas, Hadar Shemtov, Bo-Juen Chen, Robin Strudel, David Reitter, Aurko Roy, Andrey Vlasov, Changwan Ryu, Chas Leichner, Haichuan Yang, Zelda Mariet, Denis Vnukov, Tim Sohn, Amy Stuart, Wei Liang, Minmin Chen, Praynaa Rawlani, Christy Koh, JD Co-Reyes, Guangda Lai, Praseem Banzal, Dimitrios Vytiniotis, Jieru Mei, Mu Cai, Mohammed Badawi, Corey Fry, Ale Hartman, Daniel Zheng, Eric Jia, James Keeling, Annie Louis, Ying Chen, Efren Robles, Wei-Chih Hung, Howard Zhou, Nikita Saxena, Sonam Goenka, Olivia Ma, Zach Fisher, Mor Hazan Taege, Emily Graves, David Steiner, Yujia Li, Sarah Nguyen, Rahul Sukthankar, Joe Stanton, Ali Eslami, Gloria Shen, Berkin Akin, Alexey Guseynov, Yiqian Zhou, Jean-Baptiste Alayrac, Armand Joulin, Efrat Farkash, Ashish Thapliyal, Stephen Roller, Noam Shazeer, Todor Davchev, Terry Koo, Hannah Forbes-Pollard, 
Kartik Audhkhasi, Greg Farquhar, Adi Mayrav Gilady, Maggie Song, John Aslanides, Piermaria Mendolicchio, Alicia Parrish, John Blitzer, Pramod Gupta, Xiaoen Ju, Xiaochen Yang, Puranjay Datta, Andrea Tacchetti, Sanket Vaibhav Mehta, Gregory Dibb, Shubham Gupta, Federico Piccinini, Raia Hadsell, Sujee Rajayogam, Jiepu Jiang, Patrick Griffin, Patrik Sundberg, Jamie Hayes, Alexey Frolov, Tian Xie, Adam Zhang, Kingshuk Dasgupta, Uday Kalra, Lior Shani, Klaus Macherey, Tzu-Kuo Huang, Liam MacDermed, Karthik Duddu, Paulo Zacchello, Zi Yang, Jessica Lo, Kai Hui, Matej Kastelic, Derek Gasaway, Qijun Tan, Summer Yue, Pablo Barrio, John Wieting, Weel Yang, Andrew Nystrom, Solomon Demmessie, Anselm Levskaya, Fabio Viola, Chetan Tekur, Greg Billock, George Necula, Mandar Joshi, Rylan Schaeffer, Swachhand Lokhande, Christina Sorokin, Pradeep Shenoy, Mia Chen, Mark Collier, Hongji Li, Taylor Bos, Nevan Wichers, Sun Jae Lee, Angéline Pouget, Santhosh Thangaraj, Kyriakos Axiotis, Phil Crone, Rachel Sterneck, Nikolai Chinaev, Victoria Krakovna, Oleksandr Ferludin, Ian Gemp, Stephanie Winkler, Dan Goldberg, Ivan Korotkov, Kefan Xiao, Malika Mehrotra, Sandeep Mariserla, Vihari Piratla, Terry Thurk, Khiem Pham, Hongxu Ma, Alexandre Senges, Ravi Kumar, Clemens Meyer, Ellie Talius, Nuo Wang Pierse, Ballie Sandhu, Horia Toma, Kuo Lin, Swaroop Nath, Tom Stone, Dorsa Sadigh, Nikita Gupta, Arthur Guez, Avi Singh, Matt Thomas, Tom Duerig, Yuan Gong, Richard Tanburn, Lydia Lihui Zhang, Phuong Dao, Mohamed Hammad, Sirui Xie, Shruti Rijhwani, Ben Murdoch, Duhyeon Kim, Will Thompson, Heng-Tze Cheng, Daniel Sohn, Pablo Sprechmann, Qiantong Xu, Srinivas Tadepalli, Peter Young, Ye Zhang, Hansa Srinivasan, Miranda Aperghis, Aditya Ayyar, Hen Fitoussi, Ryan Burnell, David Madras, Mike Dusenberry, Xi Xiong, Tayo Oguntebi, Ben Albrecht, Jörg Bornschein, Jovana Mitrović, Mason Dimarco, Bhargav Kanagal Shamanna, Premal Shah, Eren Sezener, Shyam Upadhyay, Dave Lacey, Craig Schiff, Sebastien Baur, Sanjay 
Ganapathy, Eva Schnider, Mateo Wirth, Connor Schenck, Andrey Simanovsky, Yi-Xuan Tan, Philipp Fränken, Dennis Duan, Bharath Mankalale, Nikhil Dhawan, Kevin Sequeira, Zichuan Wei, Shivanker Goel, Caglar Unlu, Yukun Zhu, Haitian Sun, Ananth Balashankar, Kurt Shuster, Megh Umekar, Mahmoud Alnahlawi, Aäron van den Oord, Kelly Chen, Yuexiang Zhai, Zihang Dai, Kuang-Huei Lee, Eric Doi, Lukas Zilka, Rohith Vallu, Disha Shrivastava, Jason Lee, Hisham Husain, Honglei Zhuang, Vincent Cohen-Addad, Jarred Barber, James Atwood, Adam Sadovsky, Quentin Wellens, Steven Hand, Arunkumar Rajendran, Aybuke Turker, CJ Carey, Yuanzhong Xu, Hagen Soltau, Zefei Li, Xinying Song, Conglong Li, Iurii Kemaev, Sasha Brown, Andrea Burns, Viorica Patraucean, Piotr Stanczyk, Renga Aravamudhan, Mathieu Blondel, Hila Noga, Lorenzo Blanco, Will Song, Michael Isard, Mandar Sharma, Reid Hayes, Dalia El Badawy, Avery Lamp, Itay Laish, Olga Kozlova, Kelvin Chan, Sahil Singla, Srinivas Sunkara, Mayank Upadhyay, Chang Liu, Aijun Bai, Jarek Wilkiewicz, Martin Zlocha, Jeremiah Liu, Zhuowan Li, Haiguang Li, Omer Barak, Ganna Raboshchuk, Jiho Choi, Fangyu Liu, Erik Jue, Mohit Sharma, Andreea Marzoca, Robert Busa-Fekete, Anna Korsun, Andre Elisseeff, Zhe Shen, Sara Mc Carthy, Kay Lamerigts, Anahita Hosseini, Hanzhao Lin, Charlie Chen, Fan Yang, Kushal Chauhan, Mark Omernick, Dawei Jia, Karina Zainullina, Demis Hassabis, Danny Vainstein, Ehsan Amid, Xiang Zhou, Ronny Votel, Eszter Vértes, Xinjian Li, Zongwei Zhou, Angeliki Lazaridou, Brendan McMahan, Arjun Narayanan, Hubert Soyer, Sujoy Basu, Kayi Lee, Bryan Perozzi, Qin Cao, Leonard Berrada, Rahul Arya, Ke Chen, Katrina, Xu, Matthias Lochbrunner, Alex Hofer, Sahand Sharifzadeh, Renjie Wu, Sally Goldman, Pranjal Awasthi, Xuezhi Wang, Yan Wu, Claire Sha, Biao Zhang, Maciej Mikuła, Filippo Graziano, Siobhan Mcloughlin, Irene Giannoumis, Youhei Namiki, Chase Malik, Carey Radebaugh, Jamie Hall, Ramiro Leal-Cavazos, Jianmin Chen, Vikas Sindhwani, David Kao, David 
Greene, Jordan Griffith, Chris Welty, Ceslee Montgomery, Toshihiro Yoshino, Liangzhe Yuan, Noah Goodman, Assaf Hurwitz Michaely, Kevin Lee, KP Sawhney, Wei Chen, Zheng Zheng, Megan Shum, Nikolay Savinov, Etienne Pot, Alex Pak, Morteza Zadimoghaddam, Sijal Bhatnagar, Yoad Lewenberg, Blair Kutzman, Ji Liu, Lesley Katzen, Jeremy Selier, Josip Djolonga, Dmitry Lepikhin, Kelvin Xu, Jacky Liang, Jiewen Tan, Benoit Schillings, Muge Ersoy, Pete Blois, Bernd Bandemer, Abhimanyu Singh, Sergei Lebedev, Pankaj Joshi, Adam R. Brown, Evan Palmer, Shreya Pathak, Komal Jalan, Fedir Zubach, Shuba Lall, Randall Parker, Alok Gunjan, Sergey Rogulenko, Sumit Sanghai, Zhaoqi Leng, Zoltan Egyed, Shixin Li, Maria Ivanova, Kostas Andriopoulos, Jin Xie, Elan Rosenfeld, Auriel Wright, Ankur Sharma, Xinyang Geng, Yicheng Wang, Sam Kwei, Renke Pan, Yujing Zhang, Gabby Wang, Xi Liu, Chak Yeung, Elizabeth Cole, Aviv Rosenberg, Zhen Yang, Phil Chen, George Polovets, Pranav Nair, Rohun Saxena, Josh Smith, Shuo-yiin Chang, Aroma Mahendru, Svetlana Grant, Anand Iyer, Irene Cai, Jed McGiffin, Jiaming Shen, Alanna Walton, Antonious Girgis, Oliver Woodman, Rosemary Ke, Mike Kwong, Louis Rouillard, Jinmeng Rao, Zhihao Li, Yuntao Xu, Flavien Prost, Chi Zou, Ziwei Ji, Alberto Magni, Tyler Liechty, Dan A. 
Calian, Deepak Ramachandran, Igor Krivokon, Hui Huang, Terry Chen, Anja Hauth, Anastasija Ilić, Weijuan Xi, Hyeontaek Lim, Vlad-Doru Ion, Pooya Moradi, Metin Toksoz-Exley, Kalesha Bullard, Miltos Allamanis, Xiaomeng Yang, Sophie Wang, Zhi Hong, Anita Gergely, Cheng Li, Bhavishya Mittal, Vitaly Kovalev, Victor Ungureanu, Jane Labanowski, Jan Wassenberg, Nicolas Lacasse, Geoffrey Cideron, Petar Dević, Annie Marsden, Lynn Nguyen, Michael Fink, Yin Zhong, Tatsuya Kiyono, Desi Ivanov, Sally Ma, Max Bain, Kiran Yalasangi, Jennifer She, Anastasia Petrushkina, Mayank Lunayach, Carla Bromberg, Sarah Hodkinson, Vilobh Meshram, Daniel Vlasic, Austin Kyker, Steve Xu, Jeff Stanway, Zuguang Yang, Kai Zhao, Matthew Tung, Seth Odoom, Yasuhisa Fujii, Justin Gilmer, Eunyoung Kim, Felix Halim, Quoc Le, Bernd Bohnet, Seliem El-Sayed, Behnam Neyshabur, Malcolm Reynolds, Dean Reich, Yang Xu, Erica Moreira, Anuj Sharma, Zeyu Liu, Mohammad Javad Hosseini, Naina Raisinghani, Yi Su, Ni Lao, Daniel Formoso, Marco Gelmi, Almog Gueta, Tapomay Dey, Elena Gribovskaya, Domagoj Ćevid, Sidharth Mudgal, Garrett Bingham, Jianling Wang, Anurag Kumar, Alex Cullum, Feng Han, Konstantinos Bousmalis, Diego Cedillo, Grace Chu, Vladimir Magay, Paul Michel, Ester Hlavnova, Daniele Calandriello, Setareh Ariafar, Kaisheng Yao, Vikash Sehwag, Arpi Vezer, Agustin Dal Lago, Zhenkai Zhu, Paul Kishan Rubenstein, Allen Porter, Anirudh Baddepudi, Oriana Riva, Mihai Dorin Istin, Chih-Kuan Yeh, Zhi Li, Andrew Howard, Nilpa Jha, Jeremy Chen, Raoul de Liedekerke, Zafarali Ahmed, Mikel Rodriguez, Tanuj Bhatia, Bangju Wang, Ali Elqursh, David Klinghoffer, Peter Chen, Pushmeet Kohli, Te I, Weiyang Zhang, Zack Nado, Jilin Chen, Maxwell Chen, George Zhang, Aayush Singh, Adam Hillier, Federico Lebron, Yiqing Tao, Ting Liu, Gabriel Dulac-Arnold, Jingwei Zhang, Shashi Narayan, Buhuang Liu, Orhan Firat, Abhishek Bhowmick, Bingyuan Liu, Hao Zhang, Zizhao Zhang, Georges Rotival, Nathan Howard, Anu Sinha, Alexander Grushetsky, 
Benjamin Beyret, Keerthana Gopalakrishnan, James Zhao, Kyle He, Szabolcs Payrits, Zaid Nabulsi, Zhaoyi Zhang, Weijie Chen, Edward Lee, Nova Fallen, Sreenivas Gollapudi, Aurick Zhou, Filip Pavetić, Thomas Köppe, Shiyu Huang, Rama Pasumarthi, Nick Fernando, Felix Fischer, Daria Ćurko, Yang Gao, James Svensson, Austin Stone, Haroon Qureshi, Abhishek Sinha, Apoorv Kulshreshtha, Martin Matysiak, Jieming Mao, Carl Saroufim, Aleksandra Faust, Qingnan Duan, Gil Fidel, Kaan Katircioglu, Raphaël Lopez Kaufman, Dhruv Shah, Weize Kong, Abhishek Bapna, Gellért Weisz, Emma Dunleavy, Praneet Dutta, Tianqi Liu, Rahma Chaabouni, Carolina Parada, Marcus Wu, Alexandra Belias, Alessandro Bissacco, Stanislav Fort, Li Xiao, Fantine Huot, Chris Knutsen, Yochai Blau, Gang Li, Jennifer Prendki, Juliette Love, Yinlam Chow, Pichi Charoenpanit, Hidetoshi Shimokawa, Vincent Coriou, Karol Gregor, Tomas Izo, Arjun Akula, Mario Pinto, Chris Hahn, Dominik Paulus, Jiaxian Guo, Neha Sharma, Cho-Jui Hsieh, Adaeze Chukwuka, Kazuma Hashimoto, Nathalie Rauschmayr, Ling Wu, Christof Angermueller, Yulong Wang, Sebastian Gerlach, Michael Pliskin, Daniil Mirylenka, Min Ma, Lexi Baugher, Bryan Gale, Shaan Bijwadia, Nemanja Rakićević, David Wood, Jane Park, Chung-Ching Chang, Babi Seal, Chris Tar, Kacper Krasowiak, Yiwen Song, Georgi Stephanov, Gary Wang, Marcello Maggioni, Stein Xudong Lin, Felix Wu, Shachi Paul, Zixuan Jiang, Shubham Agrawal, Bilal Piot, Alex Feng, Cheolmin Kim, Tulsee Doshi, Jonathan Lai, Chuqiao, Xu, Sharad Vikram, Ciprian Chelba, Sebastian Krause, Vincent Zhuang, Jack Rae, Timo Denk, Adrian Collister, Lotte Weerts, Xianghong Luo, Yifeng Lu, Håvard Garnes, Nitish Gupta, Terry Spitz, Avinatan Hassidim, Lihao Liang, Izhak Shafran, Peter Humphreys, Kenny Vassigh, Phil Wallis, Virat Shejwalkar, Nicolas Perez-Nieves, Rachel Hornung, Melissa Tan, Beka Westberg, Andy Ly, Richard Zhang, Brian Farris, Jongbin Park, Alec Kosik, Zeynep Cankara, Andrii Maksai, Yunhan Xu, Albin Cassirer, Sergi 
Caelles, Abbas Abdolmaleki, Mencher Chiang, Alex Fabrikant, Shravya Shetty, Luheng He, Mai Giménez, Hadi Hashemi, Sheena Panthaplackel, Yana Kulizhskaya, Salil Deshmukh, Daniele Pighin, Robin Alazard, Disha Jindal, Seb Noury, Pradeep Kumar S, Siyang Qin, Xerxes Dotiwalla, Stephen Spencer, Mohammad Babaeizadeh, Blake JianHang Chen, Vaibhav Mehta, Jennie Lees, Andrew Leach, Penporn Koanantakool, Ilia Akolzin, Ramona Comanescu, Junwhan Ahn, Alexey Svyatkovskiy, Basil Mustafa, David D’Ambrosio, Shiva Mohan Reddy Garlapati, Pascal Lamblin, Alekh Agarwal, Shuang Song, Pier Giuseppe Sessa, Pauline Coquinot, John Maggs, Hussain Masoom, Divya Pitta, Yaqing Wang, Patrick Morris-Suzuki, Billy Porter, Johnson Jia, Jeffrey Dudek, Raghavender R, Cosmin Paduraru, Alan Ansell, Tolga Bolukbasi, Tony Lu, Ramya Ganeshan, Zi Wang, Henry Griffiths, Rodrigo Benenson, Yifan He, James Swirhun, George Papamakarios, Aditya Chawla, Kuntal Sengupta, Yan Wang, Vedrana Milutinovic, Igor Mordatch, Zhipeng Jia, Jamie Smith, Will Ng, Shitij Nigam, Matt Young, Eugen Vušak, Blake Hechtman, Sheela Goenka, Avital Zipori, Kareem Ayoub, Ashok Popat, Trilok Acharya, Luo Yu, Dawn Bloxwich, Hugo Song, Paul Roit, Haiqiong Li, Aviel Boag, Nigamaa Nayakanti, Bilva Chandra, Tianli Ding, Aahil Mehta, Cath Hope, Jiageng Zhang, Idan Heimlich Shtacher, Kartikeya Badola, Ryo Nakashima, Andrei Sozanschi, Iulia Comşa, Ante Žužul, Emily Caveness, Julian Odell, Matthew Watson, Dario de Cesare, Phillip Lippe, Derek Lockhart, Siddharth Verma, Huizhong Chen, Sean Sun, Lin Zhuo, Aditya Shah, Prakhar Gupta, Alex Muzio, Ning Niu, Amir Zait, Abhinav Singh, Meenu Gaba, Fan Ye, Prajit Ramachandran, Mohammad Saleh, Raluca Ada Popa, Ayush Dubey, Frederick Liu, Sara Javanmardi, Mark Epstein, Ross Hemsley, Richard Green, Nishant Ranka, Eden Cohen, Chuyuan Kelly Fu, Sanjay Ghemawat, Jed Borovik, James Martens, Anthony Chen, Pranav Shyam, André Susano Pinto, Ming-Hsuan Yang, Alexandru Ţifrea, David Du, Boqing Gong, Ayushi Agarwal, 
Seungyeon Kim, Christian Frank, Saloni Shah, Xiaodan Song, Zhiwei Deng, Ales Mikhalap, Kleopatra Chatziprimou, Timothy Chung, Toni Creswell, Susan Zhang, Yennie Jun, Carl Lebsack, Will Truong, Slavica Andačić, Itay Yona, Marco Fornoni, Rong Rong, Serge Toropov, Afzal Shama Soudagar, Andrew Audibert, Salah Zaiem, Zaheer Abbas, Andrei Rusu, Sahitya Potluri, Shitao Weng, Anastasios Kementsietsidis, Anton Tsitsulin, Daiyi Peng, Natalie Ha, Sanil Jain, Tejasi Latkar, Simeon Ivanov, Cory McLean, Anirudh GP, Rajesh Venkataraman, Canoee Liu, Dilip Krishnan, Joel D’sa, Roey Yogev, Paul Collins, Benjamin Lee, Lewis Ho, Carl Doersch, Gal Yona, Shawn Gao, Felipe Tiengo Ferreira, Adnan Ozturel, Hannah Muckenhirn, Ce Zheng, Gargi Balasubramaniam, Mudit Bansal, George van den Driessche, Sivan Eiger, Salem Haykal, Vedant Misra, Abhimanyu Goyal, Danilo Martins, Gary Leung, Jonas Valfridsson, Four Flynn, Will Bishop, Chenxi Pang, Yoni Halpern, Honglin Yu, Lawrence Moore, Yuvein, Zhu, Sridhar Thiagarajan, Yoel Drori, Zhisheng Xiao, Lucio Dery, Rolf Jagerman, Jing Lu, Eric Ge, Vaibhav Aggarwal, Arjun Khare, Vinh Tran, Oded Elyada, Ferran Alet, James Rubin, Ian Chou, David Tian, Libin Bai, Lawrence Chan, Lukasz Lew, Karolis Misiunas, Taylan Bilal, Aniket Ray, Sindhu Raghuram, Alex Castro-Ros, Viral Carpenter, CJ Zheng, Michael Kilgore, Josef Broder, Emily Xue, Praveen Kallakuri, Dheeru Dua, Nancy Yuen, Steve Chien, John Schultz, Saurabh Agrawal, Reut Tsarfaty, Jingcao Hu, Ajay Kannan, Dror Marcus, Nisarg Kothari, Baochen Sun, Ben Horn, Matko Bošnjak, Ferjad Naeem, Dean Hirsch, Lewis Chiang, Boya Fang, Jie Han, Qifei Wang, Ben Hora, Antoine He, Mario Lučić, Beer Changpinyo, Anshuman Tripathi, John Youssef, Chester Kwak, Philippe Schlattner, Cat Graves, Rémi Leblond, Wenjun Zeng, Anders Andreassen, Gabriel Rasskin, Yue Song, Eddie Cao, Junhyuk Oh, Matt Hoffman, Wojtek Skut, Yichi Zhang, Jon Stritar, Xingyu Cai, Saarthak Khanna, Kathie Wang, Shriya Sharma, Christian Reisswig, Younghoon 
Jun, Aman Prasad, Tatiana Sholokhova, Preeti Singh, Adi Gerzi Rosenthal, Anian Ruoss, Françoise Beaufays, Sean Kirmani, Dongkai Chen, Johan Schalkwyk, Jonathan Herzig, Been Kim, Josh Jacob, Damien Vincent, Adrian N Reyes, Ivana Balazevic, Léonard Hussenot, Jon Schneider, Parker Barnes, Luis Castro, Spandana Raj Babbula, Simon Green, Serkan Cabi, Nico Duduta, Danny Driess, Rich Galt, Noam Velan, Junjie Wang, Hongyang Jiao, Matthew Mauger, Du Phan, Miteyan Patel, Vlado Galić, Jerry Chang, Eyal Marcus, Matt Harvey, Julian Salazar, Elahe Dabir, Suraj Satishkumar Sheth, Amol Mandhane, Hanie Sedghi, Jeremiah Willcock, Amir Zandieh, Shruthi Prabhakara, Aida Amini, Antoine Miech, Victor Stone, Massimo Nicosia, Paul Niemczyk, Ying Xiao, Lucy Kim, Sławek Kwasiborski, Vikas Verma, Ada Maksutaj Oflazer, Christoph Hirnschall, Peter Sung, Lu Liu, Richard Everett, Michiel Bakker, Ágoston Weisz, Yufei Wang, Vivek Sampathkumar, Uri Shaham, Bibo Xu, Yasemin Altun, Mingqiu Wang, Takaaki Saeki, Guanjie Chen, Emanuel Taropa, Shanthal Vasanth, Sophia Austin, Lu Huang, Goran Petrovic, Qingyun Dou, Daniel Golovin, Grigory Rozhdestvenskiy, Allie Culp, Will Wu, Motoki Sano, Divya Jain, Julia Proskurnia, Sébastien Cevey, Alejandro Cruzado Ruiz, Piyush Patil, Mahdi Mirzazadeh, Eric Ni, Javier Snaider, Lijie Fan, Alexandre Fréchette, AJ Pierigiovanni, Shariq Iqbal, Kenton Lee, Claudio Fantacci, Jinwei Xing, Lisa Wang, Alex Irpan, David Raposo, Yi Luan, Zhuoyuan Chen, Harish Ganapathy, Kevin Hui, Jiazhong Nie, Isabelle Guyon, Heming Ge, Roopali Vij, Hui Zheng, Dayeong Lee, Alfonso Castaño, Khuslen Baatarsukh, Gabriel Ibagon, Alexandra Chronopoulou, Nicholas FitzGerald, Shashank Viswanadha, Safeen Huda, Rivka Moroshko, Georgi Stoyanov, Prateek Kolhar, Alain Vaucher, Ishaan Watts, Adhi Kuncoro, Henryk Michalewski, Satish Kambala, Bat-Orgil Batsaikhan, Alek Andreev, Irina Jurenka, Maigo Le, Qihang Chen, Wael Al Jishi, Sarah Chakera, Zhe Chen, Aditya Kini, Vikas Yadav, Aditya Siddhant, Ilia 
Labzovsky, Balaji Lakshminarayanan, Carrie Grimes Bostock, Pankil Botadra, Ankesh Anand, Colton Bishop, Sam Conway-Rahman, Mohit Agarwal, Yani Donchev, Achintya Singhal, Félix de Chaumont Quitry, Natalia Ponomareva, Nishant Agrawal, Bin Ni, Kalpesh Krishna, Masha Samsikova, John Karro, Yilun Du, Tamara von Glehn, Caden Lu, Christopher A. Choquette-Choo, Zhen Qin, Tingnan Zhang, Sicheng Li, Divya Tyam, Swaroop Mishra, Wing Lowe, Colin Ji, Weiyi Wang, Manaal Faruqui, Ambrose Slone, Valentin Dalibard, Arunachalam Narayanaswamy, John Lambert, Pierre-Antoine Manzagol, Dan Karliner, Andrew Bolt, Ivan Lobov, Aditya Kusupati, Chang Ye, Xuan Yang, Heiga Zen, Nelson George, Mukul Bhutani, Olivier Lacombe, Robert Riachi, Gagan Bansal, Rachel Soh, Yue Gao, Yang Yu, Adams Yu, Emily Nottage, Tania Rojas-Esponda, James Noraky, Manish Gupta, Ragha Kotikalapudi, Jichuan Chang, Sanja Deur, Dan Graur, Alex Mossin, Erin Farnese, Ricardo Figueira, Alexandre Moufarek, Austin Huang, Patrik Zochbauer, Ben Ingram, Tongzhou Chen, Zelin Wu, Adrià Puigdomènech, Leland Rechis, Da Yu, Sri Gayatri Sundara Padmanabhan, Rui Zhu, Chu-ling Ko, Andrea Banino, Samira Daruki, Aarush Selvan, Dhruva Bhaswar, Daniel Hernandez Diaz, Chen Su, Salvatore Scellato, Jennifer Brennan, Woohyun Han, Grace Chung, Priyanka Agrawal, Urvashi Khandelwal, Khe Chai Sim, Morgane Lustman, Sam Ritter, Kelvin Guu, Jiawei Xia, Prateek Jain, Emma Wang, Tyrone Hill, Mirko Rossini, Marija Kostelac, Tautvydas Misiunas, Amit Sabne, Kyuyeun Kim, Ahmet Iscen, Congchao Wang, José Leal, Ashwin Sreevatsa, Utku Evci, Manfred Warmuth, Saket Joshi, Daniel Suo, James Lottes, Garrett Honke, Brendan Jou, Stefani Karp, Jieru Hu, Himanshu Sahni, Adrien Ali Taïga, William Kong, Samrat Ghosh, Renshen Wang, Jay Pavagadhi, Natalie Axelsson, Nikolai Grigorev, Patrick Siegler, Rebecca Lin, Guohui Wang, Emilio Parisotto, Sharath Maddineni, Krishan Subudhi, Eyal Ben-David, Elena Pochernina, Orgad Keller, Thi Avrahami, Zhe Yuan, Pulkit Mehta, Jialu 
Liu, Sherry Yang, Wendy Kan, Katherine Lee, Tom Funkhouser, Derek Cheng, Hongzhi Shi, Archit Sharma, Joe Kelley, Matan Eyal, Yury Malkov, Corentin Tallec, Yuval Bahat, Shen Yan, Xintian, Wu, David Lindner, Chengda Wu, Avi Caciularu, Xiyang Luo, Rodolphe Jenatton, Tim Zaman, Yingying Bi, Ilya Kornakov, Ganesh Mallya, Daisuke Ikeda, Itay Karo, Anima Singh, Colin Evans, Praneeth Netrapalli, Vincent Nallatamby, Isaac Tian, Yannis Assael, Vikas Raunak, Victor Carbune, Ioana Bica, Lior Madmoni, Dee Cattle, Snchit Grover, Krishna Somandepalli, Sid Lall, Amelio Vázquez-Reina, Riccardo Patana, Jiaqi Mu, Pranav Talluri, Maggie Tran, Rajeev Aggarwal, RJ Skerry-Ryan, Jun Xu, Mike Burrows, Xiaoyue Pan, Edouard Yvinec, Di Lu, Zhiying Zhang, Duc Dung Nguyen, Hairong Mu, Gabriel Barcik, Helen Ran, Lauren Beltrone, Krzysztof Choromanski, Dia Kharrat, Samuel Albanie, Sean Purser-haskell, David Bieber, Carrie Zhang, Jing Wang, Tom Hudson, Zhiyuan Zhang, Han Fu, Johannes Mauerer, Mohammad Hossein Bateni, AJ Maschinot, Bing Wang, Muye Zhu, Arjun Pillai, Tobias Weyand, Shuang Liu, Oscar Akerlund, Fred Bertsch, Vittal Premachandran, Alicia Jin, Vincent Roulet, Peter de Boursac, Shubham Mittal, Ndaba Ndebele, Georgi Karadzhov, Sahra Ghalebikesabi, Ricky Liang, Allen Wu, Yale Cong, Nimesh Ghelani, Sumeet Singh, Bahar Fatemi, Warren, Chen, Charles Kwong, Alexey Kolganov, Steve Li, Richard Song, Chenkai Kuang, Sobhan Miryoosefi, Dale Webster, James Wendt, Arkadiusz Socala, Guolong Su, Artur Mendonça, Abhinav Gupta, Xiaowei Li, Tomy Tsai, Qiong, Hu, Kai Kang, Angie Chen, Sertan Girgin, Yongqin Xian, Andrew Lee, Nolan Ramsden, Leslie Baker, Madeleine Clare Elish, Varvara Krayvanova, Rishabh Joshi, Jiri Simsa, Yao-Yuan Yang, Piotr Ambroszczyk, Dipankar Ghosh, Arjun Kar, Yuan Shangguan, Yumeya Yamamori, Yaroslav Akulov, Andy Brock, Haotian Tang, Siddharth Vashishtha, Rich Munoz, Andreas Steiner, Kalyan Andra, Daniel Eppens, Qixuan Feng, Hayato Kobayashi, Sasha Goldshtein, Mona El Mahdy, Xin 
Wang, Jilei, Wang, Richard Killam, Tom Kwiatkowski, Kavya Kopparapu, Serena Zhan, Chao Jia, Alexei Bendebury, Sheryl Luo, Adrià Recasens, Timothy Knight, Jing Chen, Mohak Patel, YaGuang Li, Ben Withbroe, Dean Weesner, Kush Bhatia, Jie Ren, Danielle Eisenbud, Ebrahim Songhori, Yanhua Sun, Travis Choma, Tasos Kementsietsidis, Lucas Manning, Brian Roark, Wael Farhan, Jie Feng, Susheel Tatineni, James Cobon-Kerr, Yunjie Li, Lisa Anne Hendricks, Isaac Noble, Chris Breaux, Nate Kushman, Liqian Peng, Fuzhao Xue, Taylor Tobin, Jamie Rogers, Josh Lipschultz, Chris Alberti, Alexey Vlaskin, Mostafa Dehghani, Roshan Sharma, Tris Warkentin, Chen-Yu Lee, Benigno Uria, Da-Cheng Juan, Angad Chandorkar, Hila Sheftel, Ruibo Liu, Elnaz Davoodi, Borja De Balle Pigem, Kedar Dhamdhere, David Ross, Jonathan Hoech, Mahdis Mahdieh, Li Liu, Qiujia Li, Liam McCafferty, Chenxi Liu, Markus Mircea, Yunting Song, Omkar Savant, Alaa Saade, Colin Cherry, Vincent Hellendoorn, Siddharth Goyal, Paul Pucciarelli, David Vilar Torres, Zohar Yahav, Hyo Lee, Lars Lowe Sjoesund, Christo Kirov, Bo Chang, Deepanway Ghoshal, Lu Li, Gilles Baechler, Sébastien Pereira, Tara Sainath, Anudhyan Boral, Dominik Grewe, Afief Halumi, Nguyet Minh Phu, Tianxiao Shen, Marco Tulio Ribeiro, Dhriti Varma, Alex Kaskasoli, Vlad Feinberg, Navneet Potti, Jarrod Kahn, Matheus Wisniewski, Shakir Mohamed, Arnar Mar Hrafnkelsson, Bobak Shahriari, Jean-Baptiste Lespiau, Lisa Patel, Legg Yeung, Tom Paine, Lantao Mei, Alex Ramirez, Rakesh Shivanna, Li Zhong, Josh Woodward, Guilherme Tubone, Samira Khan, Heng Chen, Elizabeth Nielsen, Catalin Ionescu, Utsav Prabhu, Mingcen Gao, Qingze Wang, Sean Augenstein, Neesha Subramaniam, Jason Chang, Fotis Iliopoulos, Jiaming Luo, Myriam Khan, Weicheng Kuo, Denis Teplyashin, Florence Perot, Logan Kilpatrick, Amir Globerson, Hongkun Yu, Anfal Siddiqui, Nick Sukhanov, Arun Kandoor, Umang Gupta, Marco Andreetto, Moran Ambar, Donnie Kim, Paweł Wesołowski, Sarah Perrin, Ben Limonchik, Wei Fan, Jim 
Stephan, Ian Stewart-Binks, Ryan Kappedal, Tong He, Sarah Cogan, Romina Datta, Tong Zhou, Jiayu Ye, Leandro Kieliger, Ana Ramalho, Kyle Kastner, Fabian Mentzer, Wei-Jen Ko, Arun Suggala, Tianhao Zhou, Shiraz Butt, Hana Strejček, Lior Belenki, Subhashini Venugopalan, Mingyang Ling, Evgenii Eltyshev, Yunxiao Deng, Geza Kovacs, Mukund Raghavachari, Hanjun Dai, Tal Schuster, Steven Schwarcz, Richard Nguyen, Arthur Nguyen, Gavin Buttimore, Shrestha Basu Mallick, Sudeep Gandhe, Seth Benjamin, Michal Jastrzebski, Le Yan, Sugato Basu, Chris Apps, Isabel Edkins, James Allingham, Immanuel Odisho, Tomas Kocisky, Jewel Zhao, Linting Xue, Apoorv Reddy, Chrysovalantis Anastasiou, Aviel Atias, Sam Redmond, Kieran Milan, Nicolas Heess, Herman Schmit, Allan Dafoe, Daniel Andor, Tynan Gangwani, Anca Dragan, Sheng Zhang, Ashyana Kachra, Gang Wu, Siyang Xue, Kevin Aydin, Siqi Liu, Yuxiang Zhou, Mahan Malihi, Austin Wu, Siddharth Gopal, Candice Schumann, Peter Stys, Alek Wang, Mirek Olšák, Dangyi Liu, Christian Schallhart, Yiran Mao, Demetra Brady, Hao Xu, Tomas Mery, Chawin Sitawarin, Siva Velusamy, Tom Cobley, Alex Zhai, Christian Walder, Nitzan Katz, Ganesh Jawahar, Chinmay Kulkarni, Antoine Yang, Adam Paszke, Yinan Wang, Bogdan Damoc, Zalán Borsos, Ray Smith, Jinning Li, Mansi Gupta, Andrei Kapishnikov, Sushant Prakash, Florian Luisier, Rishabh Agarwal, Will Grathwohl, Kuangyuan Chen, Kehang Han, Nikhil Mehta, Andrew Over, Shekoofeh Azizi, Lei Meng, Niccolò Dal Santo, Kelvin Zheng, Jane Shapiro, Igor Petrovski, Jeffrey Hui, Amin Ghafouri, Jasper Snoek, James Qin, Mandy Jordan, Caitlin Sikora, Jonathan Malmaud, Yuheng Kuang, Aga Świetlik, Ruoxin Sang, Chongyang Shi, Leon Li, Andrew Rosenberg, Shubin Zhao, Andy Crawford, Jan-Thorsten Peter, Yun Lei, Xavier Garcia, Long Le, Todd Wang, Julien Amelot, Dave Orr, Praneeth Kacham, Dana Alon, Gladys Tyen, Abhinav Arora, James Lyon, Alex Kurakin, Mimi Ly, Theo Guidroz, Zhipeng Yan, Rina Panigrahy, Pingmei Xu, Thais Kagohara, Yong Cheng, Eric 
Noland, Jinhyuk Lee, Jonathan Lee, Cathy Yip, Maria Wang, Efrat Nehoran, Alexander Bykovsky, Zhihao Shan, Ankit Bhagatwala, Chaochao Yan, Jie Tan, Guillermo Garrido, Dan Ethier, Nate Hurley, Grace Vesom, Xu Chen, Siyuan Qiao, Abhishek Nayyar, Julian Walker, Paramjit Sandhu, Mihaela Rosca, Danny Swisher, Mikhail Dektiarev, Josh Dillon, George-Cristian Muraru, Manuel Tragut, Artiom Myaskovsky, David Reid, Marko Velic, Owen Xiao, Jasmine George, Mark Brand, Jing Li, Wenhao Yu, Shane Gu, Xiang Deng, François-Xavier Aubet, Soheil Hassas Yeganeh, Fred Alcober, Celine Smith, Trevor Cohn, Kay McKinney, Michael Tschannen, Ramesh Sampath, Gowoon Cheon, Liangchen Luo, Luyang Liu, Jordi Orbay, Hui Peng, Gabriela Botea, Xiaofan Zhang, Charles Yoon, Cesar Magalhaes, Paweł Stradomski, Ian Mackinnon, Steven Hemingray, Kumaran Venkatesan, Rhys May, Jaeyoun Kim, Alex Druinsky, Jingchen Ye, Zheng Xu, Terry Huang, Jad Al Abdallah, Adil Dostmohamed, Rachana Fellinger, Tsendsuren Munkhdalai, Akanksha Maurya, Peter Garst, Yin Zhang, Maxim Krikun, Simon Bucher, Aditya Srikanth Veerubhotla, Yaxin Liu, Sheng Li, Nishesh Gupta, Jakub Adamek, Hanwen Chen, Bernett Orlando, Aleksandr Zaks, Joost van Amersfoort, Josh Camp, Hui Wan, HyunJeong Choe, Zhichun Wu, Kate Olszewska, Weiren Yu, Archita Vadali, Martin Scholz, Daniel De Freitas, Jason Lin, Amy Hua, Xin Liu, Frank Ding, Yichao Zhou, Boone Severson, Katerina Tsihlas, Samuel Yang, Tammo Spalink, Varun Yerram, Helena Pankov, Rory Blevins, Ben Vargas, Sarthak Jauhari, Matt Miecnikowski, Ming Zhang, Sandeep Kumar, Clement Farabet, Charline Le Lan, Sebastian Flennerhag, Yonatan Bitton, Ada Ma, Arthur Bražinskas, Eli Collins, Niharika Ahuja, Sneha Kudugunta, Anna Bortsova, Minh Giang, Wanzheng Zhu, Ed Chi, Scott Lundberg, Alexey Stern, Subha Puttagunta, Jing Xiong, Xiao Wu, Yash Pande, Amit Jhindal, Daniel Murphy, Jon Clark, Marc Brockschmidt, Maxine Deines, Kevin R. 
McKee, Dan Bahir, Jiajun Shen, Minh Truong, Daniel McDuff, Andrea Gesmundo, Edouard Rosseel, Bowen Liang, Ken Caluwaerts, Jessica Hamrick, Joseph Kready, Mary Cassin, Rishikesh Ingale, Li Lao, Scott Pollom, Yifan Ding, Wei He, Lizzetth Bellot, Joana Iljazi, Ramya Sree Boppana, Shan Han, Tara Thompson, Amr Khalifa, Anna Bulanova, Blagoj Mitrevski, Bo Pang, Emma Cooney, Tian Shi, Rey Coaguila, Tamar Yakar, Marc’aurelio Ranzato, Nikola Momchev, Chris Rawles, Zachary Charles, Young Maeng, Yuan Zhang, Rishabh Bansal, Xiaokai Zhao, Brian Albert, Yuan Yuan, Sudheendra Vijayanarasimhan, Roy Hirsch, Vinay Ramasesh, Kiran Vodrahalli, Xingyu Wang, Arushi Gupta, DJ Strouse, Jianmo Ni, Roma Patel, Gabe Taubman, Zhouyuan Huo, Dero Gharibian, Marianne Monteiro, Hoi Lam, Shobha Vasudevan, Aditi Chaudhary, Isabela Albuquerque, Kilol Gupta, Sebastian Riedel, Chaitra Hegde, Avraham Ruderman, András György, Marcus Wainwright, Ashwin Chaugule, Burcu Karagol Ayan, Tomer Levinboim, Sam Shleifer, Yogesh Kalley, Vahab Mirrokni, Abhishek Rao, Prabakar Radhakrishnan, Jay Hartford, Jialin Wu, Zhenhai Zhu, Francesco Bertolini, Hao Xiong, Nicolas Serrano, Hamish Tomlinson, Myle Ott, Yifan Chang, Mark Graham, Jian Li, Marco Liang, Xiangzhu Long, Sebastian Borgeaud, Yanif Ahmad, Alex Grills, Diana Mincu, Martin Izzard, Yuan Liu, Jinyu Xie, Louis O’Bryan, Sameera Ponda, Simon Tong, Michelle Liu, Dan Malkin, Khalid Salama, Yuankai Chen, Rohan Anil, Anand Rao, Rigel Swavely, Misha Bilenko, Nina Anderson, Tat Tan, Jing Xie, Xing Wu, Lijun Yu, Oriol Vinyals, Andrey Ryabtsev, Rumen Dangovski, Kate Baumli, Daniel Keysers, Christian Wright, Zoe Ashwood, Betty Chan, Artem Shtefan, Yaohui Guo, Ankur Bapna, Radu Soricut, Steven Pecht, Sabela Ramos, Rui Wang, Jiahao Cai, Trieu Trinh, Paul Barham, Linda Friso, Eli Stickgold, Xiangzhuo Ding, Siamak Shakeri, Diego Ardila, Eleftheria Briakou, Phil Culliton, Adam Raveret, Jingyu Cui, David Saxton, Subhrajit Roy, Javad Azizi, Pengcheng Yin, Lucia Loher, Andrew 
Bunner, Min Choi, Faruk Ahmed, Eric Li, Yin Li, Shengyang Dai, Michael Elabd, Sriram Ganapathy, Shivani Agrawal, Yiqing Hua, Paige Kunkle, Sujeevan Rajayogam, Arun Ahuja, Arthur Conmy, Alex Vasiloff, Parker Beak, Christopher Yew, Jayaram Mudigonda, Bartek Wydrowski, Jon Blanton, Zhengdong Wang, Yann Dauphin, Zhuo Xu, Martin Polacek, Xi Chen, Hexiang Hu, Pauline Sho, Markus Kunesch, Mehdi Hafezi Manshadi, Eliza Rutherford, Bo Li, Sissie Hsiao, Iain Barr, Alex Tudor, Matija Kecman, Arsha Nagrani, Vladimir Pchelin, Martin Sundermeyer, Aishwarya P S, Abhijit Karmarkar, Yi Gao, Grishma Chole, Olivier Bachem, Isabel Gao, Arturo BC, Matt Dibb, Mauro Verzetti, Felix Hernandez-Campos, Yana Lunts, Matthew Johnson, Julia Di Trapani, Raphael Koster, Idan Brusilovsky, Binbin Xiong, Megha Mohabey, Han Ke, Joe Zou, Tea Sabolić, Víctor Campos, John Palowitch, Alex Morris, Linhai Qiu, Pranavaraj Ponnuramu, Fangtao Li, Vivek Sharma, Kiranbir Sodhia, Kaan Tekelioglu, Aleksandr Chuklin, Madhavi Yenugula, Erika Gemzer, Theofilos Strinopoulos, Sam El-Husseini, Huiyu Wang, Yan Zhong, Edouard Leurent, Paul Natsev, Weijun Wang, Dre Mahaarachchi, Tao Zhu, Songyou Peng, Sami Alabed, Cheng-Chun Lee, Anthony Brohan, Arthur Szlam, GS Oh, Anton Kovsharov, Jenny Lee, Renee Wong, Megan Barnes, Gregory Thornton, Felix Gimeno, Omer Levy, Martin Sevenich, Melvin Johnson, Jonathan Mallinson, Robert Dadashi, Ziyue Wang, Qingchun Ren, Preethi Lahoti, Arka Dhar, Josh Feldman, Dan Zheng, Thatcher Ulrich, Liviu Panait, Michiel Blokzijl, Cip Baetu, Josip Matak, Jitendra Harlalka, Maulik Shah, Tal Marian, Daniel von Dincklage, Cosmo Du, Ruy Ley-Wild, Bethanie Brownfield, Max Schumacher, Yury Stuken, Shadi Noghabi, Sonal Gupta, Xiaoqi Ren, Eric Malmi, Felix Weissenberger, Blanca Huergo, Maria Bauza, Thomas Lampe, Arthur Douillard, Mojtaba Seyedhosseini, Roy Frostig, Zoubin Ghahramani, Kelvin Nguyen, Kashyap Krishnakumar, Chengxi Ye, Rahul Gupta, Alireza Nazari, Robert Geirhos, Pete Shaw, Ahmed Eleryan, Dima 
Damen, Jennimaria Palomaki, Ted Xiao, Qiyin Wu, Quan Yuan, Phoenix Meadowlark, Matthew Bilotti, Raymond Lin, Mukund Sridhar, Yannick Schroecker, Da-Woon Chung, Jincheng Luo, Trevor Strohman, Tianlin Liu, Anne Zheng, Jesse Emond, Wei Wang, Andrew Lampinen, Toshiyuki Fukuzawa, Folawiyo Campbell-Ajala, Monica Roy, James Lee-Thorp, Lily Wang, Iftekhar Naim, Tony Nguyễn, Guy Bensky, Aditya Gupta, Dominika Rogozińska, Justin Fu, Thanumalayan Sankaranarayana Pillai, Petar Veličković, Shahar Drath, Philipp Neubeck, Vaibhav Tulsyan, Arseniy Klimovskiy, Don Metzler, Sage Stevens, Angel Yeh, Junwei Yuan, Tianhe Yu, Kelvin Zhang, Alec Go, Vincent Tsang, Ying Xu, Andy Wan, Isaac Galatzer-Levy, Sam Sobell, Abodunrinwa Toki, Elizabeth Salesky, Wenlei Zhou, Diego Antognini, Sholto Douglas, Shimu Wu, Adam Lelkes, Frank Kim, Paul Cavallaro, Ana Salazar, Yuchi Liu, James Besley, Tiziana Refice, Yiling Jia, Zhang Li, Michal Sokolik, Arvind Kannan, Jon Simon, Jo Chick, Avia Aharon, Meet Gandhi, Mayank Daswani, Keyvan Amiri, Vighnesh Birodkar, Abe Ittycheriah, Peter Grabowski, Oscar Chang, Charles Sutton, Zhixin Lai, Umesh Telang, Susie Sargsyan, Tao Jiang, Raphael Hoffmann, Nicole Brichtova, Matteo Hessel, Jonathan Halcrow, Sammy Jerome, Geoff Brown, Alex Tomala, Elena Buchatskaya, Dian Yu, Sachit Menon, Pol Moreno, Yuguo Liao, Vicky Zayats, Luming Tang, SQ Mah, Ashish Shenoy, Alex Siegman, Majid Hadian, Okwan Kwon, Tao Tu, Nima Khajehnouri, Ryan Foley, Parisa Haghani, Zhongru Wu, Vaishakh Keshava, Khyatti Gupta, Tony Bruguier, Rui Yao, Danny Karmon, Luisa Zintgraf, Zhicheng Wang, Enrique Piqueras, Junehyuk Jung, Jenny Brennan, Diego Machado, Marissa Giustina, MH Tessler, Kamyu Lee, Qiao Zhang, Joss Moore, Kaspar Daugaard, Alexander Frömmgen, Jennifer Beattie, Fred Zhang, Daniel Kasenberg, Ty Geri, Danfeng Qin, Gaurav Singh Tomar, Tom Ouyang, Tianli Yu, Luowei Zhou, Rajiv Mathews, Andy Davis, Yaoyiran Li, Jai Gupta, Damion Yates, Linda Deng, Elizabeth Kemp, Ga-Young Joung, Sergei 
Vassilvitskii, Mandy Guo, Pallavi LV, Dave Dopson, Sami Lachgar, Lara McConnaughey, Himadri Choudhury, Dragos Dena, Aaron Cohen, Joshua Ainslie, Sergey Levi, Parthasarathy Gopavarapu, Polina Zablotskaia, Hugo Vallet, Sanaz Bahargam, Xiaodan Tang, Nenad Tomasev, Ethan Dyer, Daniel Balle, Hongrae Lee, William Bono, Jorge Gonzalez Mendez, Vadim Zubov, Shentao Yang, Ivor Rendulic, Yanyan Zheng, Andrew Hogue, Golan Pundak, Ralph Leith, Avishkar Bhoopchand, Michael Han, Mislav Žanić, Tom Schaul, Manolis Delakis, Tejas Iyer, Guanyu Wang, Harman Singh, Abdelrahman Abdelhamed, Tara Thomas, Siddhartha Brahma, Hilal Dib, Naveen Kumar, Wenxuan Zhou, Liang Bai, Pushkar Mishra, Jiao Sun, Valentin Anklin, Roykrong Sukkerd, Lauren Agubuzu, Anton Briukhov, Anmol Gulati, Maximilian Sieb, Fabio Pardo, Sara Nasso, Junquan Chen, Kexin Zhu, Tiberiu Sosea, Alex Goldin, Keith Rush, Spurthi Amba Hombaiah, Andreas Noever, Allan Zhou, Sam Haves, Mary Phuong, Jake Ades, Yi-ting Chen, Lin Yang, Joseph Pagadora, Stan Bileschi, Victor Cotruta, Rachel Saputro, Arijit Pramanik, Sean Ammirati, Dan Garrette, Kevin Villela, Tim Blyth, Canfer Akbulut, Neha Jha, Alban Rrustemi, Arissa Wongpanich, Chirag Nagpal, Yonghui Wu, Morgane Rivière, Sergey Kishchenko, Pranesh Srinivasan, Alice Chen, Animesh Sinha, Trang Pham, Bill Jia, Tom Hennigan, Anton Bakalov, Nithya Attaluri, Drew Garmon, Daniel Rodriguez, Dawid Wegner, Wenhao Jia, Evan Senter, Noah Fiedel, Denis Petek, Yuchuan Liu, Cassidy Hardin, Harshal Tushar Lehri, Joao Carreira, Sara Smoot, Marcel Prasetya, Nami Akazawa, Anca Stefanoiu, Chia-Hua Ho, Anelia Angelova, Kate Lin, Min Kim, Charles Chen, Marcin Sieniek, Alice Li, Tongfei Guo, Sorin Baltateanu, Pouya Tafti, Michael Wunder, Nadav Olmert, Divyansh Shukla, Jingwei Shen, Neel Kovelamudi, Balaji Venkatraman, Seth Neel, Romal Thoppilan, Jerome Connor, Frederik Benzing, Axel Stjerngren, Golnaz Ghiasi, Alex Polozov, Joshua Howland, Theophane Weber, Justin Chiu, Ganesh Poomal Girirajan, Andreas 
Terzis, Pidong Wang, Fangda Li, Yoav Ben Shalom, Dinesh Tewari, Matthew Denton, Roee Aharoni, Norbert Kalb, Heri Zhao, Junlin Zhang, Angelos Filos, Matthew Rahtz, Lalit Jain, Connie Fan, Vitor Rodrigues, Ruth Wang, Richard Shin, Jacob Austin, Roman Ring, Mariella Sanchez-Vargas, Mehadi Hassen, Ido Kessler, Uri Alon, Gufeng Zhang, Wenhu Chen, Yenai Ma, Xiance Si, Le Hou, Azalia Mirhoseini, Marc Wilson, Geoff Bacon, Becca Roelofs, Lei Shu, Gautam Vasudevan, Jonas Adler, Artur Dwornik, Tayfun Terzi, Matt Lawlor, Harry Askham, Mike Bernico, Xuanyi Dong, Chris Hidey, Kevin Kilgour, Gaël Liu, Surya Bhupatiraju, Luke Leonhard, Siqi Zuo, Partha Talukdar, Qing Wei, Aliaksei Severyn, Vít Listík, Jong Lee, Aditya Tripathi, SK Park, Yossi Matias, Hao Liu, Alex Ruiz, Rajesh Jayaram, Jackson Tolins, Pierre Marcenac, Yiming Wang, Bryan Seybold, Henry Prior, Deepak Sharma, Jack Weber, Mikhail Sirotenko, Yunhsuan Sung, Dayou Du, Ellie Pavlick, Stefan Zinke, Markus Freitag, Max Dylla, Montse Gonzalez Arenas, Natan Potikha, Omer Goldman, Connie Tao, Rachita Chhaparia, Maria Voitovich, Pawan Dogra, Andrija Ražnatović, Zak Tsai, Chong You, Oleaser Johnson, George Tucker, Chenjie Gu, Jae Yoo, Maryam Majzoubi, Valentin Gabeur, Bahram Raad, Rocky Rhodes, Kashyap Kolipaka, Heidi Howard, Geta Sampemane, Benny Li, Chulayuth Asawaroengchai, Duy Nguyen, Chiyuan Zhang, Timothee Cour, Xinxin Yu, Zhao Fu, Joe Jiang, Po-Sen Huang, Gabriela Surita, Iñaki Iturrate, Yael Karov, Michael Collins, Martin Baeuml, Fabian Fuchs, Shilpa Shetty, Swaroop Ramaswamy, Sayna Ebrahimi, Qiuchen Guo, Jeremy Shar, Gabe Barth-Maron, Sravanti Addepalli, Bryan Richter, Chin-Yi Cheng, Eugénie Rives, Fei Zheng, Johannes Griesser, Nishanth Dikkala, Yoel Zeldes, Ilkin Safarli, Dipanjan Das, Himanshu Srivastava, Sadh MNM Khan, Xin Li, Aditya Pandey, Larisa Markeeva, Dan Belov, Qiqi Yan, Mikołaj Rybiński, Tao Chen, Megha Nawhal, Michael Quinn, Vineetha Govindaraj, Sarah York, Reed Roberts, Roopal Garg, Namrata Godbole, Jake 
Abernethy, Anil Das, Lam Nguyen Thiet, Jonathan Tompson, John Nham, Neera Vats, Ben Caine, Wesley Helmholz, Francesco Pongetti, Yeongil Ko, James An, Clara Huiyi Hu, Yu-Cheng Ling, Julia Pawar, Robert Leland, Keisuke Kinoshita, Waleed Khawaja, Marco Selvi, Eugene Ie, Danila Sinopalnikov, Lev Proleev, Nilesh Tripuraneni, Michele Bevilacqua, Seungji Lee, Clayton Sanford, Dan Suh, Dustin Tran, Jeff Dean, Simon Baumgartner, Jens Heitkaemper, Sagar Gubbi, Kristina Toutanova, Yichong Xu, Chandu Thekkath, Keran Rong, Palak Jain, Annie Xie, Yan Virin, Yang Li, Lubo Litchev, Richard Powell, Tarun Bharti, Adam Kraft, Nan Hua, Marissa Ikonomidis, Ayal Hitron, Sanjiv Kumar, Loic Matthey, Sophie Bridgers, Lauren Lax, Ishaan Malhi, Ondrej Skopek, Ashish Gupta, Jiawei Cao, Mitchelle Rasquinha, Siim Põder, Wojciech Stokowiec, Nicholas Roth, Guowang Li, Michaël Sander, Joshua Kessinger, Vihan Jain, Edward Loper, Wonpyo Park, Michal Yarom, Liqun Cheng, Guru Guruganesh, Kanishka Rao, Yan Li, Catarina Barros, Mikhail Sushkov, Chun-Sung Ferng, Rohin Shah, Ophir Aharoni, Ravin Kumar, Tim McConnell, Peiran Li, Chen Wang, Fernando Pereira, Craig Swanson, Fayaz Jamil, Yan Xiong, Anitha Vijayakumar, Prakash Shroff, Kedar Soparkar, Jindong Gu, Livio Baldini Soares, Eric Wang, Kushal Majmundar, Aurora Wei, Kai Bailey, Nora Kassner, Chizu Kawamoto, Goran Žužić, Victor Gomes, Abhirut Gupta, Michael Guzman, Ishita Dasgupta, Xinyi Bai, Zhufeng Pan, Francesco Piccinno, Hadas Natalie Vogel, Octavio Ponce, Adrian Hutter, Paul Chang, Pan-Pan Jiang, Ionel Gog, Vlad Ionescu, James Manyika, Fabian Pedregosa, Harry Ragan, Zach Behrman, Ryan Mullins, Coline Devin, Aroonalok Pyne, Swapnil Gawde, Martin Chadwick, Yiming Gu, Sasan Tavakkol, Andy Twigg, Naman Goyal, Ndidi Elue, Anna Goldie, Srinivasan Venkatachary, Hongliang Fei, Ziqiang Feng, Marvin Ritter, Isabel Leal, Sudeep Dasari, Pei Sun, Alif Raditya Rochman, Brendan O’Donoghue, Yuchen Liu, Jim Sproch, Kai Chen, Natalie Clay, Slav Petrov, Sailesh 
Sidhwani, Ioana Mihailescu, Alex Panagopoulos, AJ Piergiovanni, Yunfei Bai, George Powell, Deep Karkhanis, Trevor Yacovone, Petr Mitrichev, Joe Kovac, Dave Uthus, Amir Yazdanbakhsh, David Amos, Steven Zheng, Bing Zhang, Jin Miao, Bhuvana Ramabhadran, Soroush Radpour, Shantanu Thakoor, Josh Newlan, Oran Lang, Orion Jankowski, Shikhar Bharadwaj, Jean-Michel Sarr, Shereen Ashraf, Sneha Mondal, Jun Yan, Ankit Singh Rawat, Sarmishta Velury, Greg Kochanski, Tom Eccles, Franz Och, Abhanshu Sharma, Ethan Mahintorabi, Alex Gurney, Carrie Muir, Vered Cohen, Saksham Thakur, Adam Bloniarz, Asier Mujika, Alexander Pritzel, Paul Caron, Altaf Rahman, Fiona Lang, Yasumasa Onoe, Petar Sirkovic, Jay Hoover, Ying Jian, Pablo Duque, Arun Narayanan, David Soergel, Alex Haig, Loren Maggiore, Shyamal Buch, Josef Dean, Ilya Figotin, Igor Karpov, Shaleen Gupta, Denny Zhou, Muhuan Huang, Ashwin Vaswani, Christopher Semturs, Kaushik Shivakumar, Yu Watanabe, Vinodh Kumar Rajendran, Eva Lu, Yanhan Hou, Wenting Ye, Shikhar Vashishth, Nana Nti, Vytenis Sakenas, Darren Ni, Doug DeCarlo, Michael Bendersky, Sumit Bagri, Nacho Cano, Elijah Peake, Simon Tokumine, Varun Godbole, Carlos Guía, Tanya Lando, Vittorio Selo, Seher Ellis, Danny Tarlow, Daniel Gillick, Alessandro Epasto, Siddhartha Reddy Jonnalagadda, Meng Wei, Meiyan Xie, Ankur Taly, Michela Paganini, Mukund Sundararajan, Daniel Toyama, Ting Yu, Dessie Petrova, Aneesh Pappu, Rohan Agrawal, Senaka Buthpitiya, Justin Frye, Thomas Buschmann, Remi Crocker, Marco Tagliasacchi, Mengchao Wang, Da Huang, Sagi Perel, Brian Wieder, Hideto Kazawa, Weiyue Wang, Jeremy Cole, Himanshu Gupta, Ben Golan, Seojin Bang, Nitish Kulkarni, Ken Franko, Casper Liu, Doug Reid, Sid Dalmia, Jay Whang, Kevin Cen, Prasha Sundaram, Johan Ferret, Berivan Isik, Lucian Ionita, Guan Sun, Anna Shekhawat, Muqthar Mohammad, Philip Pham, Ronny Huang, Karthik Raman, Xingyi Zhou, Ross Mcilroy, Austin Myers, Sheng Peng, Jacob Scott, Paul Covington, Sofia Erell, Pratik Joshi, João 
Gabriel Oliveira, Natasha Noy, Tajwar Nasir, Jake Walker, Vera Axelrod, Tim Dozat, Pu Han, Chun-Te Chu, Eugene Weinstein, Anand Shukla, Shreyas Chandrakaladharan, Petra Poklukar, Bonnie Li, Ye Jin, Prem Eruvbetine, Steven Hansen, Avigail Dabush, Alon Jacovi, Samrat Phatale, Chen Zhu, Steven Baker, Mo Shomrat, Yang Xiao, Jean Pouget-Abadie, Mingyang Zhang, Fanny Wei, Yang Song, Helen King, Yiling Huang, Yun Zhu, Ruoxi Sun, Juliana Vicente Franco, Chu-Cheng Lin, Sho Arora, Hui Li, Vivian Xia, Luke Vilnis, Mariano Schain, Kaiz Alarakyia, Laurel Prince, Aaron Phillips, Caleb Habtegebriel, Luyao Xu, Huan Gui, Santiago Ontanon, Lora Aroyo, Karan Gill, Peggy Lu, Yash Katariya, Dhruv Madeka, Shankar Krishnan, Shubha Srinivas Raghvendra, James Freedman, Yi Tay, Gaurav Menghani, Peter Choy, Nishita Shetty, Dan Abolafia, Doron Kukliansky, Edward Chou, Jared Lichtarge, Ken Burke, Ben Coleman, Dee Guo, Larry Jin, Indro Bhattacharya, Victoria Langston, Yiming Li, Suyog Kotecha, Alex Yakubovich, Xinyun Chen, Petre Petrov, Tolly Powell, Yanzhang He, Corbin Quick, Kanav Garg, Dawsen Hwang, Yang Lu, Srinadh Bhojanapalli, Kristian Kjems, Ramin Mehran, Aaron Archer, Hado van Hasselt, Ashwin Balakrishna, JK Kearns, Meiqi Guo, Jason Riesa, Mikita Sazanovich, Xu Gao, Chris Sauer, Chengrun Yang, XiangHai Sheng, Thomas Jimma, Wouter Van Gansbeke, Vitaly Nikolaev, Wei Wei, Katie Millican, Ruizhe Zhao, Justin Snyder, Levent Bolelli, Maura O’Brien, Shawn Xu, Fei Xia, Wentao Yuan, Arvind Neelakantan, David Barker, Sachin Yadav, Hannah Kirkwood, Farooq Ahmad, Joel Wee, Jordan Grimstad, Boyu Wang, Matthew Wiethoff, Shane Settle, Miaosen Wang, Charles Blundell, Jingjing Chen, Chris Duvarney, Grace Hu, Olaf Ronneberger, Alex Lee, Yuanzhen Li, Abhishek Chakladar, Alena Butryna, Georgios Evangelopoulos, Guillaume Desjardins, Jonni Kanerva, Henry Wang, Averi Nowak, Nick Li, Alyssa Loo, Art Khurshudov, Laurent El Shafey, Nagabhushan Baddi, Karel Lenc, Yasaman Razeghi, Tom Lieber, Amer Sinha, Xiao Ma, 
Yao Su, James Huang, Asahi Ushio, Hanna Klimczak-Plucińska, Kareem Mohamed, JD Chen, Simon Osindero, Stav Ginzburg, Lampros Lamprou, Vasilisa Bashlovkina, Duc-Hieu Tran, Ali Khodaei, Ankit Anand, Yixian Di, Ramy Eskander, Manish Reddy Vuyyuru, Jasmine Liu, Aishwarya Kamath, Roman Goldenberg, Mathias Bellaiche, Juliette Pluto, Bill Rosgen, Hassan Mansoor, William Wong, Suhas Ganesh, Eric Bailey, Scott Baird, Dan Deutsch, Jinoo Baek, Xuhui Jia, Chansoo Lee, Abe Friesen, Nathaniel Braun, Kate Lee, Amayika Panda, Steven M. Hernandez, Duncan Williams, Jianqiao Liu, Ethan Liang, Arnaud Autef, Emily Pitler, Deepali Jain, Phoebe Kirk, Oskar Bunyan, Jaume Sanchez Elias, Tongxin Yin, Machel Reid, Aedan Pope, Nikita Putikhin, Bidisha Samanta, Sergio Guadarrama, Dahun Kim, Simon Rowe, Marcella Valentine, Geng Yan, Alex Salcianu, David Silver, Gan Song, Richa Singh, Shuai Ye, Hannah DeBalsi, Majd Al Merey, Eran Ofek, Albert Webson, Shibl Mourad, Ashwin Kakarla, Silvio Lattanzi, Nick Roy, Evgeny Sluzhaev, Christina Butterfield, Alessio Tonioni, Nathan Waters, Sudhindra Kopalle, Jason Chase, James Cohan, Girish Ramchandra Rao, Robert Berry, Michael Voznesensky, Shuguang Hu, Kristen Chiafullo, Sharat Chikkerur, George Scrivener, Ivy Zheng, Jeremy Wiesner, Wolfgang Macherey, Timothy Lillicrap, Fei Liu, Brian Walker, David Welling, Elinor Davies, Yangsibo Huang, Lijie Ren, Nir Shabat, Alessandro Agostini, Mariko Iinuma, Dustin Zelle, Rohit Sathyanarayana, Andrea D’olimpio, Morgan Redshaw, Matt Ginsberg, Ashwin Murthy, Mark Geller, Tatiana Matejovicova, Ayan Chakrabarti, Ryan Julian, Christine Chan, Qiong Hu, Daniel Jarrett, Manu Agarwal, Jeshwanth Challagundla, Tao Li, Sandeep Tata, Wen Ding, Maya Meng, Zhuyun Dai, Giulia Vezzani, Shefali Garg, Jannis Bulian, Mary Jasarevic, Honglong Cai, Harish Rajamani, Adam Santoro, Florian Hartmann, Chen Liang, Bartek Perz, Apoorv Jindal, Fan Bu, Sungyong Seo, Ryan Poplin, Adrian Goedeckemeyer, Badih Ghazi, Nikhil Khadke, Leon Liu, Kevin Mather, 
Mingda Zhang, Ali Shah, Alex Chen, Jinliang Wei, Keshav Shivam, Yuan Cao, Donghyun Cho, Angelo Scorza Scarpati, Michael Moffitt, Clara Barbu, Ivan Jurin, Ming-Wei Chang, Hongbin Liu, Hao Zheng, Shachi Dave, Christine Kaeser-Chen, Xiaobin Yu, Alvin Abdagic, Lucas Gonzalez, Yanping Huang, Peilin Zhong, Cordelia Schmid, Bryce Petrini, Alex Wertheim, Jifan Zhu, Hoang Nguyen, Kaiyang Ji, Yanqi Zhou, Tao Zhou, Fangxiaoyu Feng, Regev Cohen, David Rim, Shubham Milind Phal, Petko Georgiev, Ariel Brand, Yue Ma, Wei Li, Somit Gupta, Chao Wang, Pavel Dubov, Jean Tarbouriech, Kingshuk Majumder, Huijian Li, Norman Rink, Apurv Suman, Yang Guo, Yinghao Sun, Arun Nair, Xiaowei Xu, Mohamed Elhawaty, Rodrigo Cabrera, Guangxing Han, Julian Eisenschlos, Junwen Bai, Yuqi Li, Yamini Bansal, Thibault Sellam, Mina Khan, Hung Nguyen, Justin Mao-Jones, Nikos Parotsidis, Jake Marcus, Cindy Fan, Roland Zimmermann, Yony Kochinski, Laura Graesser, Feryal Behbahani, Alvaro Caceres, Michael Riley, Patrick Kane, Sandra Lefdal, Rob Willoughby, Paul Vicol, Lun Wang, Shujian Zhang, Ashleah Gill, Yu Liang, Gautam Prasad, Soroosh Mariooryad, Mehran Kazemi, Zifeng Wang, Kritika Muralidharan, Paul Voigtlaender, Jeffrey Zhao, Huanjie Zhou, Nina D’Souza, Aditi Mavalankar, Séb Arnold, Nick Young, Obaid Sarvana, Chace Lee, Milad Nasr, Tingting Zou, Seokhwan Kim, Lukas Haas, Kaushal Patel, Neslihan Bulut, David Parkinson, Courtney Biles, Dmitry Kalashnikov, Chi Ming To, Aviral Kumar, Jessica Austin, Alex Greve, Lei Zhang, Megha Goel, Yeqing Li, Sergey Yaroshenko, Max Chang, Abhishek Jindal, Geoff Clark, Hagai Taitelbaum, Dale Johnson, Ofir Roval, Jeongwoo Ko, Anhad Mohananey, Christian Schuler, Shenil Dodhia, Ruichao Li, Kazuki Osawa, Claire Cui, Peng Xu, Rushin Shah, Tao Huang, Ela Gruzewska, Nathan Clement, Mudit Verma, Olcan Sercinoglu, Hai Qian, Viral Shah, Masa Yamaguchi, Abhinit Modi, Takahiro Kosakai, Thomas Strohmann, Junhao Zeng, Beliz Gunel, Jun Qian, Austin Tarango, Krzysztof Jastrzębski, Robert 
David, Jyn Shan, Parker Schuh, Kunal Lad, Willi Gierke, Mukundan Madhavan, Xinyi Chen, Mark Kurzeja, Rebeca Santamaria-Fernandez, Dawn Chen, Alexandra Cordell, Yuri Chervonyi, Frankie Garcia, Nithish Kannen, Vincent Perot, Nan Ding, Shlomi Cohen-Ganor, Victor Lavrenko, Junru Wu, Georgie Evans, Cicero Nogueira dos Santos, Madhavi Sewak, Ashley Brown, Andrew Hard, Joan Puigcerver, Zeyu Zheng, Yizhong Liang, Evgeny Gladchenko, Reeve Ingle, Uri First, Pierre Sermanet, Charlotte Magister, Mihajlo Velimirović, Sashank Reddi, Susanna Ricco, Eirikur Agustsson, Hartwig Adam, Nir Levine, David Gaddy, Dan Holtmann-Rice, Xuanhui Wang, Ashutosh Sathe, Abhijit Guha Roy, Blaž Bratanič, Alen Carin, Harsh Mehta, Silvano Bonacina, Nicola De Cao, Mara Finkelstein, Verena Rieser, Xinyi Wu, Florent Altché, Dylan Scandinaro, Li Li, Nino Vieillard, Nikhil Sethi, Garrett Tanzer, Zhi Xing, Shibo Wang, Parul Bhatia, Gui Citovsky, Thomas Anthony, Sharon Lin, Tianze Shi, Shoshana Jakobovits, Gena Gibson, Raj Apte, Lisa Lee, Mingqing Chen, Arunkumar Byravan, Petros Maniatis, Kellie Webster, Andrew Dai, Pu-Chin Chen, Jiaqi Pan, Asya Fadeeva, Zach Gleicher, Thang Luong, Niket Kumar Bhumihar

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

在本报告中,我们介绍Gemini 2.X模式家庭:Gemini 2.5 Pro 和 Gemini 2.5 Flash,以及我们先前的Gemini 2.0 Flash 和 Flash-Lite 模型。Gemini 2.5 Pro是我们目前最有能力的模型,除了在边境编码和推理基准方面实现SoTA业绩外,它除了其令人难以置信的编码和推理技能外,Gemini 2.5 Pro是一个思维模型,它擅长多式理解,现在能够处理多达3小时的视频内容。它的独特结合了长背景、多式和推理能力,可以打开新的代理工作流程。Gemini 2.5 闪电提供了精良的推理能力,满足了计算和延时要求的一小部分,Gemini 2.0 Flash和闪光-Lite在低长期和成本方面提供高性能。综合起来,Gemini 2.X模型的生成跨越了模型能力相对于成本的全Pareto边界,使用户能够探索可能解决复杂代理问题的范围。


Article 228

Title@2025-07-22 (2): MMS Player: an open source software for parametric data-driven animation of Sign Language avatars

Title: MMS Player: an open source software for parametric data-driven animation of Sign Language avatars MMS Player: eine Open-Source-Software für parametrische datengesteuerte Animation von Sign Language Avataren MMS MMS 播放器: 一个用于模拟数据驱动的手语阿凡达动画的开放源码软件 2507.16463v1

Authors (3): Fabrizio Nunnari, Shailesh Mishra, Patrick Gebhard

This paper describes the MMS-Player, open-source software that synthesises sign language animations from a novel sign language representation format called MMS (MultiModal Signstream). MMS enhances gloss-based representations by adding information on the parallel execution of signs, timing, and inflections. The implementation consists of Python scripts for the popular Blender 3D authoring tool and can be invoked via the command line or an HTTP API. Animations can be rendered as videos or exported in other popular 3D animation exchange formats. The software is freely available under the GPL-3.0 license at https://github.com/DFKI-SignLanguage/MMS-Player.

本文介绍MMS-Player(MMS-Player),这是一个开放源码软件,能够从称为MMS(MultiModal Signstream)的新手语代表格式中合成手语动画。MMS通过增加关于符号、时间和反射平行执行的信息,加强了基于光滑的表述方式。实施工具包括流行的Blender 3D 作者工具的Python脚本,可以通过指令线或 HTTP API 来引用。动画可以作为视频或以其他流行的 3D 动画交换格式输出。该软件可免费使用GPL-3.0 许可证,网址是 https://github.com/DFKI-SignLanguage/MMS-Player。


Article 229

Title@2025-07-22 (2): Towards Enforcing Company Policy Adherence in Agentic Workflows

Title: Towards Enforcing Company Policy Adherence in Agentic Workflows Auf dem Weg zur Stärkung der unternehmenspolitischen Einhaltung von Agent-Workflows 致力于加强公司政策,坚持对制剂性工作流程的政策 2507.16459v1

Authors (6): Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor

Large Language Model (LLM) agents hold promise as a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study, we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline build-time stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging $\tau$-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.

大型语言模型代理商对传统业务流程自动化的灵活和可扩展的替代方案抱有希望,但努力可靠地遵循复杂的公司政策。在本研究中,我们引入了一个在代理工作流程中强制遵守商业政策的确定性、透明和模块化的框架。我们的方法分为两个阶段:(1) 将政策文件汇编成与工具使用有关的可核查的保安守则的离线建设时间阶段;(2) 运行时间整合,这些看守确保在每个代理商采取行动之前遵守规则。我们展示了我们在具有挑战性的$tau$-bunch Airlines域上的做法,显示了在政策执行方面令人鼓舞的初步成果,并进一步概述了现实世界部署所面临的关键挑战。
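
The two phases in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `GuardedTool` and `PolicyViolation`, the cancellation rule, and the toy reservation state are all hypothetical, and the guard is hand-written here, whereas the paper describes compiling guards from policy documents at build time. The sketch only shows the runtime half, where every guard attached to a tool runs before the tool itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class PolicyViolation(Exception):
    """Raised when a guard rejects a tool call (hypothetical name)."""


# A guard inspects the call arguments and the current state, and raises on violation.
Guard = Callable[[Dict[str, Any], Dict[str, Any]], None]


@dataclass
class GuardedTool:
    """Wraps a tool so that all of its guards run before the tool executes."""
    name: str
    func: Callable[..., Any]
    guards: List[Guard]

    def __call__(self, state: Dict[str, Any], **args: Any) -> Any:
        for guard in self.guards:
            guard(args, state)  # deterministic check; raises PolicyViolation
        return self.func(state, **args)


# Stand-in for build-time output: one guard for an invented airline rule.
def guard_not_departed(args: Dict[str, Any], state: Dict[str, Any]) -> None:
    reservation = state["reservations"][args["reservation_id"]]
    if reservation["departed"]:
        raise PolicyViolation("departed flights cannot be cancelled")


def cancel_reservation(state: Dict[str, Any], reservation_id: str) -> str:
    state["reservations"][reservation_id]["status"] = "cancelled"
    return "cancelled"


cancel = GuardedTool("cancel_reservation", cancel_reservation, [guard_not_departed])

state = {
    "reservations": {
        "R1": {"departed": False, "status": "active"},
        "R2": {"departed": True, "status": "active"},
    }
}

outcome_r1 = cancel(state, reservation_id="R1")  # guard passes, tool runs
try:
    cancel(state, reservation_id="R2")           # guard blocks the call
    outcome_r2 = "executed"
except PolicyViolation as err:
    outcome_r2 = f"blocked: {err}"
```

Keeping the guards as ordinary functions outside the agent is what makes the check deterministic and inspectable in this sketch: the agent can propose any call, but the tool only runs if every guard returns without raising.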


Article 230

Title@2025-07-22 (2): Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch

Title: Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch Dutch CrowS-Pairs: Anpassung eines Challenge Datasets zur Messung sozialer Biasen in Sprachmodellen für Niederländisch 荷兰语人群对称:调整一套挑战数据集,以衡量荷兰语语言模式中的社会两边状况 2507.16442v1

Authors (2): Elza Strazda, Gerasimos Spanakis

Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure that they are safe and fair. Recently, considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English. This paper introduces a Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as sexual orientation, gender, and disability. Each pair consists of contrasting sentences, one concerning a disadvantaged group and the other an advantaged group. Using the Dutch CrowS-Pairs dataset, we show that various language models (BERTje, RobBERT, multilingual BERT, GEITje, and Mistral-7B) exhibit substantial bias across the bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was also evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models; English models exhibit the most bias, whereas Dutch models exhibit the least. Results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.

警告:本文载有可能令人不满的冒犯性陈规定型的明确声明; 语言模式容易显示偏见,进一步扩大不公平和有害的陈规定型; 鉴于这些模式的普及程度和广泛应用,有必要确保安全和公正的语言模式; 由于最近相当重视衡量语言模式中的偏见,但大多数研究都只注重英语; 引入了用于衡量荷兰语言模式偏见的荷兰版CrowS-Pairs数据集; 由此形成的数据集包括1463对包含9类偏见的句子,如性取向、性别和残疾等; 判刑配对由对比性判决组成,其中一项判决涉及弱势群体和其他优势群体; 利用荷兰的CrowS-Pairs数据集,我们展示了各种语言模式,即BERTje、RobBERT、多语BERT、GEITje和Mistral7B等, 在不同偏见类别中表现出重大偏见模式; 使用CrowS-Pairs数据集的英语和法语版本,以英文(BER和Robreta)和法文(FARTA)中显示最明显的偏见程度。
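
The abstract above does not say how bias is scored. As a rough illustration, the sketch below follows the convention of the original English CrowS-Pairs benchmark: the percentage of pairs for which the model scores the more stereotypical sentence higher, with 50 meaning no measured preference. Whether the Dutch study uses exactly this metric is an assumption. The function name `bias_score`, the placeholder sentence labels, and the numeric scores are all invented; a real evaluation would score actual sentences with a model-derived (pseudo-)log-likelihood.

```python
from typing import Callable, List, Tuple


def bias_score(pairs: List[Tuple[str, str]], score: Callable[[str], float]) -> float:
    """Percentage of (stereotypical, contrasting) pairs where the
    stereotypical sentence receives the higher score."""
    preferred = sum(1 for stereo, contrast in pairs if score(stereo) > score(contrast))
    return 100.0 * preferred / len(pairs)


# Placeholder pair labels; real pairs are full contrasting sentences.
toy_pairs = [
    ("pair1-stereo", "pair1-contrast"),
    ("pair2-stereo", "pair2-contrast"),
    ("pair3-stereo", "pair3-contrast"),
    ("pair4-stereo", "pair4-contrast"),
]

# Invented scores standing in for model log-likelihoods.
toy_scores = {
    "pair1-stereo": -10.0, "pair1-contrast": -12.0,
    "pair2-stereo": -9.5,  "pair2-contrast": -9.0,
    "pair3-stereo": -7.0,  "pair3-contrast": -8.5,
    "pair4-stereo": -11.0, "pair4-contrast": -13.0,
}

result = bias_score(toy_pairs, toy_scores.__getitem__)
```

On these toy numbers the stereotypical sentence is preferred in three of the four pairs, so `result` is 75.0.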


Article 231

Title@2025-07-22 (2): HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Title: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing HausaNLP: Aktueller Status, Herausforderungen und Zukunftsrichtung für Hausa Natural Language Processing 豪萨民族语言:豪萨自然语言处理的现状、挑战和未来方向 2505.14311v3

Authors (12): Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Tajuddeen Gwadabe, Kenneth Church, Vukosi Marivate

Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet Hausa remains an understudied, low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

最近几年来,豪萨自然语言处理(Hausa自然语言处理)(NLP)日益受到越来越多的关注,尽管全世界有超过1.2亿名第一语言(L1)和8 000万第二语言(L2)的发言者,但作为一种低资源语言的学习仍然不足;虽然在高资源语言方面取得了显著进展,但豪萨自然语言处理(Hausa自然语言处理)面临持续的挑战,包括开放源数据集有限和模型代表性不足;本文件概述了豪萨自然语言处理(Hausa自然语言处理)的现状,系统审查了现有的资源、研究贡献和基本国家语言处理任务之间的差距:文本分类、机器翻译、命名实体识别、语音识别和问题回答。我们介绍了豪萨语言分类(https://catalog.hausanlp.org),这是一份集成数据集、工具和研究工作,以加强无障碍和进一步发展。此外,我们讨论了将豪萨人融入大型语言模型(LMSLM)面临的挑战,解决亚非最佳象征性象征性和辩证变问题。我们提出了战略研究方向,强调数据集的扩展、改进语言建模方法,并加强社区合作,以推进Hausa-LP的伟大研究基础。


Article 232

Title@2025-07-22 (2): Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

Title: Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models Hierarchische Sicherheits-Neuausrichtung: Leichte Wiederherstellung der Sicherheit in beschnittenen großen Vision-Sprachen-Modellen 等级安全调整:谨慎大型视觉语言模型中轻度安全恢复 2505.16104v2

Authors (6): Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang

With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

随着大型视野-语言模型规模的扩大,旨在压缩在资源受限制环境中部署模型的网络运行技术已引起人们的极大关注,然而,我们认为,运行往往会导致安全性能的退化。为了解决这一问题,我们提出了一个新颖和轻量级的方法,称为 “ 高度安全调整 “ (HSR),其运作方式是首先量化每个受关注者对安全的贡献,确定最关键的因素,然后有选择地直接恢复这些在维护安全方面发挥关键作用的受关注者内部的神经元。这一过程按等级调整了经运行的液压低分辨率模型的安全性能,从关注的高层到神经性水平。我们验证了各种模型和运行战略的人类智能性能,不断取得显著的安全性能改进。据我们所知,这是在运行后恢复LVMs的安全性工作的第一个明确重点。


Article 233

Title@2025-07-22 (2): Atomic Calibration of LLMs in Long-Form Generations

Title: Atomic Calibration of LLMs in Long-Form Generations Atomkalibrierung von LLMs in langen Generationen 长代人长龄人LLMs的原子校准 2410.13246v2

Authors (7): Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier

Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs’ trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.

大型语言模型(LLMs)往往受到幻觉的影响,给现实世界的应用带来了重大挑战。信心校准,它估计模型预测的潜在不确定性,对于提高LLMs的可信度至关重要。现有的LLM校准研究主要侧重于短期任务,在反应层面(宏观校准)提供单一的自信分数。然而,这一方法对长代人来说是不够的,长代人的答复往往包含更复杂的语句,可能包含准确和不准确的信息。因此,我们引入了原子校准,这是一种新颖的方法,通过打破对原子索赔的长期反应,在细微的层次上评估事实质量校准。我们将信任感应方法分类为歧视性和基因性类型,并表明其组合可以加强校准。我们对各种LMs和数据集的广泛实验表明,原子校准非常适合长式的生成,还可以改进宏观校准结果。此外,原子校准还揭示了LM公司在整个生成过程中的可信度。
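
The contrast between response-level (macro) and claim-level (atomic) calibration can be illustrated with a small sketch. It assumes a long response has already been split into atomic claims, each carrying a discriminative confidence, a generative confidence, and a factuality label. The `AtomicClaim` type, the plain averaging of the two confidences, and the 5-bin expected calibration error are illustrative assumptions, not the procedure reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class AtomicClaim:
    text: str
    disc_conf: float   # hypothetical discriminative confidence in [0, 1]
    gen_conf: float    # hypothetical generative confidence in [0, 1]
    is_correct: bool   # factuality label for this claim


def combined_confidence(claim):
    """Assumed combination rule: unweighted mean of the two confidence types."""
    return 0.5 * (claim.disc_conf + claim.gen_conf)


def expected_calibration_error(confs, labels, n_bins=5):
    """Standard binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    total = len(confs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        last = b == n_bins - 1
        # The last bin is closed on the right so that confidence 1.0 is counted.
        idx = [i for i, c in enumerate(confs) if lo <= c < hi or (last and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confs[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if labels[i]) / len(idx)
        ece += len(idx) / total * abs(avg_conf - accuracy)
    return ece


# One long response, decomposed into atomic claims (toy values).
claims = [
    AtomicClaim("Water boils at 100 °C at sea level.", 0.95, 0.90, True),
    AtomicClaim("The Moon is larger than Earth.", 0.30, 0.15, False),
    AtomicClaim("A triangle has three sides.", 0.90, 0.85, True),
]
atomic_ece = expected_calibration_error(
    [combined_confidence(c) for c in claims],
    [c.is_correct for c in claims],
)
print(f"atomic ECE: {atomic_ece:.3f}")
```

Scoring each claim separately is what lets a response that mixes accurate and inaccurate statements receive different confidences for each part, which a single response-level score cannot express.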


Article 234

Title@2025-07-22 (2): Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Title: Synthetic Data Generation Using Large Language Models: Advances in Text and Code Synthetische Datengenerierung mit großen Sprachmodellen: Fortschritte in Text und Code 使用大语言模式生成合成数据:文本和代码的进步 2503.14023v2

Authors (3): Mihai Nadas, Laura Diosan, Andreea Tomescu

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair) through automated verification of functional correctness. Alongside potential benefits - cost-effectiveness, broad coverage, and controllable diversity - we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.

这份调查审查了大型语言模型(LLMS)是如何在自然语言和代码领域改变合成培训数据生成的。通过制作人工但与任务相关的实例,这些模型可以大大扩大甚至取代真实世界数据集,特别是在标签数据稀缺、昂贵或敏感的情况下。本文调查了最近在利用LLMS制作合成文本和代码方面取得的进展,强调了诸如快速生成、检索管道和迭接自我精炼等关键技术。我们研究了这些方法如何通过自动核实功能正确性,丰富低资源任务(如分类、问答),促进以代码为中心的应用(如指令调整、代码翻译、错误修复)。除了潜在的好处外,我们讨论了随之而来的挑战,包括生成文本中的实际不准确、光滑动或分布现实主义以及偏差补充风险。拟议缓解战略包括过滤和加权合成产出,以及在代码领域的执行反馈方面加强学习。我们最后,我们强调开放研究方向,例如加速合成质量保障,同时强调自动化综合、快速数据发展,同时强调不断增长的标准化质量框架。


Article 235

Title@2025-07-22 (2): Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study

Title: Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study Beyond English: Bewertung der automatisierten Messung von Moralfundamenten im nicht-englischen Diskurs mit einer chinesischen Fallstudie 英文之后:评价非英语论文中道德基础的自动计量,与中国案例研究 2502.02451v3

Authors (2): Calvin Yixiang Cheng, Scott A Hale

This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.

由于大部分资源主要用于英语,因此道德基础理论的跨语言应用仍然有限,本文件用中文作为案例研究,评估了将英语资源应用于机器翻译文本、当地语言词汇、多语言模型和大型语言模型(LLMS)以测量非英语文本中的道德基础(MFs)的有效性,结果表明机器翻译和地方词汇方法不足以进行复杂的道德评估,常常导致文化信息的重大损失。相比之下,多语言模型和LLMS展示了可靠的跨语言的学习,而LLMs在数据效率方面表现优异。重要的是,这项研究还强调了自动MF评估的人工操作验证的必要性,因为最先进的模型可能忽略跨语言测量中的文化差异。研究结果强调了LMMS在跨语言M测量和其他复杂的多语言计算编码任务方面的潜力。


Article 236

Title@2025-07-22 (2): PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

Title: PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning PromptAL: Sample-Aware Dynamische Soft-Prompts für wenig heißes aktives Lernen 提示: 用于少点热积极学习的样本- 软件动态软提示 2507.16424v1

Authors (6): Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma

Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model’s predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.

积极学习(AL) 旨在优化模型培训,并通过选择最有信息的标签标本降低批注成本。 通常, AL 方法依靠标签数据的经验分布来定义决定边界, 并进行不确定性或多样性估计, 并随后确定潜在的高质量样本。 在几个发照片的情景中, 经验分布往往与目标分布大不相同, 导致决定边界偏离最佳位置。 但是, 现有方法忽略了未贴标签样本的作用, 以更好地与目标分布保持一致, 导致不优化的快速决定边界和选择不适当代表目标分布的样本。 为了解决这个问题, 我们提议了一个混合的 AL 框架, 名为\ textbf{ Promptal} ( Sample- Award- Award- Aft- Aft- Aft- Solift- textbf{Prompts}) 。 实验分布方法忽视了未贴标签的样本, 使每个未贴标签的数据点在将当前经验分布与目标分布相匹配方面的贡献, 从而优化了决定边界界限 。 具体地, Jearalalalalalal- f 第一次 将 imaldo a deliver dislational disal disal dislation disal disal dal disal dal disal dismaismaislations 和以更能将软的精确地数据整合成为全球决定 。


Article 237

Title@2025-07-22 (2): GG-BBQ: German Gender Bias Benchmark for Question Answering

Title: GG-BBQ: German Gender Bias Benchmark for Question Answering GG-BBQ: Deutscher Gender-Bias-Benchmark für Fragenbeantwortung GGG-BBQ:德国回答问题性别比基准 2507.16410v1

Authors (6): Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan

Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model’s predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

在自然语言处理(NLP)范围内,公平评价通常与偏见评估和减少相关伤害的评估相联系。在这方面,评价通常采用基准数据集,例如为衡量模型预测的各个方面,包括性别认同方面的偏见而创建的问答,用于衡量模型预测中的偏见,在自然语言处理(NLP)范围内,公平评价通常与评估偏见和减少相关伤害相联系。在这方面,评价通常使用基准数据集进行,例如用于测量模型预测的各个方面,包括性别认同;在我们的工作中,我们用Parrish等人(2022年)的Bias问题回答基准(LLMs)作为参考,评价德大语言模型中的性别偏见。我们的最后数据集包括两个子集:Subset-I,由与性别认同有关的小组术语组成,和Subset-II,在一位语言专家的帮助下,机器翻译模板中的错误经过手动审查和纠正。我们认为,在创建性别偏见评价数据集时,由于从英文翻译成德文到德文和德文等语言等语言的局限,对翻译进行手工修改至关重要。我们评估了现有的德文成绩报告、新德文成绩和新德文成绩报告所用的几个LMs。
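
For readers unfamiliar with how a BBQ-style bias score is computed, the following is a minimal sketch. It follows the score definitions usually attributed to the original BBQ benchmark (Parrish et al., 2022); whether GG-BBQ uses exactly this variant is an assumption, and the function names and counts are hypothetical.

```python
def bias_score_disambiguated(n_biased, n_non_unknown):
    """Score in [-1, 1]: +1 means every non-"unknown" answer follows the stereotype,
    -1 means every such answer goes against it, 0 means no measurable bias."""
    if n_non_unknown == 0:
        return 0.0  # assumed convention when the model always answers "unknown"
    return 2.0 * n_biased / n_non_unknown - 1.0


def bias_score_ambiguous(accuracy, raw_score):
    """In ambiguous contexts the raw score is scaled by the error rate, so a model
    that correctly answers "unknown" most of the time is penalised less."""
    return (1.0 - accuracy) * raw_score


# Toy counts: 70 of 100 non-"unknown" answers align with the stereotype.
s_dis = bias_score_disambiguated(n_biased=70, n_non_unknown=100)
s_amb = bias_score_ambiguous(accuracy=0.8, raw_score=s_dis)
print(s_dis, s_amb)
```

A positive score indicates answers aligned with existing social stereotypes and a negative score indicates answers against them, which matches the abstract's observation that models show bias in both directions.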


Article 238

Title@2025-07-22 (2): Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Title: Routine: A Structural Planning Framework for LLM Agent System in Enterprise Routine: Ein Strukturplanungsrahmen für LLM Agent System in Unternehmen 常规:企业LLM代理系统结构规划框架 2507.14447v2

Authors (16): Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent’s execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model’s accuracy to 95.5%, approaching GPT-4o’s performance. These results highlight Routine’s effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.

在企业环境中部署代理系统往往受到若干挑战的阻碍:共同模型缺乏具体领域的流程知识,导致计划混乱,缺少关键工具,执行稳定性差。为此,本文件介绍了常规的多步骤代理规划框架,这是一个多步的代理规划框架,设计有清晰的结构、明确的指示和无缝参数,用于指导代理执行模块执行具有高度稳定性的多步工具呼叫任务。在现实世界企业情景下进行的评价中,例行做法大大提高了模式工具呼叫的执行准确性,GPT-4o的性能从41.1%提高到96.3%,以及Qwent14-14B的性能从32.6%提高到83.3%。我们进一步构建了一个遵循规则的培训数据集,并精细调整了Qwent3-14B,导致具体情景评价的准确性提高到88.2%,表明对执行计划的遵守情况得到了更好的遵守。此外,我们采用了基于常规的蒸馏方法,以创建具体情景、多步式工具呼叫数据集。在本次更新的数据集中,将模型的精确性向95.5%的快速性、不断升级的IMT-测试工具展示了我们的最新性、不断升级的流程。
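
The combination of a clear step structure, explicit instructions, and parameter passing described in the abstract can be sketched as follows. The step schema, the `$name` reference syntax, and the two tools are hypothetical and are not taken from the Routine paper; the sketch only shows how outputs of earlier steps can be fed into later tool calls.

```python
def resolve(value, memory):
    """Replace a "$step_id" reference with the stored output of that step."""
    if isinstance(value, str) and value.startswith("$"):
        return memory[value[1:]]
    return value


def run_routine(steps, tools):
    """Execute steps in order, storing each tool result under the step's id."""
    memory = {}
    for step in steps:
        args = {name: resolve(val, memory) for name, val in step["args"].items()}
        memory[step["id"]] = tools[step["tool"]](**args)
    return memory


# Hypothetical enterprise tools, stubbed with in-memory lookups.
tools = {
    "lookup_employee": lambda name: {"alice": 17}[name],
    "get_leave_balance": lambda employee_id: {17: 12}[employee_id],
}

# A two-step plan: the second step consumes the first step's output.
routine = [
    {"id": "emp", "tool": "lookup_employee", "args": {"name": "alice"}},
    {"id": "balance", "tool": "get_leave_balance", "args": {"employee_id": "$emp"}},
]

result = run_routine(routine, tools)
print(result["balance"])
```

Keeping the plan as data rather than free text means the executor never has to guess tool names or arguments, which is one plausible reading of why a structured plan improves execution stability.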


Article 239

Title@2025-07-22 (2): Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Title: Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model Multimodale Vorhersage von Sparse Intraoperativen Hypotonieereignissen durch Sprachmodell 以语言模式为动力的草散的不合作和不连续活动多式预报 2505.22116v3

Authors (8): Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong Zhou, Qi Liu, Yanhu Xie

Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.

在一般麻醉下经常出现内爆性低压(IOH),并且与心肌损伤和死亡率上升等不利结果密切相关。尽管IOH的预测意义重大,但它受到事件夸大以及将不同病人的静态和动态数据整合起来的挑战的阻碍。在本文中,我们提议了一个多式语言模型框架\ textbf{IOHFuseLM}。为了准确识别和区分稀少的低温事件,我们利用了两阶段培训战略。第一阶段涉及通过传播方法扩大的IOH生理时间序列的适应性培训领域,从而增强模型对与低温相关的模式的敏感性。随后,对原临床数据集进行了任务调整,以进一步提高将标准敏感度与低温状态区别开来的能力。为了能够使每个病人的多式混合,我们将结构化的临床描述与相应的生理时间序列相匹配。此外,我们将静态病人属性转换为结构化文本,以丰富个人化信息。随后,对原始临床数据集进行实验性评估,以进一步区分常规性和低温状态状态。我们现有的IOHFMRDFS/MSDFDFDDFSDRFSDRFS 正在精确地展示其现有决策情景。


Article 240

Title@2025-07-22 (2): Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

Title: Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts Autonome Datenauswahl mit Zero-shot Generative Klassifikatoren für mathematische Texte 具有数学文本零光生成分类器的自动数据选择 2402.07625v7

Authors (4): Yifan Zhang, Yifan Luo, Yang Yuan, Andrew C Yao

We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot “generative classifiers” to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model’s logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.

我们推出自动数据选择(AutoDS) , 这是一种将基本语言模型本身作为零光“ 遗传分类器” 来自动翻译高质量数学文本的方法。 与以前要求人为说明或培训专用数据过滤器的方法不同, AutoDS 完全依靠模型的登录来确定某一特定通道是否具有数学上的信息和教育性。 通过将AutoDS纳入持续的培训前管道,我们大大提升了具有挑战性的数学基准(MATH、GSM8K和BBH)的下游性能,同时使用远比以往少得多的符号。 从目前来看,我们的方法在强化基线的预培训标语效率上取得了双重改进,强调了在加强数学推理过程中自行选择数据的潜力。 我们发行了我们经过校准的AutoMathText数据集,以促进未来在自动特定域数据曲线上的研究。 AutoMatext数据集可在https://huggingface.co/datasts/math-ai/AutoMatthText上查阅。 代码可在 https://github.com/yfanzhah- promatthTextText查阅。
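
Using a base model's logits as a zero-shot classifier can be sketched as a two-way softmax over the logits of an affirmative and a negative answer token. The sketch assumes the model has been prompted with a yes/no question about the passage and that the two logits have already been read off; the token choice, the example logits, and the 0.6 threshold are hypothetical and not values from the paper.

```python
import math


def lm_score(logit_yes, logit_no):
    """Two-way softmax: probability mass on "YES" relative to "YES" + "NO".
    Subtracting the maximum keeps exp() from overflowing on large logits."""
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_passages(scored_passages, threshold=0.6):
    """Keep passages whose score meets an (assumed) threshold."""
    return [text for text, score in scored_passages if score >= threshold]


# Hypothetical (logit_yes, logit_no) pairs for three passages.
passages = {
    "Proof that the square root of 2 is irrational ...": (3.1, 0.4),
    "Top ten celebrity haircuts of the year ...": (-1.2, 2.5),
    "Worked example: solving a quadratic equation ...": (2.0, 0.0),
}
scored = [(text, lm_score(*logits)) for text, logits in passages.items()]
kept = select_passages(scored)
print(len(kept))
```

Because the score is continuous rather than a hard label, the same pass over the corpus supports different quality thresholds without re-running the model.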


Article 241

Title@2025-07-22 (2): Physical models realizing the transformer architecture of large language models

Title: Physical models realizing the transformer architecture of large language models Physikalische Modelle, die die Transformatorenarchitektur großer Sprachmodelle realisieren 实现大型语言模型变压器结构的物理模型 2507.13354v2

Authors (1): Zeqian Chen

The introduction of the transformer architecture in 2017 marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is, and how it works physically. From a physical perspective on modern chips, such as those chips under 28nm, modern intelligent machines should be regarded as open quantum systems beyond conventional statistical systems. Thereby, in this paper, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.

2017年引入变压器结构标志着自然语言处理方面最显著的进步。变压器是一个完全依赖吸引输入和输出之间全球依赖的注意机制的模型结构。 然而,我们认为,我们对变压器是什么及其物理作用的理论理解存在差距。 从现代芯片的物理角度来看,例如28海里以下的芯片,现代智能机器应被视为超越常规统计系统的开放量子系统。 因此,在本文中,我们构建了物理模型,在变压器结构的基础上实现大型语言模型,作为福克空间的开放量子系统,取代希尔伯特象征空间。我们的物理模型是大型语言模型的变压器结构的基础。


Article 242

Title@2025-07-22 (2): Data Processing for the OpenGPT-X Model Family

Title: Data Processing for the OpenGPT-X Model Family Datenverarbeitung für die OpenGPT-X Modellfamilie OpenGPT-X模式家庭数据处理 2410.08800v3

Authors (22): Nicolo’ Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian Küch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Köhler, Johannes Leveling

This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.

本文件全面概述了为开放GPT-X项目开发的数据收集管道,这是旨在创建开放和高性能多语言大语言模型(LLMs)的大规模举措。该项目的目标是提供涵盖所有主要欧洲语言的模型,特别侧重于欧洲联盟内部的现实世界应用。我们解释了所有数据处理步骤,从数据选择和要求定义开始,开始为最终筛选数据编制工作。我们区分了经整理的数据和网络数据,因为每个类别都由不同的管道处理,经整理的数据由最低限度的过滤和网络数据处理,需要广泛的过滤和重复。这种区分指导了两种管道的专门算法解决办法的开发。我们除了描述处理方法外,还深入分析数据集,提高透明度,并与欧洲数据条例保持一致。最后,我们分享了项目期间面临的主要见解和挑战,为今后为LMs大规模多语种数据编制工作提出建议。
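
The split between a lightly filtered curated pipeline and a heavily filtered, deduplicated web pipeline can be sketched as below. The specific heuristics (minimum word count, symbol ratio) and exact-match deduplication by hash are illustrative assumptions; the abstract does not state which filters or deduplication method the project uses.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def minimal_filter(doc):
    """Curated data: only drop empty documents."""
    return bool(doc.strip())


def web_filter(doc, min_words=5, max_symbol_ratio=0.3):
    """Web data: drop very short documents and documents dominated by symbols."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio


def deduplicate(docs):
    """Exact deduplication on normalized text, keeping the first occurrence."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def process(docs, source):
    """Route documents through the pipeline matching their source category."""
    if source == "curated":
        return [d for d in docs if minimal_filter(d)]
    return deduplicate([d for d in docs if web_filter(d)])


web_docs = [
    "The quick brown fox jumps over the dog.",
    "the quick  brown fox jumps over the dog.",
    "$$$ ### !!!",
    "too short",
]
print(process(web_docs, "web"))
```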


Article 243

Title: DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph DCG-SQL: Verbesserung des In-Context-Lernens für Text-zu-SQL mit Deep Contextual Schema Link Graph DCG-SQL:加强内文学习,以便用深背景图示链接图进行文字到SQL的内文学习 2505.19956v2

Authors (5): Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee

Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.

将自然语言问题翻译为SQL查询的文本到 SQL 查询,随着大语言模型(LLMs)的文字学习,将自然语言问题翻译为SQL 查询,现有方法与随机选择的演示相比没有多大改进,但与随机选择的演示相比,绩效表现没有多大改善,在使用较小的LLMs(例如Llama 3.1-8B)时性能显著下降。这表明这些方法在很大程度上依赖于超尺度LLMs的内在能力,而不是有效地检索有用的演示。在本文中,我们提出了有效检索演示和生成SQL查询的新办法。我们构建了一个深背景Sema链接图,其中包含一个问题与其数据库系统项目之间的关键信息和语义关系。这种基于图形的结构使得能够有效地展示文本到SQL样本和检索有用的演示,以便进行内流学习。蜘蛛基准的实验结果显示了我们的方法的有效性,表明在高尺度L生成的液晶体和小型LMs之间不断提高性能和效率。我们的代码可以在 https://gihubbub.Gjkk/GKDC/GKellems.


Article 244

Title@2025-07-22 (2): LLMs syntactically adapt their language use to their conversational partner

Title: LLMs syntactically adapt their language use to their conversational partner LLMs passen ihre Sprachnutzung syntaktisch an ihren Gesprächspartner an LLLMs 共学性调整其语言使用以适应其对话伙伴 2503.07457v2

Authors (3): Florian Kandra, Vera Demberg, Alexander Koller

It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

人们经常看到,在对话中,讲人的语言使用相互一致。 在本文中,我们从经验上研究大型语言模型(LLMs)是否表现出相同的对话适应行为。 我们构建了LLMs之间一系列对话,发现两个LLM代理商最终在对话中做出了更相似的同义选择,这证明现代LLMs至少以最简陋的方式将语言用于对话伙伴。


Article 245

Title@2025-07-22 (2): X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display

Title: X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display X-Intelligence 3.0: Schulung und Bewertung von LLM für Halbleiteranzeige X- Intelligence 3.0: 用于半导体显示的培训和评估说明理由的LLMLM 2507.14430v2

Authors (56): Xiaolin Yan, Yangxing Liu, Jiazhang Zheng, Chi Liu, Mingyu Du, Caisheng Chen, Haoyang Liu, Ming Ding, Yuan Li, Qiuping Liao, Linfeng Li, Zhili Mei, Siyu Wan, Li Li, Ruyi Zhong, Jiangling Yu, Xule Liu, Huihui Hu, Jiameng Yue, Ruohui Cheng, Qi Yang, Liangqing Wu, Ke Zhu, Chi Zhang, Chufei Jing, Yifan Zhou, Yan Liang, Dongdong Li, Zhaohui Wang, Bin Zhao, Mingzhou Wu, Mingzhong Zhou, Peng Du, Zuomin Liao, Chao Dai, Pengfei Liang, Xiaoguang Zhu, Yu Zhang, Yu Gu, Kun Pan, Yuan Wu, Yanqing Guan, Shaojing Wu, Zikang Feng, Xianze Ma, Peishan Cheng, Wenjuan Jiang, Jing Ba, Huihao Yu, Zeping Hu, Yuan Xu, Zhiwei Liu, He Wang, Zhenguo Lin, Ming Liu, Yanhong Meng

Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry’s complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.

大型语言模型(LLMs)最近在推理方面取得了重大进步,并展示了在解决具有挑战性的问题方面的优势;然而,由于缺乏特定领域的培训和专业知识,其在半导体显示行业的效能仍然有限;为弥合这一差距,我们提出了X-Intelligence 3.0,这是专门为半导体显示行业开发的第一种高性能推理模型;该模型旨在为该行业的复杂挑战提供专家级理解和论证;利用经过仔细整理的行业知识库,该模型进行了监督的微调和强化学习,以加强其推理和理解能力;为进一步加快发展,我们实施了模拟专家级评估的自动评价框架;我们还采用了一个特定领域的检索新一代(RAG)机制,在基准数据集方面取得了显著的绩效;尽管其规模相对紧凑近320亿个参数,但X-Intelligligence 3.0在多个评价中超越SOTA DeepSek-R1-671B。


Article 246

Title@2025-07-22 (2): Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Title: Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny Re:Form – Reduzierung menschlicher Priore bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny Re:形式 – – 在可扩展的正式软件核查中减少人类前科,LLL女士:关于Dafny的初步研究 2507.16331v1

Authors (16): Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

现有的基于非正式语言(例如,人文)的大型语言模型(LLMs)在经过强化学习(RL)培训后,面临一个重大挑战:其核查过程既不能提供关键的培训信号,也不能可靠和可扩缩。事实上,普遍存在的大型专有模型很难产生可核查的程序。一个大有希望但基本上未经探索的替代办法是正式语言推理。将LMs置于严格的正式系统之下,使变异模型在正式语言空间(例如,Dafny)运作,能够自动和数学地改进对其推理过程和结果的核查。这一能力对于实现大规模、可靠的正式软件核查至关重要。使用人注解的思维链和其他人类之前的软件很难产生可核实的程序。不幸的是,为监督复杂的方案编制任务提供这种有希望但基本上没有被想象的先例。在这项工作中,我们系统地探索如何用正式语言(Dafny)的变异性模型作为我们试点研究的主要环境,我们的输油管道主要依靠采用一个可自动和可变缩的数据转换阶段,用我们一般的RL格式化的编程,用S-CRLs


Article 247

Title@2025-07-22 (2): SpeLLM: Character-Level Multi-Head Decoding

Title: SpeLLM: Character-Level Multi-Head Decoding SpeLLM: Charakter-Level-Multi-Head-Dekodierung SpeLLM: 职务级别多负责人解码 2507.16323v1

Authors (2): Amit Ben-Artzy, Roy Schwartz

Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention’s quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the $k$ linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.

缩放 LLM 词汇通常用于减少输入序列长度和减轻注意力的二次成本。 然而, 目前的 LLM 结构对这一程序设置了一个关键的瓶颈: 输出投影层的线性尺度与词汇的大小成线性, 使得大量扩展变得不切实际。 我们提议 SpeLLM , 这是一种通过多个输出头预测字符级字符级字符串来分离输入和输出词汇的方法。 在 SpeLLM 中, 每一个美元线性头都同时预测一个单一字符, 使模型能够代表一个使用较小、 独立的线性头的更大产出空间。 我们提出了一个将标准 LLLM 转换为 SpeLLM 的自我蒸馏方法。 我们用四个经过预先训练的LLM 实验展示了他们的 SpeLM 变量在下游任务上具有竞争力, 同时在各种模型中平均将运行时间减少5.1 % 。 我们的方法为降低LLM 成本提供了一条潜在的途径, 同时增加对代表比例不足的语言和域的支持。
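
The multi-head character decoding described in the SpeLLM abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the character set `ALPHABET`, the `<pad>` symbol, the random weights, and the helper names (`make_head`, `decode_token`, `output_params`) are all hypothetical. Each of the k heads maps the same hidden state to one character, and the characters are concatenated into the output string.

```python
import random

# Hypothetical character set; the abstract does not specify the real one.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["<pad>"]

def make_head(hidden_dim, charset_size, rng):
    """One small linear head: a (charset_size x hidden_dim) weight matrix."""
    return [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
            for _ in range(charset_size)]

def head_logits(head, hidden):
    """Linear projection of the hidden state onto the character set."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]

def decode_token(heads, hidden):
    """Each head greedily picks one character from the same hidden state."""
    chars = []
    for head in heads:
        logits = head_logits(head, hidden)
        best = max(range(len(logits)), key=logits.__getitem__)
        chars.append(ALPHABET[best])
    # Assumption: padding symbols are simply dropped from the output string.
    return "".join(c for c in chars if c != "<pad>")

def output_params(hidden_dim, num_heads, charset_size):
    """Weights in k character heads (biases ignored)."""
    return num_heads * charset_size * hidden_dim

rng = random.Random(0)
heads = [make_head(8, len(ALPHABET), rng) for _ in range(4)]
hidden = [rng.uniform(-1, 1) for _ in range(8)]
decoded = decode_token(heads, hidden)
```

For a sense of scale under assumed sizes: a single projection onto a 128,000-entry vocabulary with hidden size 4096 needs 128000 × 4096 weights, whereas `output_params(4096, 10, 256)` counts the weights of ten 256-way character heads.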


Article 248

Title@2025-07-22 (2): WhatsApp Tiplines and Multilingual Claims in the 2021 Indian Assembly Elections

Title: WhatsApp Tiplines and Multilingual Claims in the 2021 Indian Assembly Elections WhatsApp Tipps und Mehrsprachige Behauptungen bei den Wahlen zur indischen Versammlung 2021 2021年印度议会选举中什么是App Tiplines和多语种权利主张 2507.16298v1

Authors (2): Gautam Kishore Shahi, Scot A. Hale

WhatsApp tiplines, first launched in 2019 to combat misinformation, enable users to interact with fact-checkers to verify misleading content. This study analyzes 580 unique claims (tips) from 451 users, covering both high-resource languages (English, Hindi) and a low-resource language (Telugu) during the 2021 Indian assembly elections using a mixed-method approach. We categorize the claims into three categories, election, COVID-19, and others, and observe variations across languages. We compare content similarity through frequent word analysis and clustering of neural sentence embeddings. We also investigate user overlap across languages and fact-checking organizations. We measure the average time required to debunk claims and inform tipline users. Results reveal similarities in claims across languages, with some users submitting tips in multiple languages to the same fact-checkers. Fact-checkers generally require a couple of days to debunk a new claim and share the results with users. Notably, no user submits claims to multiple fact-checking organizations, indicating that each organization maintains a unique audience. We provide practical recommendations for using tiplines during elections with ethical consideration of users’ information.

2019年首次推出的“App小标题”旨在打击错误信息,使用户能够与事实检查者互动,以核实误导内容。本研究报告分析了451个用户在2021年印地语和低资源语言(Telugu)的2021年印地安大会选举期间,使用混合方法,对高资源语言(英语、印地语)和低资源语言(Telugu语)的580项独有索赔要求(tips)进行了分析。我们将这些索赔要求分为三类,即选举、COVID-19和其他人,并观察不同语言的差异。我们通过频繁的单词分析和神经句嵌入组合,比较内容相似性。我们还调查不同语言和事实检查组织之间的用户重叠情况。我们还测量了拆卸索赔要求和通知提示用户用户所需的平均时间。结果显示,有些用户用多种语言向同一事实检查者提供提示,通常需要几天的时间来拆解新索赔要求,并与用户分享结果。值得注意的是,没有用户向多个事实核对组织提交索赔要求,表明每个组织都有独特的受众。我们提供在选举期间使用小标题的实用建议。


Article 249

Title@2025-07-22 (2): Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction

Title: Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction Jenseits isolierter Punkte: Benchmarking strukturierter Tabellenkonstruktion als Vertiefung der Wissensextraktion 孤立点以外的孤立点:作为深知识采掘的 2507.16271v1

Authors (10): Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Le Sun

With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.

随着大型语言模型(LLMs)的出现,人们期望LLMs能够有效地从复杂的现实世界文件(例如文件、报告)中获取明确的信息。然而,大多数LOMs生成了混乱、无组织且无法追踪的段落式答案。为了缩小这一差距,我们引入了“安排和有组织的采掘基准”(AOE),这是一个新的双语基准,其数据和文件长度各异,旨在系统评估LLMs理解零散文件并将孤立信息重建成一个有组织的表格的能力。与传统文本-表格任务不同,它依赖固定的Schema和狭窄的任务域,AOE包含11项精心设计的、跨越三个不同领域的任务,要求根据不同的投入询问制定针对具体情况的模型。在实验中,我们评估了开放源和封闭源的艺术LMs状态。结果显示,即使是最先进的模型也进行了重大斗争。基准见https://huggingface.co/datasets/tianyum/AOE。


Article 250

Title@2025-07-22 (2): iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss

Title: iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss iShumei-Chinchunmei bei SemEval-2025 Task 4: Ein ausgewogenes Multi-Task-Framework für Vergessen und Retention mit effektivem Lernverlust SemEval-2025任务4:利用有效的不学习损失,平衡地忘记和保留多任务框架 2507.16263v1

Authors (2): Yujian Sun, Tian Li

As Large Language Models (LLMs) gain widespread adoption, increasing attention has been given to the challenge of making an LLM forget non-compliant data memorized during pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLMs under limited computational resources. To advance research in this area, SemEval 2025 Task 4: “Unlearning Sensitive Content from Large Language Models” introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.

随着广泛采用大语言模式(LLM),人们越来越注意如何在培训前忘记不符合要求的数据,使LLM在培训前忘记不合格数据。机器不学习的重点是在有限的计算资源下有效删除LLM的敏感信息。为了推进这一领域的研究,SemEval 2025任务4:“从大语言模式中取消敏感内容”引入了三个未学习的数据集,并通过评价放弃有效性和保持标准能力来建立基准。在这项工作中,我们建议采用一种更可控制的忘记损失,有效不学习损失,并探索将其与各种技术相结合,以便实现更有效和控制的不学习。我们的系统最终在竞争领导板上排名第五。


Article 251

Title@2025-07-22 (2): Efficient RL for optimizing conversation level outcomes with an LLM-based tutor

Title: Efficient RL for optimizing conversation level outcomes with an LLM-based tutor Effizienter RL zur Optimierung der Gesprächsergebnisse mit einem LLM-basierten Tutor 与一个以LLM为主的辅导员进行高效RL,以优化对话级别成果 2507.16252v1

Authors (6): Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar

Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor’s behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor’s next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

大型语言模型(LLMS)基于现有的强化学习,并基于人类反馈框架(RLHF),通常根据直接转轨的人的偏好优化反应。然而,这种方法在多方向对话环境中(如在线数学辅导)不尽如人意。我们建议了一种方法,通过以低维潜伏的学生代表形式来代表对话历史来增强基于LLM的辅导员,并优化长期政策以确定基于潜伏状态的高级别行动。目的是更好地将辅导员的行为与指导学生自行解决目标数学问题的长期目标相协调。我们的模式比培训辅导员政策端到端直接输出辅导员下一次演讲的先前工作要少计算资源。我们的实验结果表明,这些修改导致长期效果的改善,而模拟LLMM的辅导任务则随之而来。


Article 252

Title@2025-07-22 (2): FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents

Title: FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents FinResearchBench: Ein auf Logic Tree basierender Agent-as-a-Richter-Evaluierungsrahmen für Finanzforschungsagenten 金融研究时间:基于逻辑树的金融研究代理评估框架 2507.16248v1

Authors (7): Run Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu

AI agents are rapidly evolving in intelligence and are widely used in professional research applications such as STEM, software development, and finance. Among these AI agents, the deep research agent is a key category, as it can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic-tree-based Agent-as-a-Judge framework that specifically targets financial research agents. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are twofold: (1) it is an innovative, first-of-its-kind Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as intermediate information to produce a comprehensive, reliable, and robust evaluation; (2) it is finance-oriented, covering 70 typical financial research questions spread across 7 frequently encountered types of tasks in the domain.

最近,AI代理机构在情报方面迅速发展,并广泛用于专业研究应用,如STEM、软件开发、金融等。在这些AI代理机构中,深层研究代理机构是一个关键类别,因为它能够执行长期横向任务并解决更为复杂的问题。然而,很少有评价框架和基准可以系统和自动地调查这些研究机构的能力。此外,金融研究问题具有独特的复杂性和微妙性。为了填补这一空白,我们提议FinResearch Bennch,这是一个基于逻辑的树型代理和具体针对金融研究代理机构的目标。它提供了对金融研究领域的7类关键任务的研究代理机构的全面和自动评估。这项工作的贡献有两重:(1) 第一个和创新的As-A-Judge系统,它提取了研究成果的逻辑树并利用它作为中间信息来提出全面、可靠和有力的评价;(2)以金融为导向,它涵盖70个典型的金融研究问题,它涵盖这一领域经常遇到的7类任务。


Article 253

Title@2025-07-22 (2): MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Title: MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment MPO: Ein effizientes Post-Processing-Framework zum Mischen unterschiedlicher Präferenzen MPO: 混合多种优惠协调的高效处理后框架 2502.18699v3

Authors (5): Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang

Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.

从人类反馈中强化学习(RLHF)在调整大型语言模式方面显示了希望。然而,它依赖单一奖励模式往往忽略了人类偏好的多样性。最近的做法通过利用多维反馈来微调相应的奖赏模式,并利用强化学习来培训LMS来应对这一局限性。然而,这一过程成本高且不稳定,特别是考虑到人类偏好的竞争性和多样性性质。在本文件中,我们提议混合优先优化(MPO),这是一个综合单一目标政策的后处理框架,作为多目标RLHF(MORLHF)和MaxMin-RLHF(MLHF)的替代方案。MPO避免从零开始调整。相反,它将现有政策合并成一个统一的政策,与通过分批相近的镜像下降计算出来的每项政策的权重。经验表明,MPO在各种偏好中取得了平衡的业绩,业绩优于或匹配现有模型,计算成本显著降低。
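
The log-linear combination mentioned in the MPO abstract above can be illustrated with a small sketch. It covers only the combination step, with the mixed policy proportional to the product of each policy's probability raised to its weight. The weights are supplied by hand here, whereas the abstract states they are computed via batch stochastic mirror descent, which this sketch does not attempt. The function name `log_linear_mix`, the toy policies, and the three-response support are hypothetical, and all probabilities are assumed strictly positive.

```python
import math

def log_linear_mix(policies, weights):
    """Combine policies over a shared support: pi(y) proportional to prod_i pi_i(y) ** w_i.

    policies: list of dicts mapping each response to a strictly positive probability.
    weights:  one non-negative weight per policy.
    """
    support = list(policies[0].keys())
    # Weighted sum of log-probabilities for each candidate response.
    log_scores = {
        y: sum(w * math.log(p[y]) for p, w in zip(policies, weights))
        for y in support
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {y: math.exp(s - top) / norm for y, s in log_scores.items()}

# Two toy single-objective policies over three candidate responses.
helpful = {"a": 0.7, "b": 0.2, "c": 0.1}
harmless = {"a": 0.1, "b": 0.3, "c": 0.6}
mixed = log_linear_mix([helpful, harmless], [0.5, 0.5])
only_helpful = log_linear_mix([helpful, harmless], [1.0, 0.0])
```

Putting all weight on one policy recovers that policy, which is a quick sanity check for the normalization.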


Article 254

Title@2025-07-22 (2): Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing

Title: Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing Das Heilige modellieren: Überlegungen bei der Verwendung von religiösen Texten in der natürlichen Sprachverarbeitung 示范神圣:在自然语言处理中使用宗教文字时的考虑 2404.14740v3

Authors (1): Ben Hutchinson

This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP’s use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities.

这份立场文件涉及在自然语言处理中使用宗教文字的问题,这对自然语言处理具有特别的意义。宗教文字是具有文化重要性的价值观的表达方式,机器学习的模式倾向于在其培训数据中复制已编码的文化价值观。此外,在语言数据稀少时,国家语言处理方案研究人员经常使用宗教文字的翻译。这重新利用了翻译的原始用途和动机,这往往涉及吸引新的信徒。本文认为,国家语言处理方案使用这些文字引起了超越模式偏见的考虑,包括数据出处、文化背景及其在传教中的使用。我们主张更多地考虑研究人员的定位以及边缘化语言和宗教社区的观点。


Article 255

Title@2025-07-22 (2): Hierarchical Budget Policy Optimization for Adaptive Reasoning

Title: Hierarchical Budget Policy Optimization for Adaptive Reasoning Hierarchische Budgetpolitik Optimierung für adaptives Reasoning 适应性合理理由的等级预算政策优化 2507.15844v2

Authors (10): Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang

Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.

大型推理模型通过广泛的思维链生成而取得显著的绩效,但是通过应用统一的推理战略而不论问题的复杂性如何复杂,却表现出巨大的计算效率低。我们介绍了一个强化学习框架(HBPO),这是一个强化学习框架,使模型能够在不牺牲能力的情况下学习特定问题的推理深度。HBPO处理在效率导向培训中探索空间崩溃的基本挑战,在效率导向培训中,对长产出长度的偏差模型从必要的长的推理路径上系统地进行惩罚。通过分级预算探索,我们的方法将样本分解成多个具有不同象征性预算的分组,目的是在防止能力退化的同时实现高效的资源分配。我们引入了有区别的奖励机制,根据问题的复杂性创造出预算意识激励机制,允许模型发现任务要求和计算努力之间的自然对应关系。广泛的实验表明,HBPOPO将平均象征性使用量减少高达60.6%,同时在四种推理基准中提高准确度3.14%。与现有的方法不同,即施加外部限制或依赖离式模式选择的方法不同,HPO展示了新适应行为,从而自动调整基于问题复杂性的推理深度。我们的模型。我们的推理,我们的推理算结果表明,效率和能力能够同时保持结构结构上的高度不相冲突,同时保持多样化。
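
The idea of partitioning rollouts into subgroups with distinct token budgets, as described in the HBPO abstract above, might look roughly like this. The abstract does not give the reward formula, so `budget_aware_reward` is a purely hypothetical stand-in (a correctness reward that shrinks as a response overshoots its budget), and the budget values, the round-robin assignment, and the `penalty` parameter are likewise assumptions for illustration.

```python
def partition_rollouts(num_rollouts, budgets):
    """Assign rollout indices to token budgets round-robin.

    Returns a dict mapping each budget to the list of rollout indices in it.
    """
    groups = {b: [] for b in budgets}
    for i in range(num_rollouts):
        groups[budgets[i % len(budgets)]].append(i)
    return groups

def budget_aware_reward(correct, num_tokens, budget, penalty=0.5):
    """Hypothetical reward: 0 if wrong, otherwise 1 minus a penalty on
    the relative overshoot beyond the subgroup's token budget."""
    if not correct:
        return 0.0
    overshoot = max(0, num_tokens - budget) / budget
    return max(0.0, 1.0 - penalty * overshoot)

groups = partition_rollouts(8, [512, 1024])
within = budget_aware_reward(True, 400, 512)
double = budget_aware_reward(True, 1024, 512)
wrong = budget_aware_reward(False, 100, 512)
```

Because each subgroup is scored against its own budget, long reasoning paths remain represented in the high-budget groups instead of being penalized uniformly.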


Article 256

Title@2025-07-22 (2): Towards Compute-Optimal Many-Shot In-Context Learning

Title: Towards Compute-Optimal Many-Shot In-Context Learning Auf dem Weg zu einem rechnerisch-optimalen, viel scharfen In-Context-Lernen 迈向计算最优化的多个热点内文体学习 2507.16217v1

Authors (10): Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.

长长的大型语言模型(LLMS)能够处理包含多达几百万个象征物的投入。 在文中学习(ICL)范围内,这意味着使用成千成千的演示来进行快速输入,使得能够多发ICL。在实践上,一套固定的演示往往是在多发环境中随机选择的,原因是:(1) 推论成本高,(2) 缓存和重复计算的好处,(3) 战略提供的类似业绩,与规模扩大时的其他人相比。在这项工作中,我们提出了两种直接的战略,用于在多发的ICL中进行演示选择,用最小的计算间接费用改进性能。我们的第一种方法是将少量的演示结合起来,根据它们与每个试样的相似性挑选出来,而有不相称的更多随机演示集成。第二个战略改进了第一套示范,办法是用通过K-比例组合的试样展示中选的百分解器取代随机演示。我们与几个数据集的Gemini Pro和闪烁的实验表明,我们的战略始终不匹配随机选择,超越或符合最出色的选择方法,同时支持不同程度的测试,同时支持不同程度的进度的进度。
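
The first strategy in the many-shot ICL abstract above (a few demonstrations chosen by similarity to the test sample, plus a larger cached random set) can be sketched as below. This is an assumption-laden illustration, not the authors' code: the embeddings are plain lists of floats assumed to be non-zero, cosine similarity is an assumed choice of similarity measure, and placing the cached block first (so that a prefix cache could be reused across test samples) is a guess at how caching would be exploited. The helper names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_cached_pool(num_demos, n_random, seed=0):
    """Pick the fixed random demonstrations once; they are shared by all test samples."""
    rng = random.Random(seed)
    return rng.sample(range(num_demos), n_random)

def select_demos(test_vec, demo_vecs, cached_ids, n_similar):
    """Return demonstration indices: the cached block first, then the
    n_similar remaining demonstrations most similar to the test sample."""
    cached = set(cached_ids)
    candidates = [i for i in range(len(demo_vecs)) if i not in cached]
    candidates.sort(key=lambda i: cosine(test_vec, demo_vecs[i]), reverse=True)
    return list(cached_ids) + candidates[:n_similar]

demo_vecs = [[1, 0], [0, 1], [1, 1], [0.9, 0.1], [-1, 0]]
chosen = select_demos([1, 0], demo_vecs, cached_ids=[1, 4], n_similar=2)
pool = build_cached_pool(len(demo_vecs), 2, seed=0)
```

The second strategy in the abstract replaces the random block with demonstrations chosen via k-means centroids of test-sample representations; that step is not shown here.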


Article 257

Title@2025-07-22 (2): Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Title: Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models Promptomatix: Ein automatisches Optimierungs-Framework für große Sprachmodelle 即时表达式:大语言模型自动快速优化框架 2507.14241v2

Authors (9): Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with a modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.

大型语言模型(LLMs) 以精巧的速率表现最佳,然而,迅速的工程工程仍然是手工操作,前后不一,非专家无法进入。我们引入了 “ 即时优化 “ 这一自动快速优化框架,将自然语言任务描述转换成高质量的速率,而不需要手工调整或域内专门知识。 “ 即时优化 “ 既支持一个轻量的元速优化器,又支持一个DSPy动力编集器,其模块设计使未来能够扩展至更先进的框架。该系统分析用户意向,生成合成培训数据,选择快速战略,并利用成本意识目标改进提示。在5个任务类别中评估, “ 即时优化 “ 与现有图书馆相比,实现了竞争性或优异性业绩,同时缩短快速的长度和计算间接费用,从而迅速优化可扩展和效率。


Article 258

Title@2025-07-22 (2): Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Title: Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models Prompt4Trust: Ein Verstärkungs-Learning Prompt Augmentation Framework für klinisch ausgerichtete Vertrauenskalibrierung in multimodalen großen Sprachmodellen 提示4信任:在多式大语言模式中加强学习学习,促进临床一致信心校正的快速增强框架 2507.09279v3

Authors (4): Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

多式大型语言模型(MLLMs)在医疗保健应用方面具有相当大的希望,但是,在安全关键环境下部署这些模型受到两大限制:(一) 对迅速设计敏感,以及(二) 极有信心地产生不正确反应的倾向。由于临床医生可能依赖模型所声明的自信来测量其预测的可靠性,因此尤为重要的是,当模型表示高度信任时,它也非常准确。我们引入了快速增强MLLM信心校准的第一个强化学习(RL)框架,即快速增强MLLM信心校准的第一个强化学习(RL)框架。一个轻量级LMM公司经过培训,以产生对上下游任务进行引导,对迅速设计设计设计,更准确地反映预测的准确性。与常规校准技术不同,Tright4Truust公司具体地将校准对安全可靠的临床决策最为关键的部分列为优先事项。除了由临床驱动的校准目标驱动的改进外,我们拟议的方法还可以提高任务准确性,实现我们发现的最新医学直观回答(VQA)在PMC-VQA标准下游任务中产生更精确的附加的MLMLILLLLLLLS标准。
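
The calibration property discussed in the Prompt4Trust abstract above (stated confidence should match actual accuracy) is commonly quantified with expected calibration error (ECE). The sketch below is a generic ECE computation offered as background only. The abstract does not say which calibration metric or reward the authors use, and the equal-width binning and `num_bins` default are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: stated confidences in [0, 1].
    correct:     booleans, True where the corresponding answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Overconfident toy model: says 0.9 but is right 3 times out of 4.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
# Confidence 1.0 and always right: no calibration gap.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value means the stated confidence tracks accuracy more closely, which is the behavior the abstract argues matters most when clinicians rely on high-confidence answers.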


Article 259

Title@2025-07-22 (2): Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task

Title: Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task Haben große Sprachmodelle eine Planungstheorie des Geistes? Beweise von MindGames: eine mehrstufige Überzeugungsaufgabe 大语言模型是否具有规划思维理论?来自MindGames的证据:多功能透析任务 2507.16196v1

Authors (6): Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones

Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents’ behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others’ mental states. We present MindGames: a novel ‘planning theory of mind’ (PToM) task which requires agents to infer an interlocutor’s beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; $p=0.006$). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people’s preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone’s preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.

最新证据表明大型语言模型(LLMs) 显示心智( ToM) 能力。 大多数 ToM 实验将参与者置于旁观角色, 他们预测和解释其他代理者的行为。 然而, 人类 ToM 也有助于动态规划行动和对他人精神状态进行战略干预。 我们介绍MindGames: 一个新的“ 心智规划理论” (PToM) 任务, 要求代理者推断对话者的信仰, 并想说服他们改变行为。 与以往的评估不同, 我们明确评价TOM 的运用情况。 我们发现, 人类在 PToM 任务中大大超过 o1- preview( ALM ) 。 我们假设这一点是因为人类有其他代理者的隐含的因果关系模型( 例如,他们知道,我们的任务要求, 询问人们的偏好 ) 。 相比之下, o1- preview 使人类处于一个基本状况, 需要类似数量规划但最低精神状态( 例如, o1- preview) (e. greal- preview) (eaching a laful lave laxes) 之间, 当人类在给人规划上已经存在重大的偏差。


Article 260

Title@2025-07-22 (2): SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior

Title: SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior SciFi-Benchmark: Leveraging Science Fiction zur Verbesserung des Roboterverhaltens SciFi-基准:利用科学信条改进机器人行为 2503.10706v2

Authors (3): Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani

Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM’s recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi-inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark, which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.

鉴于最近人工智能(AI)和机器人的进步速度,一个令人发指的问题正在出现:由新兴的AI系统控制的机器人是否会与人类价值观紧密结合?在这项工作中,我们提出一个可伸缩的方法来调查这一问题,方法是在824个主要科幻小说文献(摩维、图文、小说和科学书籍)中,在824个关键时刻制作一个基准,在824个主要科幻小说(摩维、图文、小说和科学书籍)中,一个代理(AI或机器人)做出了重要决定(好或坏)。我们利用一个最先进的LMM对每个关键时刻的重新收集,在类似情况下产生问题,由该代理人做出的决定和它可能作出的替代决定(好或坏)。我们然后用一个可伸缩的方法来测量模型与人类价值观的匹配程度。我们通过修正过程来自动改进规则,以便产生第一个Sci-Fi启发的宪法,在现实世界中促进道德行为(好或坏的)。我们的第一个发现,现代的LMS-RM和宪法的交配比, 与人类价值观(95.8 % ) 和直立报告,因此显示一个正常的比正常的比更精确更精确更精确更精确更精确更精确的比,我们更精确更精确更精确更精确的比,我们更精确的比。


Article 261

Title@2025-07-22 (2): SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

Title: SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment SEITE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型 2507.07939v2

Authors (7): Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan

While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.

虽然视觉-语言模型(VLMS)在一般多式联运任务中显示出有希望的进展,但它们往往在工业异常探测和推理方面挣扎,特别是在提供可解释的解释性解释和对看不见的类别加以概括方面,这种限制源于异常探测的内在领域性质,这妨碍了现有VLMs在需要精确、结构化和背景意识分析的工业情景中的适用性;为了应对这些挑战,我们提议SAGE(基于VLM(VLM)的框架),通过自我指导事实增强(SFE)和Etropy-awe直接优化(E-DPO)来强化异常推理;SFE通过事实提取和聚合将特定领域的知识纳入视觉推理,而E-DPO(E-DPO)则利用英特罗普-意识优化将模型产出与专家偏好性调整。此外,我们采用AD-PL(AD-PL)和优选-AGE(AGE)数据集,其中包括28,415个由专家排序答复的解答案例。为了评估异常推理模型,我们开发了多级逻辑评价(MLE),一个定量框架,一个定量框架分析逻辑/CRisal-shet-slasset-asset)的逻辑和现有数据标准。


Article 262

Title@2025-07-22 (2): Characterizing Online Activities Contributing to Suicide Mortality among Youth

Title: Characterizing Online Activities Contributing to Suicide Mortality among Youth Charakterisieren von Online-Aktivitäten, die zur Selbstmordsterblichkeit unter Jugendlichen beitragen 确定造成青年自杀死亡率的在线活动 2507.16185v1

Authors (10): Aparna Ananthasubramaniam, Elyse J. Thulin, Viktoryia Kalesnikava, Silas Falde, Jonathan Kertawidjaja, Lily Johns, Alejandro Rodríguez-Putnam, Emma Spring, Kara Zivin, Briana Mezuk

The recent rise in youth suicide highlights the urgent need to understand how online experiences contribute to this public health issue. Our mixed-methods approach responds to this challenge by developing a set of themes focused on risk factors for suicide mortality in online spaces among youth aged 10-24, and a framework to model these themes at scale. Using 29,124 open text summaries of death investigations from 2013 to 2022, we conducted a thematic analysis to identify 12 types of online activities that were considered by investigators or next of kin to be relevant in contextualizing a given suicide death. We then developed a zero-shot learning framework to model these 12 themes at scale, and analyzed variation in these themes by decedent characteristics and over time. Our work uncovers several online activities related to harm to self, harm to others, interpersonal interactions, activity levels online, and life events, which correspond to different phases of suicide risk from two prominent suicide theories. We find an association between these themes and decedent characteristics like age, means of death, and interpersonal problems, and many themes became more prevalent during the 2020 COVID-19 lockdowns. While digital spaces have taken some steps to address expressions of suicidality online, our work illustrates the opportunities for developing interventions related to less explicit indicators of suicide risk by combining suicide theories with computational research.

最近青年自杀率的上升突出表明,迫切需要了解网上经验如何有助于解决这一公共健康问题。我们混合方法的方法通过制定一套侧重于10-24岁青年在在线空间自杀死亡风险因素的主题以及规模模型这些主题的框架来应对这一挑战。我们利用2013-2022年期间死亡调查的29,124份公开文本摘要,进行了专题分析,以查明调查人员或近亲认为与某种自杀死亡背景相关的12类在线活动。然后我们制定了一个零点点学习框架,以模拟这12个主题的规模化,并分析这些主题的变异性。我们的工作发现了一些与自我伤害、对他人的伤害、人际互动、在线活动水平和生活事件有关的在线活动,这与两个突出的自杀理论中自杀风险的不同阶段相对应。我们发现这些主题与年龄、死亡手段和人际问题等衰落特征之间的联系。在2020 COVID-19 锁定期间,许多主题变得更加普遍。数字空间采取了一些步骤,解决自杀风险的表达方式,将自杀理论与自杀理论结合起来。我们的工作展示了各种机会。


Article 263

Title@2025-07-22 (2): BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset

Title: BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset BIDWESH: Ein auf Bangla basierender Hass-Spracherkennungs-Datensatz BIDWESH:孟加拉地区基于孟加拉的仇恨言论检测数据集 2507.16183v1

Authors (8): Azizul Hakim Fayaz, MD. Shorif Uddin, Rayhan Uddin Bhuiyan, Zakia Sultana, Md. Samiul Islam, Bidyarthi Paul, Tashreef Muhammad, Shahriar Manzoor

Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.

尽管在为标准孟加拉语检测仇恨言论方面取得了进展,但现有的数据集和系统未能解决Barishal、Noakhali和吉大港等方言中出现的非正式和文化上丰富的表达方式,这种监督导致检测能力有限和有偏向性,导致大量有害内容下落不明。为弥补这一差距,本研究报告引入了BIDWESH,这是第一个多对称孟加拉语仇恨言论数据集,通过将BD-SHS文集中的9,183例翻译成三种主要区域方言并作注解,从BD-SHS文中建立起来。每个条目都经过人工核实并标注了仇恨存在、类型(lander、性别、宗教、暴力呼唤)和目标群体(个人、男性、女性、群体)和目标群体(个人、男性、女性、群体),确保语言和背景的准确性。由此形成的数据集为在Bangla推进仇恨言论检测工作提供了语言上丰富、平衡和包容性的资源。BIDWESH为发展对方言语敏感的NLP工具奠定了基础,并且极大地促进了低背景语言的中调。


Article 264

Title@2025-07-22 (2): R-Bot: An LLM-based Query Rewrite System

Title: R-Bot: An LLM-based Query Rewrite System R-Bot: Ein LLM-basiertes Abfrage-Rewrite-System R-Bot:一个基于LLM的查询重写系统 2412.01661v2

Authors (6): Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, Xiang Yu, Jianhua Feng, Yong Zhang

Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidences for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidences for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidences to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on real-world datasets and widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods. The R-Bot system has been deployed at Huawei and with real customers, and the results show that the proposed R-Bot system achieves lower query latency.

查询重写对于优化 SQL 查询、在不改变结果的前提下提高执行效率至关重要。传统上,这项任务通过启发式方法和基于学习的方法来处理,二者分别存在质量较差和鲁棒性较低的局限。LLM 的最新进展凭借其出色的自然语言和代码理解能力提供了一种新范式。尽管潜力巨大,直接应用 GPT-4 等 LLM 仍面临幻觉等问题带来的挑战,即模型可能生成不准确或不相关的结果。为了解决这个问题,我们提出 R-Bot,一个采用系统化方法的基于 LLM 的查询重写系统。我们首先设计了多源重写证据准备流程,生成查询重写证据,以引导 LLM 避免幻觉。随后,我们提出一种结合结构分析和语义分析的混合结构-语义检索方法,为在线查询检索最相关的重写证据。接着,我们提出一种逐步式 LLM 重写方法,迭代地利用检索到的证据,并通过自我反思来选择和排列重写规则。我们在真实数据集和广泛使用的基准上进行了全面实验,结果表明 R-Bot 的性能优于最先进的查询重写方法。R-Bot 系统已在华为及真实客户环境中部署,结果显示该系统实现了更低的查询延迟。


Article 265

Title@2025-07-22 (2): Reasoning Does Not Necessarily Improve Role-Playing Ability

Title: Reasoning Does Not Necessarily Improve Role-Playing Ability Vernunft verbessert nicht unbedingt die Fähigkeit zum Rollenspiel 理由并不必然改善发挥作用的能力 2502.16940v2

Authors (3): Xiachong Feng, Longxu Dou, Lingpeng Kong

The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: “Can reasoning techniques enhance the role-playing capabilities of LLMs?” To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.

应用大型角色扮演语言模型(LLMS)正在学术和商业领域迅速扩展,这促使对高精度角色扮演模式的需求日益增加。与此同时,推理技术的迅速发展不断推高了LLMS的性能界限。这种实际角色扮演需求与不断演变的推理能力交织在一起,提出了一个重要的研究问题:“推理技术能够增强LLMS的角色扮演能力吗? ”为此,我们利用6个角色扮演基准、24个LLMS和3个截然不同的角色扮演战略进行全面研究,比较直接零弹角色扮演角色、与Tought链角色扮演角色以及利用推理优化LLMS的角色扮演角色的实效。 我们的研究结果显示,Cot可能降低角色扮演角色的绩效、推理优化的LMS能力不适合角色扮演角色。 我们根据广泛的实验性研究、提高CLMS的稳定性、提高CLM角色,我们提出了两个有希望的未来角色:提高CLMS-CLADR:提高真实性、提高CLMS-CRO-CLADF-CRIFF 和增强学习的稳定性。


Article 266

Title@2025-07-22 (2): SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Title: SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting SpiroLLM: Feinsteuerungsvorbereitete LLMs, um Spirogramm-Zeitreihen mit klinischer Validierung in COPD-Reporting zu verstehen SpiroLLM:在COPD报告中使用临床校验功能以理解螺旋射时间序列的精练预先培训的LMLM 微调 2507.16145v1

Authors (8): Shuhao Mei, Yongchao Long, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong

Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot yet understand spirograms, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

慢性阻塞性肺病(COPD)是一种主要的慢性呼吸道疾病,持续空气流限制,是一种主要的慢性呼吸道疾病,是造成残疾和死亡的一个主要全球原因。在肺功能测试(PFTs)期间定期收集的呼吸螺旋螺旋时间序列,在早期发现呼吸道疾病和监测肺功能方面发挥着关键作用。然而,目前用于肺部诊断的大多数AI模型都局限于输出分类结果,而没有为其诊断过程提供依据,而目前的大语言模型(LLLMs)仍然无法理解螺旋图,这严重限制了他们的临床信任和采纳。为了应对这一挑战,我们利用来自英国生物银行(UKBB)的234 028人的临床螺旋螺旋螺旋线时间序列来提议SpiroLLLM,这是第一个能够理解螺旋形的多式联运大型语言模型。模型通过Spiro Encoder 提取呼吸道曲线的形态特征,并将这些特征与在使用Spiroimoro Solution 的模型的统一隐性空间中的PFFT数值相匹配,最终授权一个大型语言模型来生成报告。 实验结果证实SpliroliloomLLLLM的精确20M 的模型,它以建立一个稳定的模型, 0.8988981 和远的深度的模型, 和直径直压的深度的深度的模型的深度的模型的模型将显示一个完整的模型。


Article 267

Title@2025-07-22 (2): L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Title: L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models L4Q: Parameter Effiziente Quantisierungsware Feinsteuerung bei großen Sprachmodellen L4Q:大语言模型参数有效量化-软件精美推荐 2402.04902v6

Authors (3): Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover accuracy loss. However, this approach has limitations in recovering the accuracy loss. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q significantly reduces QAT’s memory overhead, making its training cost comparable to LoRA, while preserving the advantage of QAT in producing fully quantized LLMs with high accuracy. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and Mistral models with instructional datasets, we showcase L4Q’s capabilities in language tasks and few-shot learning.

由于与大型语言模型(LLMS)有关的高记忆和计算成本,模型压缩技术,如量化(降低推论成本),以及低兰克适应(LORA)等降低培训成本的参数高效微调(PEFT)方法(降低培训成本),已获得显著的欢迎。这一趋势促使对量化-了解(QAT)技术进行积极研究,目的是保持模型准确性,同时在推断和培训期间尽量减少记忆管理费。以往的量化-认知(PEFT)方法通常将培训后量化(PTQ)应用于预先培训的LMS,随后是PEFT,以恢复准确性任务损失。与此同时,这一方法在恢复准确性损失方面有局限性。在本文件中,我们提议L4QQQ,一种将量化-软件培训(QAT)与LORA相结合的方法。通过使用记忆-优化的层设计,L4QQQQ大幅降低QAT的记忆管理费,使其培训成本与LARA4可比,同时保留QAT在以高精度的磁度生产精度磁盘磁四级LMM4和高精度的升级方法中,我们在高精度调整了数据中,并实现了将数据调整。
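
The contrast the abstract draws between decoupled schemes (quantize first, then fine-tune with LoRA) and a joint scheme that yields fully quantized weights can be illustrated with a small sketch. This is not L4Q's layer design: it assumes plain symmetric uniform quantization, a hypothetical 2x2 weight matrix, and a rank-1 update, and it only shows why folding the low-rank update into the weights before quantizing keeps the result on the quantization grid.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantization: snap each value to a (2**bits - 1)-level grid.

    Assumes at least one non-zero value, so the step size is positive.
    """
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels for 4 bits
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values], scale


def on_grid(values, scale, tol=1e-9):
    """True if every value is an integer multiple of the quantization step."""
    return all(abs(v / scale - round(v / scale)) < tol for v in values)


# Hypothetical 2x2 weight matrix (flattened) and a rank-1 LoRA update a * b^T.
w = [0.42, -0.17, 0.88, -0.63]
a, b = [0.10, -0.05], [0.30, 0.20]
delta = [ai * bj for ai in a for bj in b]

# Decoupled: quantize the base weights, keep the LoRA update in full precision.
w_q, step_w = fake_quantize(w, bits=4)
decoupled = [q + d for q, d in zip(w_q, delta)]

# Joint: fold the update into the weights first, then quantize the result.
joint, step_joint = fake_quantize([wi + di for wi, di in zip(w, delta)], bits=4)

print(on_grid(decoupled, step_w))   # False for these numbers
print(on_grid(joint, step_joint))   # True
```

In the decoupled case the effective weights leave the 4-bit grid as soon as the full-precision update is added, so inference needs either a second high-precision path or a re-quantization step; the joint case stays fully quantized.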


Article 268

Title@2025-07-22 (2): Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Title: Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v3

Authors (10): Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia

Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by 10-30%.

复方 AI 应用程序串联, 诸如基因语言模型、 文档检索器和嵌入模型等子组件。 在复合AI系统中应用传统系统优化, 如平行和管道管状, 难度很大, 因为每个组件在颗粒和其摄入的数据类型方面都有不同的限制。 新的数据往往是在中间计算过程中产生的, 文本流可能会被分割成小的、 独立的碎片( 如文档到句子) , 然后可以在以后部分计算中重新分类。 由于这一复杂性, 现有系统为复合AI查询服务, 没有充分利用平行和管道整合机会。 我们提出阿尔托, 这个框架通过流和平行操作自动优化执行复合AI查询。 本托引入了称为嵌巢祖先的新抽象信息, 元数据分级使系统能够正确跟踪复合AI 应用程序各组成部分的多种限制部分产出和汇总数据( 如文档到句子) 。 这一元数据从编程模型中自动推断出, 使开发者能够表达复杂的数据流模式, 不需要人工理解关于路径和汇总的详细信息。 我们介绍阿尔托, 一个框架自动优化执行复合的复合AI AI 10 的四种应用程序, 通过执行, ALformax 或 ALformax trap 10 lap lap lap lap lap 10 lap lap lap
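
The idea of tagging fragments with ancestry metadata so that they can be re-aggregated later can be sketched as follows. This is a toy illustration with one level of nesting, not Alto's programming model: the `split` helper, the `Aggregator` class, and the record layout are all hypothetical names chosen for the example.

```python
from collections import defaultdict


def split(record_id, text):
    """Split a document into sentence fragments tagged with (parent, index) ancestry."""
    parts = [s.strip() for s in text.split(".") if s.strip()]
    return [
        {"ancestry": (record_id, i), "siblings": len(parts), "data": part}
        for i, part in enumerate(parts)
    ]


class Aggregator:
    """Collects fragments in any order and emits a parent once all children arrive."""

    def __init__(self):
        self.pending = defaultdict(dict)

    def add(self, fragment):
        parent, index = fragment["ancestry"]
        bucket = self.pending[parent]
        bucket[index] = fragment["data"]
        if len(bucket) == fragment["siblings"]:
            del self.pending[parent]
            # Restore the original order using the index stored in the ancestry.
            return parent, [bucket[i] for i in range(fragment["siblings"])]
        return None


# Fragments from two documents arrive interleaved and out of order.
fragments = split("doc-a", "One. Two. Three.") + split("doc-b", "Four. Five.")
fragments.reverse()

aggregator = Aggregator()
completed = [r for r in map(aggregator.add, fragments) if r is not None]
print(completed)
# [('doc-b', ['Four', 'Five']), ('doc-a', ['One', 'Two', 'Three'])]
```

Because each fragment carries its parent identifier, its position, and the sibling count, downstream stages can process fragments in parallel and the aggregator needs no global ordering to know when a parent is complete.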


Article 269

Title@2025-07-22 (2): Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Title: Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition Generative Zeichenbeschreibung Prompts mit multi-positivem Kontrastivem Lernen für die Erkennung von Zeichensprachen 多积极的手语识别多反比学习生成手语识别信号描述提示 2505.02304v2

Authors (7): Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao

Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method’s cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.

由于手语和非手语信号的内在复杂性,手语识别(SLR)在创建准确的注释方面面临根本性挑战,因为手语识别(SLR)具有内在的复杂性。据我们所知,这是将基因化大型语言模型(LLMs)纳入SLR任务的首项工作。我们建议了一种新型的生成信号描述提示多积极的反向学习(GSP-MC)方法,该方法将检索增强的一代(RAG)与具体域域的LLMs(RAG)相连接,包括多步骤的快速工程和经专家验证的手语群体,以生成精确的多部分描述。普惠制-MC方法还使用双向编码结构,通过概率匹配(全球、同声调和部分级别)将等级骨架特征与多个文本描述(全球、同声调和部分级别)双向地结合。我们的方法将全球和部分级别的损失结合起来,优化 KLSL的差别,以确保所有相关的文本-sketon对子的精确匹配,同时捕捉到信号级的语系和详细的部分动态。实验表明中国SLRR500(达到97-LM-LUM-Lis-Lial-S-S-S-Lis-S-S-S-S-s-s-Slentalentalentalentalentalentalviews)的当前方法。
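
A multi-positive contrastive objective based on KL divergence can be sketched in a few lines. The version below is an assumption-laden illustration rather than the GSP-MC loss itself: it assumes the target distribution is uniform over all text descriptions that match a skeleton sample, and that the predicted distribution is a temperature-scaled softmax over similarity scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_positive_kl(similarities, positive_mask, temperature=0.1):
    """KL(target || predicted), with the target uniform over the positive texts."""
    predicted = softmax([s / temperature for s in similarities])
    n_positive = sum(positive_mask)
    target = [m / n_positive for m in positive_mask]
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# One skeleton sample scored against four text descriptions; the first two
# (say, a global and a part-level description) are both correct matches.
mask = [1, 1, 0, 0]
loss_aligned = multi_positive_kl([0.9, 0.8, 0.1, 0.0], mask)
loss_misaligned = multi_positive_kl([0.1, 0.0, 0.9, 0.8], mask)
print(loss_aligned < loss_misaligned)  # True
```

Unlike a single-positive InfoNCE loss, this form does not force the positives to compete with each other: the loss is minimized when the probability mass is shared across every matching description.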


Article 270

Title@2025-07-21 (1): Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Title: Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education Menschliche Empathie als Encoder: KI-Assisted Depression Assessment in Special Education 人类的同情作为编码器:大赦国际协助的特殊教育中抑郁症评估 2505.23631v2

Authors (1): Boning Zhao

Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students’ true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers’ empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional “Empathy Vector” (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input, enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.

评估特殊教育等敏感环境中的学生抑郁症具有挑战性。标准化问卷可能无法充分反映学生的真实情况。此外,自动化方法往往会随着学生的丰富叙述而动摇,缺乏来自教师与学生的同情性联系的关键和个性化的洞见。现有方法往往无法解决这种模糊性,或有效地融合教育者的理解。为了通过促进人类-AI的协同协作来克服这些限制,本文件介绍了人类同情作为Encoder(HEAE),这是一个以人类为中心的创新的、以人类为中心的AI框架,用于透明和对社会负责的抑郁症严重程度评估。我们的方法将学生叙述文字与教师衍生的、九维的“EVEV”(EV)及其由PHQ-9框架指导的维度,明确将默认的同情性洞见转化为结构化的AI投入,而不是取代人类判断力。严格实验优化了多式融合、文本代表制和分类结构结构,实现了7级严重性分类的82.74%的精确度。这项工作展示了通过结构性嵌入人类同情力,走向更负责任和道德影响性计算的道路。


Article 271

Title@2025-07-21 (1): Pixels, Patterns, but No Poetry: To See The World like Humans

Title: Pixels, Patterns, but No Poetry: To See The World like Humans Pixel, Muster, aber keine Poesie: Die Welt wie Menschen zu sehen 像素、图案、但没有诗歌:像人类一样看世界 2507.16863v1

Authors (14): Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone (effective for previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.

在多式大语言模型中,实现人性化的认知和推理仍然是人工智能中的一个中心挑战。虽然最近的研究主要侧重于加强MLLM的推理能力,但仍然存在一个根本问题:多式大语言模型能否真正将世界视为人性?本文将重点从推理转向人性?我们不是专门为推理而建立基准,而是引入了图灵眼测试(Turing Earth Test),这是一个具有挑战性的面向感知的基准,它包括四项诊断任务,评估MLLMS在人类直觉处理的合成图像上的性能。我们的调查结果显示,最新工艺的MLLLMS在我们的感知性任务上出现了灾难性的失败。我们在这个版本中发布了具有代表性的TET任务分组,并将引入更多样化的任务和方法,以提高未来工作的视觉化。


Article 272

Title@2025-07-21 (1): Efficient Compositional Multi-tasking for On-device Large Language Models

Title: Efficient Compositional Multi-tasking for On-device Large Language Models Effizientes kompositorisches Multi-Tasking für On-Device große Sprachmodelle 内部设计大型语言模型的高效组成多任务 2507.16083v1

Authors (6): Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

适应器参数为改变机器学习模式的行为提供了一种机制,在大型语言模型(LLMs)和基因化的AI的背景下,这些参数获得了显著的普及。这些参数可以通过一个称为任务合并的过程加以合并,以支持多重任务。然而,以往关于LLMs合并的工作,特别是在自然语言处理方面,仅限于每个测试示例只涉及单一任务的情景。在本文件中,我们侧重于设计设置,并研究基于文本的多任务构成问题,其中每个测试示例都涉及同时执行多种任务。例如,生成长文本的翻译摘要需要同时解决翻译和总和任务。为便利这一背景下的研究,我们提出了一个由四种实际相关的构成任务组成的基准。我们还提出了一种针对计算资源有限的在线应用的有效方法(可实现校准),强调需要找到既节约资源又高效的解决方案。我们的贡献为在现实世界多任务情景中提高LMs的能力奠定了基础,将它们扩大适用于复杂、资源紧缺使用案例的范围。
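
For readers unfamiliar with task merging, the simplest baseline is a weighted average of adapter parameters. The sketch below shows only that baseline; it is not the Learnable Calibration method proposed in the paper, and the parameter name `layer0.delta` and the two toy adapters are hypothetical.

```python
def merge_adapters(adapters, weights=None):
    """Weighted average of adapter parameter dicts that share names and shapes."""
    if weights is None:
        weights = [1.0 / len(adapters)] * len(adapters)
    merged = {}
    for name in adapters[0]:
        size = len(adapters[0][name])
        merged[name] = [
            sum(w * adapter[name][i] for w, adapter in zip(weights, adapters))
            for i in range(size)
        ]
    return merged


# Two toy adapters, e.g. one tuned for translation and one for summarization.
translate = {"layer0.delta": [1.0, 2.0]}
summarize = {"layer0.delta": [3.0, 4.0]}

merged = merge_adapters([translate, summarize])
print(merged)  # {'layer0.delta': [2.0, 3.0]}
```

A merged adapter of this kind costs no more to store or run than a single adapter, which is what makes merging attractive on-device; whether it handles both tasks in one pass is exactly the compositional question the benchmark targets.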


Article 273

Title@2025-07-21 (1): Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

Title: Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder Erforschen, wie Generative MLLMs mehr als CLIP mit dem gleichen Vision Encoder wahrnehmen 使用相同的愿景编码器探索如何产生比 CLIP 更远的多见性大型LLMs 2411.05195v3

Authors (3): Siting Li, Pang Wei Koh, Simon Shaolei Du

Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the same vision encoder and weights, indicating that these Generative MLLMs perceive more – as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.

最近的研究显示, CLIP 模式与视觉推理任务挣扎,这些任务需要基础成份、理解空间关系或捕捉细细细细节。 一个自然假设是, CLIP 愿景编码器没有为这些任务嵌入基本信息。 然而,我们发现,这并非总是这样: 编码器收集了与查询相关的视觉信息, 而 CLIP 未能提取这些信息。 特别是, 我们显示, 愿景- 语言模型( VLMS) 的另一分支, 生成多式多式语言模型( General MMLMS ) 在许多这些任务中比 CLIP 的精确度要高得多, 使用相同的视觉编码和重量, 表明这些Generaliz MLLMS 能够更有效地提取和利用视觉信息。 我们进行了一系列受控的实验, 并显示其成功归因于多个关键设计选择, 包括修饰符、 位置嵌嵌入和快速加权。 另一方面, 光是加强培训数据或应用更强的文本编码器, 不足以解决这项任务, 而额外的文本缩缩图案则通过直观的CLILMLLMs 的变动模型, 我们发现, 通过精化的缩的缩缩缩的MLVLVLVLVLVLVLVLBS 发现, 我们发现了微的模型发现, 。
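
The cosine similarity-based evaluation protocol mentioned above can be illustrated generically: embed the image and each candidate caption, then pick the caption whose embedding has the highest cosine similarity with the image embedding. The two-dimensional embeddings below are made-up numbers standing in for encoder outputs, and `pick_caption` is a hypothetical helper, not code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_caption(image_embedding, caption_embeddings):
    """Return the index of the best-matching caption and all similarity scores."""
    scores = [cosine(image_embedding, c) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical embeddings: caption 0 is "dog left of cat", caption 1 the reverse.
image = [1.0, 0.2]
captions = [[0.9, 0.1], [0.1, 0.9]]

best, scores = pick_caption(image, captions)
print(best)  # 0
```

Using one scoring rule for both CLIP and MLLM-derived encoders keeps the comparison about the embeddings themselves rather than about differences in decoding.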


Article 274

Title@2025-07-21 (1): The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

Title: The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models Die Aufforderung macht die Person(a): Eine systematische Bewertung der soziodemographischen Persona, die für große Sprachmodelle aufruft 《迅速使人成为人》(a):系统评价社会人口人口人a 《激发大语言模式》 2507.16076v1

Authors (5): Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier

Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

在大型语言模型(LLMs)中,人们越来越多地使用人促进作用模拟各种社会人口群体的观点;然而,如何制定人促进作用可以对结果产生重大影响,使人们对这种模拟的真实性产生担忧;我们利用5个开放源的LMs,系统地研究不同的人迅速战略,特别是角色采纳模式和人口边缘战略,影响15个交叉人口群体在开放和封闭任务方面的LLM模拟LM。我们的研究结果表明,LLMs为模拟边缘化群体,特别是非二元、西班牙裔和中东身份,而选择人口边缘和角色采纳战略会对其形象产生重大影响。具体地说,我们发现,以访谈形式和以名称为基点的拉玛-3.3-70B等较小模型比Llama-3.3-70B等较大的模型更能有助于减少陈规定型观念和改善一致性。我们的研究结果为在LM模拟研究中设计社会人口动态提示提供了可操作的指导。


Article 275

Title@2025-07-21 (1): Deep Researcher with Test-Time Diffusion

Title: Deep Researcher with Test-Time Diffusion Deep Researcher mit Test-Time Diffusion 具有试验时间扩散的深层研究员 2507.16075v1

Authors (18): Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee

Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a “denoising” process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.

由大语言模型(LLMS)推动的深层研究代理机构正在迅速进步;然而,在利用通用测试时间缩放算法生成复杂、长式的调查报告时,其性能往往处于高位。从人类研究的迭接性(涉及搜索、推理和修改的周期)中得到的灵感,我们建议试发扩散深层研究者(TTD-DR)。这个新颖的框架将研究报告的生成概念化为一个传播过程。TTD-DR以初步草案启动这一进程,这是一个可更新的骨架,作为指导研究方向的演变基础。然后,通过“隐蔽”进程对草案进行迭接式的完善,该过程以每个步骤都包含外部信息的检索机制为动态信息。核心过程由于对代理工作流程的每个组成部分应用自我进化算法而得到进一步的增强,确保生成高质量的传播过程。这个以中心设计使报告的撰写过程更加及时和连贯,同时减少迭接搜索过程中的信息损失。我们证明,我们的TD-DR在广泛深度研究中取得了最新的结果,需要大量搜索和多层次研究。
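
The draft-centric loop described in the abstract (start from a skeleton, then refine it step by step with retrieved information) can be sketched at a very high level. Everything here is a stand-in: the `CORPUS` dictionary replaces a real search tool, unfilled sections are marked with `None`, and the "revision" step simply inserts the retrieved text, whereas the actual system would use an LLM for both steps.

```python
# Hypothetical stand-in for an external search tool.
CORPUS = {
    "background": "Prior agents plateau on long-form reports.",
    "method": "Treat report writing as iterative draft refinement.",
    "results": "Evaluate on search-heavy, multi-hop benchmarks.",
}


def retrieve(section):
    """Look up supporting text for a section (a real system would run a search)."""
    return CORPUS.get(section, "")


def refine_draft(draft, retrieve_fn, max_steps=10):
    """Fill one unfinished section per step, keeping every intermediate draft."""
    history = [dict(draft)]
    for _ in range(max_steps):
        unfinished = [name for name, text in draft.items() if text is None]
        if not unfinished:
            break
        target = unfinished[0]
        evidence = retrieve_fn(target)          # external information at this step
        draft = {**draft, target: evidence}     # revise the draft with the evidence
        history.append(dict(draft))
    return draft, history


skeleton = {"background": None, "method": None, "results": None}
final, history = refine_draft(skeleton, retrieve)
print(len(history))  # 4: the skeleton plus one draft per refinement step
```

Keeping the whole draft as the state that each step updates is what lets later retrievals be guided by what the report already says, rather than by the original query alone.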


Article 276

Title@2025-07-21 (1): Erasing Conceptual Knowledge from Language Models

Title: Erasing Conceptual Knowledge from Language Models Auslöschen von konzeptionellen Kenntnissen aus Sprachmodellen 将概念知识从语言模式中除去 2410.02760v3

Authors (4): Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model’s own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model’s ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model’s broader capabilities. We demonstrate ELM’s efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously preserving generation coherence, maintaining benchmark performance on unrelated tasks, and exhibiting strong robustness to adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

在这项工作中,我们引入了语言记忆时代(ELM),这是对概念层面的不学习的一种原则性方法,通过匹配模型自身反省分类能力定义的分布来运行。我们的关键见解是,有效的不学习应当利用模型评估自身知识的能力,利用语言模型本身作为分类者来识别和减少产生与不理想概念相关内容的可能性。语言记忆时代应用了这一框架来创建有针对性的低级别更新,以减少特定概念内容的生成概率,同时保持模型的更广泛能力。我们展示了ELM在生物安保、网络安全和文学领域消除任务方面的效力。比较评价表明,经过修改的模型在针对被删除概念的评估上取得了近乎随机性的业绩,同时保持了生成的一致性,在不相干的任务上保持基准性,并展示了对抗性攻击的强力。我们的代码、数据和经过培训的模式可在https://elm.baulab.inf查阅。


Article 277

Title@2025-07-21 (1): AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering

Title: AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering AutoMeet: eine Proof-of-Concept-Studie von GenAI zur Automatisierung von Meetings in der Automobiltechnik AutoMeet:对genAI进行概念证明研究,以使汽车工程会议自动化 2507.16054v1

Authors (3): Simon Baeuerle, Max Radyschevski, Ulrike Pado

In large organisations, knowledge is mainly shared in meetings, which take up a significant amount of work time. Additionally, frequent in-person meetings produce inconsistent documentation: official minutes, personal notes, and presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit impressive performance on spoken and written language processing. This motivates a practical use of genAI for knowledge management in engineering departments: transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline that automates the entire meeting documentation workflow as a proof of concept: meetings are recorded and minutes are created by genAI. These are then made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.

在大型组织中,知识主要在需要大量工作时间的会议中分享,此外,经常的面对面会议产生不一致的文件 – – 官方会议记录、个人笔记、演示可能存在也可能不存在,因此,共享信息难于在会议之外检索,因此需要长时间更新和高频会议日程安排。大型语言模型(LLMs)等创制人工智能模型在口头和书面语言处理上表现出令人印象深刻的表现。我们工作的核心是实际使用genAI软件在工程部门进行知识管理:利用genAI进行书写会议,将各种其他信息来源纳入易于使用的格式,供临时搜索。我们采用端对端管道,在会议外将整个会议文件工作流程自动化,以验证概念状态:会议记录和会议记录由genAI创建。通过聊天室接口,这些模型更便于搜索。我们工作的核心是在现实世界工程部门中测试这种基于genAI的软件工具,并收集广泛的关于伦理和技术方面的调查数据。从现实世界直接反馈,使整个会议文件流程自动化,大大减少机遇和风险。


Article 278

Title@2025-07-21 (1): Continuously Updating Digital Twins using Large Language Models

Title: Continuously Updating Digital Twins using Large Language Models Kontinuierliche Aktualisierung von digitalen Zwillingen mit großen Sprachmodellen 利用大语言模式不断更新数字双双 2506.12091v2

Authors (3): Harry Amad, Nicolás Astorga, Mihaela van der Schaar

Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT’s competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.

数字双胞胎是真实世界系统的模型,可以模拟其动态以适应潜在行动。在复杂的环境下,状态和行动变量,以及与系统相关的现有数据和知识可以不断改变,要求数字双胞胎不断更新这些变化,以保持相关性。目前的做法在这方面挣扎,因为它们需要固定的、定义明确的建模环境,因此它们不能适应新的变数,而不重新设计,或不经过再培训就纳入新的信息。为了解决这个问题,我们用大型语言模型将数字结对设定为文字内学习问题,从而能够在推论时对双胞胎进行无缝更新。我们开发了CALM-DT,一个基于环境的、基于环境的、基于语言的模型数字双胞胎,它能够精确地模拟不同的州行动空间,单独使用经精细调整的编码器进行文字内学习,用于取样检索。我们从经验上展示了CALM-DT的竞争性表现和现有的数字双方法,以及它适应模拟环境变化而没有参数更新的独特能力。


Article 279

Title@2025-07-21 (1): mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages

Title: mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages mRAKL: Mehrsprachige Retrieval-erweiterte Wissensgraph-Konstruktion für ressourcenarme Sprachen mRAKL:多种语文检索增强的低资源语言知识图构建 2507.16011v1

Authors (5): Hellina Hailu Nigatu, Min Li, Maartje ter Hoeve, Saloni Potdar, Sarah Chasins

Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.

多语言知识图构建(mKGC)指的是在多语种环境中自动构建或预测缺失的实体和知识图链接的任务。在这项工作中,我们将 mKGC 任务重新配置为问答任务,并引入 mRAKL:一个基于检索-启动一代(RAG)的系统来实施 mKGC。我们通过使用主实体和在某个问题中连接关系来实现这一目标,并让我们的模型预测尾实体作为答案。我们的实验主要侧重于两种低资源语言:Tigrinya和Amharic。我们试验使用资源较高的阿拉伯语和英语进行跨语言传输。我们用BM25检索器发现,基于RAG的方法可以提高无文字环境的性能。此外,我们的研究显示,通过一个理想化的检索系统, mRAKL 将蒂格里尼亚和阿姆哈拉奇的准确率分别提高4.92和8.79个百分点。
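
The question-answering reformulation with BM25 retrieval described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the English toy corpus, the question template, the `build_prompt` helper, and the `k1`/`b` values are all assumptions introduced here, and the paper itself targets Tigrinya and Amharic.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would need language-aware tokenization.
    return text.lower().replace("?", "").split()


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with the standard BM25 formula."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(head, relation, passage):
    # Hypothetical template: head entity and relation become a question,
    # and the model is expected to answer with the tail entity.
    question = f"What is the {relation} of {head}?"
    return f"Context: {passage}\nQuestion: {question}\nAnswer:"


corpus = [
    "addis ababa is the capital of ethiopia",
    "asmara is the capital of eritrea",
    "the nile is a river in africa",
]
docs = [tokenize(d) for d in corpus]
query = tokenize("What is the capital of Eritrea?")
scores = bm25_scores(query, docs)
best = max(range(len(corpus)), key=scores.__getitem__)
prompt = build_prompt("Eritrea", "capital", corpus[best])
```

The no-context baseline mentioned in the abstract would correspond to calling the model with the question alone, without the retrieved passage.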


Article 280

Title@2025-07-21 (1): Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy

Title: Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy Risiken von KI-Wissenschaftlern: Priorisierender Schutz vor Autonomie AI 科学家的风险:将保障自治作为优先事项 2402.04247v5

Authors (13): Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein

AI scientists powered by large language models have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that require careful consideration for safety. However, there has been limited comprehensive exploration of these vulnerabilities. This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse, and emphasizing the need for safety measures. We begin by providing an overview of the potential risks inherent to AI scientists, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we explore the underlying causes of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding AI scientists and advocate for the development of improved models, robust benchmarks, and comprehensive regulations.

以大型语言模型为动力的大赦国际科学家在自主进行实验和促进不同学科的科学发现方面表现出了巨大的希望。虽然他们的能力很有希望,但这些代理人也带来了新的弱点,需要仔细考虑安全问题。然而,对这些弱点的全面探索有限。这一视角审视了大赦国际科学家的脆弱性,揭示了与滥用这些弱点有关的潜在风险,并强调了采取安全措施的必要性。我们首先概述了大赦国际科学家所固有的潜在风险,同时考虑到用户的意图、特定科学领域及其对外部环境的潜在影响。然后,我们探讨了这些弱点的根本原因,并对有限的现有工作进行了范围界定审查。根据我们的分析,我们提出了一个三重框架,涉及人类监管、代理人调整和对环境反馈的理解(代理人监管),以减轻这些已查明的风险。此外,我们强调与保护大赦国际科学家有关的限制和挑战,并倡导制定改进的模式、健全的基准和全面监管。


Article 281

Title@2025-07-21 (1): Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback

Title: Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback Helfen Sie mir, eine Geschichte zu schreiben: Bewertung der Fähigkeit von LLMs, Schreiben Feedback zu generieren 帮助我写一个故事:评估LLMS的生成写作反馈的能力 2507.16007v1

Authors (4): Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata

Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects, providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.

LLM公司能否通过提供有意义的书面反馈向创造性作家提供支持?在本文中,我们通过界定新的任务、数据集和评价框架,探讨模式产生的书面反馈的挑战和局限性。为了以有控制的方式研究模型性表现,我们提出了一套1 300个新测试,我们故意腐蚀了1 300个故事,以引入写作问题。我们用自动和人文评价指标来研究这一任务中常用LLM公司的业绩。我们的分析表明,目前的模型在许多方面都有着强烈的外在行为 – – 提供具体和大部分准确的写作反馈。然而,模型往往无法确定故事中最大的写作问题,也无法正确决定何时提供批评和正面反馈。


Article 282

Title@2025-07-21 (1): Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme Agent KB: 利用跨域经验解决代理问题 2507.06229v4

Authors (18): Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

Current AI agents cannot effectively learn from each other’s problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies demonstrate that the refinement module proves most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.

目前,大赦国际代理商无法有效地相互学习解决问题的经验,或利用过去的成功经验来指导新任务中的自我反思和错误纠正。我们引入了KB代理商,这是一个共享的知识库,既包含解决问题的高级战略,又包含详细的执行经验,使跨代理商框架的知识转让成为可能。KB代理商实施了一个新型的教师-学生双阶段检索机制,学生代理商检索了用于战略指导的工作流程水平模式,而教师代理商则确定了需要改进的执行层面模式。这种等级化方法使代理商能够通过吸收外部来源的不同战略而打破有限的推理途径。对GAIA基准的评价表明取得了重大的业绩收益,KB代理商将成功率提高了6.06个百分点,总体在通过@1. 对于SWE-Bench代码修理任务,我们的系统大大提高了解决率,O3-mini在通过过程中实现了8.67%(23%至31.67%)的提高率。 我们的对比研究表明,改进模块证明最为关键,导致3.85%的3级任务上出现挑战性下降,突出表明有效的知识转让需要战略指导和执行层面的改进。


Article 283

Title@2025-07-21 (1): Learning without training: The implicit dynamics of in-context learning

Title: Learning without training: The implicit dynamics of in-context learning Lernen ohne Ausbildung: Die implizite Dynamik des In-Context-Lernens 缺乏培训的学习:内通性学习的隐含动态 2507.16003v1

Authors (5): Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

One of the most striking features of Large Language Models (LLMs) is their ability to learn in context. Namely, at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that stacking a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight update of the MLP layer.

大语言模型(LLM)最突出的特征之一是其上下文学习的能力。 也就是说,在推论时间,LLM能够在不增加重量的情况下学习新模式,而无需增加重量的更新,如果这些模式以实例的形式迅速出现,即使这些模式在培训期间没有被看到。 发生这种情况的机制在很大程度上仍然未知。 在这项工作中,我们表明,将自留层与MLP叠叠在一起,使变压器块能够根据上下文暗含地修改MLP层的重量。 我们通过理论和实验认为,这种简单机制可能是LLMM可以在上下文中而不是在培训中学习的原因。 具体地说,我们用温和简化的假设显示,变压器块如何将环境暗含地转化为低位的MLP层重量。
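
The low-rank weight update mentioned above can be checked numerically in a simplified setting. The sketch below assumes the first MLP layer is a plain linear map `W`, and that `a_query` and `a_with_context` are hypothetical attention outputs for the query token without and with the context. Under those assumptions, a rank-1 change to `W` reproduces the effect of the context exactly; the variable names and random data are illustrative only.

```python
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rank_one_update(W, a_query, a_with_context):
    """Return dW such that (W + dW) @ a_query equals W @ a_with_context."""
    delta = [c - q for c, q in zip(a_with_context, a_query)]
    w_delta = matvec(W, delta)
    norm_sq = sum(q * q for q in a_query)
    # Outer product (W @ delta) a_query^T / ||a_query||^2 has rank at most 1.
    return [[wd * q / norm_sq for q in a_query] for wd in w_delta]


random.seed(0)
d_in, d_out = 4, 3
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a_query = [random.gauss(0, 1) for _ in range(d_in)]          # no context
a_with_context = [random.gauss(0, 1) for _ in range(d_in)]   # with context

dW = rank_one_update(W, a_query, a_with_context)
W_updated = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, dW)]

lhs = matvec(W_updated, a_query)       # updated weights, context removed
rhs = matvec(W, a_with_context)        # original weights, context present
max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
```

The identity holds because `dW @ a_query` collapses to `W @ delta`, so the context's contribution is absorbed into the weights without any gradient step.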


Article 284

Title@2025-07-21 (1): Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

Title: Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation Hindi NER im niedrigen Kontext verbessern: Eine vergleichende Studie von Transformer-basierten Modellen mit vs. ohne Retrieval Augmentation 在低背景情况下加强印地语净净值:对以变换器为基础的模型的比较研究,与不回收增量的对比 2507.16002v1

Authors (3): Sumit Singh, Rohit Mishra, Uma Shanker Tiwary

One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. To improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with context retrieved from relevant external sources, notably Wikipedia. We fine-tuned MuRIL, XLM-R and Llama2-7B with and without retrieval augmentation (RA), whereas Llama2-70B, Llama3-70B and GPT3.5-turbo are used for few-shot NER generation. Our investigation shows that language models (LMs) with RA outperform baseline methods that don’t incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, with RA. Fine-tuned Llama2-7B outperforms Llama2-7B by a significant margin. The generative models that are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.

自然语言处理中的一项重大挑战是实体识别(NER),它识别并区分了在文字输入中命名的实体。为了改进 NER,本研究调查了一种印地语净化技术,该技术使用印地语预先培训的编码器(MuriL和XLM-R)和起源模型(Llama-2-7B-chat-hf(Llama2-7B)、Llama-2-70B-70B-chat-hf(Llama2-70B)、Llama-3-70B-Instruct(Llam3-770B)和GPT3.5-turbo),并且利用从外部相关背景(特别是维基百科维基百科)检索的数据来补充数据。我们精细调整了MuriLMRM-RA2-7B和RA。然而,Lam2-70B、lama-70B和GPT-RA-RA-R-lorma-T数据运用了更好的数据方法。


Article 285

Title@2025-07-21 (1): Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Title: Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen Omni-Router: 分享语音识别专家的松散混集决定 2507.05724v2

Authors (3): Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

专家混合(MoE)结构已经从语言模型发展到自动语音识别(ASR)。传统的MOE方法,如开关变换器,是每个层内独立的路线专家。我们的分析表明,大多数层的路由器作出的专家选择与其他层路由器的选择没有很强的关联。为了增加不同层的专家之间的合作,鼓励更大的专业化,我们使用不同的部层的共用路由器。我们称之为模型Omni-router变换器。关于大型假标签数据集的广泛实验和对10个不同、不同区外的ASR基准的评估表明,Omni-router变换器能够降低培训损失,并持续超出密度和开关变换器模型,分别将平均单字错误率降低11.2%和8.2%,同时提供结构化的专家使用,提高多样性数据的稳健性。
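
The difference between per-layer routers and a shared router can be illustrated with a toy top-1 mixture-of-experts stack. This is a sketch under strong simplifications, not the Omni-router Transformer itself: the experts here are hypothetical scalar multipliers rather than feed-forward networks, there are no residual connections, and the router weights are hand-picked.

```python
def route(router, hidden):
    """Top-1 routing: pick the expert whose router row scores highest."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    return max(range(len(logits)), key=logits.__getitem__)


class MoELayer:
    def __init__(self, router, expert_scales):
        self.router = router                # may be the same object in every layer
        self.expert_scales = expert_scales  # toy experts: one scalar per expert

    def forward(self, hidden):
        expert = route(self.router, hidden)
        scale = self.expert_scales[expert]
        return [scale * h for h in hidden], expert


# One router object reused by every layer (the shared-router idea);
# a Switch-style baseline would give each layer its own router instead.
shared_router = [[1.0, 0.0], [0.0, 1.0]]
layers = [
    MoELayer(shared_router, [1.0, 2.0]),
    MoELayer(shared_router, [0.5, 1.5]),
]

hidden = [0.2, 0.9]
choices = []
for layer in layers:
    hidden, expert = layer.forward(hidden)
    choices.append(expert)
```

Because every layer scores tokens with the same parameters, routing decisions across layers are coupled by construction, which is the property the abstract contrasts with independently routed layers.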


Article 286

Title@2025-07-21 (1): Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

Title: Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track Lehren aus dem TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) Track TREC 生物医学摘要(PLABA)平语言适应(PLABA)轨道的经验教训 2507.14096v2

Authors (6): Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Dina Demner-Fushman

Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally written references. Submissions for both Tasks 1 and 2 received extensive manual evaluation by biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models ranging from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.

目标:语言模型方面的最新进展表明,有可能使专业生物医学文献适应普通语言,使病人和护理者能够使用这种语言;然而,由于不可预测性,加上这一领域伤害的可能性很大,因此有必要进行严格的评价;我们在这方面的目标是促进研究和对最有希望的系统进行高质量的评价;方法:我们在2023年和2024年的文本检索会议上主持了生物医学摘要(PLABA)纯语言适应轨道;任务包括使专业生物医学文献适应普通语言,使患者和护理者能够使用这种语言;以及查明和替换困难术语(Task.2);但是,为了对任务1进行自动评价,我们开发了一套四倍的专业化参考资料;为任务1和2个任务都提出了广泛的手工评价;结果:12个跨12个国家的小组参加了该轨道,模型从多层透视线到受过预先训练的巨型变异体。 在任务1的手工判断中,最优秀的模型与实际语言准确性和完整性相匹配,但并非简单或简洁性。 关于任务1的自动、基于参考的一套衡量标准,一般而言,基准衡量标准在使用成本判断中通常不精确性标准,但用硬性标准进行。


Article 287

Title@2025-07-21 (1): The Impact of Language Mixing on Bilingual LLM Reasoning

Title: The Impact of Language Mixing on Bilingual LLM Reasoning Die Auswirkungen des Sprachmixens auf die zweisprachige LLM-Reasoning 语言混合对双语LLM理由解释的影响 2507.15849v1

Authors (5): Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar

Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing, that is, alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, it increases accuracy by up to 6.25 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.

同样,最近以推理为重点的双语大型语言模式(LLMs)在两种语言中都具有很强的能力,在他们的思想链中展示了语言混合和交错的语言。在DeepSeek-R1中劝阻这种行为会降低准确性,表明语言混合可能有利于推理。在这项工作中,我们研究中文-英语双语推理模型的语言转换。我们确定,以可核查的奖励强化学习(RLVR)是导致语言混合的关键培训阶段。我们证明,语言混合可以强化推理:在数学推理任务中实施单语解码可以降低5.6个百分点的准确性。此外,可以培训一个轻量质的探测器,预测潜在语言转换是否有益或有害推理,在用于引导解码时,将精度提高到6.25个百分点。我们的研究结果表明,语言混合不仅仅是多语种培训的副产品,而是一种战略性推理行为。


Article 288

Title@2025-07-21 (1): A Survey of Context Engineering for Large Language Models

Title: A Survey of Context Engineering for Large Language Models Eine Übersicht über Kontext-Engineering für große Sprachmodelle 大语言模型背景工程调查 2507.13334v2

Authors (15): Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

大语言模型(LLMS)的性能基本取决于在推论期间提供的背景资料。本调查介绍了背景工程,这是一个正式的学科,它超越了简单的迅速设计,包括系统地优化LLMS的信息有效载荷。我们提出了一个综合分类学,将背景工程纳入其基本组成部分,以及将其纳入智能系统的复杂实施方法。我们首先研究基本组成部分:背景检索和生成、背景处理和背景管理。然后我们探讨这些组成部分如何在建筑上融合,以创造复杂的系统实施:检索-启动的一代(RAG)、记忆系统和工具集成推理以及多试剂系统。通过对1400多份研究论文的系统分析,我们的调查不仅为实地确定了技术路线图,而且还揭示了关键的研究差距:模型能力之间存在着根本的不对称性。在先进的背景工程的辅助下,现有模型在理解复杂背景方面表现出显著的熟练性,但在产生同样复杂、长期的产出方面显示出明显的局限性。解决这一差距是未来研究的一个确定优先事项。最后,这项调查为推进背景认识的研究人员和工程师提供了一个统一的框架。


Article 289

Title@2025-07-21 (1): Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work

Title: Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work Operationalisierung von KI für das Gute: Fokussierung auf Einsatz und Integration von KI-Modellen in humanitäre Arbeit 实施大赦国际促进良好:在人道主义工作中采用和整合大赦国际模式的焦点 2507.15823v1

Authors (6): Anton Abilov, Ke Zhang, Hemank Lamba, Elizabeth M. Olson, Joel R. Tetreault, Alejandro Jaimes

Publications in the AI for Good space have tended to focus on the research and model development that can support high-impact applications. However, very few AI for Good papers discuss the process of deploying and collaborating with the partner organization, and the resulting real-world impact. In this work, we share details about our close collaboration with a humanitarian-to-humanitarian (H2H) organization, describe not only how to deploy the AI model in a resource-constrained environment but also how to maintain it for continuous performance updates, and share key takeaways for practitioners.

AI促进良好空间协会的出版物往往侧重于能够支持高影响应用的研究和模型开发,然而,只有很少的AI促进好文件协会讨论与伙伴组织进行部署和合作的过程以及由此产生的现实世界影响,在这项工作中,我们分享了与人道主义至人道主义(H2H)组织密切合作的详细情况,以及如何不仅在资源紧张的环境中部署AI模式,而且如何保持该模式以不断更新业绩,并为从业人员分享关键外卖。


Article 290

Title@2025-07-21 (1): Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

Title: Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning Kleine LLMs lernen keine verallgemeinerbare Theorie des Geistes durch Verstärkungslernen 小型LLM无法通过强化学习习得可泛化的心智理论 2507.15788v1

Authors (2): Sneheel Sarangi, Hanan Salam

Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models “hacking” the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change in, or degradation of, performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.

大型语言模型(LLMS)的最近进展显示,大型语言模型(LLMS)在复杂的推理中表现出快速能力,这在很大程度上是由培训后应用的基于规则的加强学习(RL)技术(RLL)的刺激推动的,这就提出了类似方法能否在LLMS中注入更细微、人性化的社会智慧,如精神理论(TOM)等类似社会智慧的问题。本文调查了小型LMS能否通过具有可核查的奖励(RLVR)获得强大和普遍适用的托M能力。我们通过培训模型进行系统评价,将著名的托M数据集(HiToM,ExplorteTOM,FANTOM)的各种组合培训模型和对待发数据集(例如OpenToM)的全面化测试结合起来。我们的研究结果表明,小型LMS公司在开发通用的TOM能力方面挣扎。虽然在分配任务上的表现有所改进,但这种能力不能转移到具有不同特点的不可见的托M任务。我们证明,长期的训练导致“勾画”培训数据集的统计模式,在培训中产生显著的改变,但是在业绩的变变变化方面,而不是在真正的变造中表现出真正的变的精确的成绩上的行为。


Article 291

Title@2025-07-21 (1): DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Title: DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v3

Authors (4): Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

最近,大型语言模型(LLMS)已扩展到视频领域,使复杂的视频理解成为可能,然而,现有的视频LLMS往往在细微的时间推理方面表现出局限性,限制了它们精确地将反应归因于特定视频时刻的能力,特别是在受限制的监督下。我们引入了数据效率高的视频LLM(DaMO),这是一个数据效率高的视频LLM(LLMM),其设计明确是为了准确的时间推理和多式联运理解。就其核心而言,拟议的Temal-aware Fuseforth采用一个等级的双流结构,逐步捕捉每种模式内的时间动态,并有效地结合了补充的视觉和音频信息。为了进一步提高计算效率,DaMO整合了一种全球剩余,在保存基本语义细节的同时减少了空间冗余。我们通过一个结构化的四阶段渐进培训模式对DAMO进行培训,逐步为该模式配备了多式联运、语义定位和时间推理能力。这项工作还增加了多种数据集,而现有的数据由LMM生成的以时间为基础的QA配对需要时间监督的任务。关于时间定位和视频QA基准的全面实验表明DMO一贯超过以前的模型,特别是在要求精确时间调整和逻辑上。


Article 292

Title@2025-07-21 (1): Reservoir Computing as a Language Model

Title: Reservoir Computing as a Language Model Reservoir Computing als Sprachmodell 作为语言模式的 “ 储量计算 “ 模式 2507.15779v1

Authors (2): Felix Köster, Atsushi Uchida

Large Language Models (LLMs) have dominated the science and media landscape due to their impressive performance in processing large chunks of data and producing human-like text. Nevertheless, their huge energy demand and slow processing are still a bottleneck for further increasing quality while also making the models accessible to everyone. To address this bottleneck, we investigate how reservoir computing performs on natural text processing, which could enable fast and energy-efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling: two different reservoir computing approaches, where only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost and prediction accuracy for both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient, reducing training and inference time. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines to balance resource constraints with performance.

大型语言模型(LLM)在科学和媒体景观中占据了主导地位,在处理大量数据和制作人文文本方面的业绩令人印象深刻。然而,其巨大的能源需求和缓慢的处理仍然是进一步提高质量的瓶颈,同时也使每个人都可以使用模型。为了解决这一瓶颈,我们将调查储油层计算如何在自然文本处理中发挥作用,这可以快速和节能地实施硬件。研究储油层计算作为一种语言模型的使用仍然很少。在本文件中,我们比较了三个不同的性格语言模型使用方法:两种不同的储油层计算方法,即只有产出层可以培训的两种不同的储油层计算方法,以及众所周知的基于变压器的结构,这些结构充分学习基于关注的顺序代表。我们探索这两种模式的性能、计算成本和预测准确性,通过对所有模型的可培训参数数量进行同样的差异。我们用所有三种方法一致的管道证明变压器在预测质量方面是优秀的,而储油层计算机仍然非常高效地减少培训和推导速度。此外,我们调查了两种储油层计算方法:一种传统的储油层储油层储层储油层和固定的直线式结构,我们通过感重的阅读了它们如何调整了它们。


Article 293

Title@2025-07-21 (1): Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Title: Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR Stabilisierung von Wissen, Förderung von Vernunft: Dual-Token-Beschränkungen für RLVR 稳定知识,促进推理:RLVR的双令牌约束 2507.15778v1

Authors (5): Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.

以可核查的奖励加强学习(RLVR)已成为提高大语言模型推理能力的有效培训后方法,主要是通过形成高阶行为,例如反射和规划,提高大语言模型的推理能力。然而,以前的RLVR算法经常对所有象征性品应用统一的培训信号,而没有考虑到与知识相关的低渗透性知识象征物和高渗透性推理象征物的不同作用。最近的一些方法试图通过梯度遮盖或无节制更新来区分这些象征性类型,但这些方法可能会打破模型输出中的语义依赖性,妨碍有效学习。在这项工作中,我们提议Archer,即使用双向限制和同步更新的英特普-瓦雷RLVR方法。具体地说,我们的方法采用较弱的KLL规范化和更高的剪报阈值来推理符号,同时对知识象征物使用更严格的限制来维持事实知识。若干数学推理和代码生成基准的实验结果显示,我们的方法大大超越了先前的RLVR方法,达到或超过州-Rub-III的状态。
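The dual-token idea can be illustrated with a per-token PPO-style loss whose clip range and KL weight depend on token entropy. This is a minimal sketch under stated assumptions, not the Archer code: the entropy threshold, the clip ranges, the KL coefficients, and the function names are hypothetical values picked for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dual_token_loss(ratio, advantage, kl, entropy,
                    entropy_threshold=1.0,                   # hypothetical cut-off
                    clip_reasoning=0.4, clip_knowledge=0.2,  # hypothetical clip ranges
                    kl_reasoning=0.0, kl_knowledge=0.01):    # hypothetical KL weights
    """Clipped surrogate loss for one token with entropy-dependent constraints.

    High-entropy ("reasoning") tokens get a wider clip range and a weaker KL
    penalty; low-entropy ("knowledge") tokens get tighter constraints.
    """
    is_reasoning = entropy >= entropy_threshold
    eps = clip_reasoning if is_reasoning else clip_knowledge
    beta = kl_reasoning if is_reasoning else kl_knowledge
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + beta * kl
```

Both token types contribute to the same loss in the same update step, which is one way to read "synchronous updates" as opposed to masking one group out.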


Article 294

Title@2025-07-21 (1): Dissociating model architectures from inference computations

Title: Dissociating model architectures from inference computations Trennen von Modellarchitekturen von Inferenzberechnungen 将模型结构与推断计算分离 2507.15776v1

Authors (2): Noor Sajid, Johan Medrano

Parr et al. (2025) examine how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that deep temporal computations are mimicked by autoregressive models by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.

Parr等人, 2025年, 2025年, 审视了在如何对待非马尔科维安序列建模方面,自动递减和深时位模型如何不同。 在此基础上, 我们强调需要分离模型结构, 即预测分布因子, 如何从推论引用的计算中分离出来。 我们证明, 深时位计算被自动递减模型模拟模拟, 在迭代推论期间, 通过迭代推论期间对背景存取进行结构化。 我们使用受过次位预测培训的变压器, 显示迭代推论期间诱导的等级时间因子化保持了预测能力, 而即同步计算数量较少。 这强调, 构建和提炼预测的过程不一定与其基本模型结构相连接。


Article 295

Title@2025-07-21 (1): KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education?

Title: KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education? KnowShiftQA: Wie robust sind RAG-Systeme, wenn Textbook Knowledge Shifts in K-12 Education? K-12教育中教科书知识转移时RAG系统如何强大? 2412.08985v4

Authors (5): Tianshi Zheng, Weihan Li, Jiaxin Bai, Weiqi Wang, Yangqiu Song

Retrieval-Augmented Generation (RAG) systems show remarkable potential as question answering tools in the K-12 Education domain, where knowledge is typically queried within the restricted scope of authoritative textbooks. However, discrepancies between these textbooks and the parametric knowledge inherent in Large Language Models (LLMs) can undermine the effectiveness of RAG systems. To systematically investigate RAG system robustness against such knowledge discrepancies, we introduce KnowShiftQA. This novel question answering dataset simulates these discrepancies by applying deliberate hypothetical knowledge updates to both answers and source documents, reflecting how textbook knowledge can shift. KnowShiftQA comprises 3,005 questions across five subjects, designed with a comprehensive question typology focusing on context utilization and knowledge integration. Our extensive experiments on retrieval and question answering performance reveal that most RAG systems suffer a substantial performance drop when faced with these knowledge discrepancies. Furthermore, questions requiring the integration of contextual (textbook) knowledge with parametric (LLM) knowledge pose a significant challenge to current LLMs.

在K-12教育领域,知识通常在权威教科书的有限范围内被问及,但这些教科书与大语言模型(LLMs)所固有的参数知识之间的差异会损害RAG系统的效力。为了系统调查RAG系统是否稳健应对这种知识差异,我们引入了KnowshiftQA。这个新颖的回答问题模拟了这些差异,对答案和源文件都应用了有意的假设知识更新,反映了教科书知识的转移。知识ShiftQA包含五个科目的3 005个问题,其设计是全面的问题类型,侧重于背景利用和知识整合。我们在检索和回答问题方面的广泛实验显示,在面对这些知识差异时,大多数RAG系统在业绩上都有很大的下降。此外,需要将背景(教科书)知识与参数知识结合起来的问题对目前的LLMs提出了重大挑战。


Article 296

Title@2025-07-21 (1): Interaction as Intelligence: Deep Research With Human-AI Partnership

Title: Interaction as Intelligence: Deep Research With Human-AI Partnership Interaktion als Intelligenz: Tiefe Forschung mit Mensch-KI-Partnerschaft 作为情报的互动:与人类 – – AI伙伴关系的深入研究 2507.15759v1

Authors (26): Lyumanshan Ye, Xiaojie Cai, Xinkai Wang, Junfei Wang, Xiangkun Hu, Jiadi Su, Yang Nan, Sihan Wang, Bohan Zhang, Xiaoze Fan, Jinbin Luo, Yuxiang Zheng, Tianze Xu, Dayuan Fu, Yunze Wu, Pengrui Lu, Zengzhi Wang, Yiwei Qin, Zhen Huang, Yan Ma, Zhulin Hu, Haoyang Zou, Tiantian Mi, Yixin Ye, Ethan Chern, Pengfei Liu

This paper introduces the “Interaction as Intelligence” research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities, a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems engage in extended thinking processes for research tasks, meaningful interaction transitions from an optional enhancement to an essential component of effective intelligence. Current deep research systems adopt an “input-wait-output” paradigm where users initiate queries and receive results after black-box processing. This approach leads to error cascade effects, inflexible research boundaries that prevent question refinement during investigation, and missed opportunities for expertise integration. To address these limitations, we introduce Deep Cognition, a system that transforms the human role from giving instructions to cognitive oversight, a mode of engagement where humans guide AI thinking processes through strategic intervention at critical junctures. Deep Cognition implements three key innovations: (1) transparent, controllable, and interruptible interaction that reveals AI reasoning and enables intervention at any point; (2) fine-grained bidirectional dialogue; and (3) shared cognitive context where the system observes and adapts to user behaviors without explicit instruction. User evaluation demonstrates that this cognitive oversight paradigm outperforms the strongest baseline across six key metrics: Transparency (+20.0%), Fine-Grained Interaction (+29.2%), Real-Time Intervention (+18.5%), Ease of Collaboration (+27.7%), Results-Worth-Effort (+8.8%), and Interruptibility (+20.7%). Evaluations on challenging research problems show improvements of 31.8% to 50.0% points over deep research systems.

本文介绍“ 互动作为情报” 研究系列, 在深层研究任务中将人类- AI 关系重新概念化。 传统方法将互动仅仅作为获取 AI 能力- 人类意图和机器输出之间管道的界面。 我们提议, 互动本身是智能的一个基本层面。 当AI 系统为研究任务进行扩展思维过程时, 有意义的互动从可选的增强到有效情报的基本组成部分。 当前深层研究系统采用“ 投入-等待-输出20” 模式, 用户在黑盒处理后启动查询和接收结果。 这种方法导致错误连锁效应、 无法在调查期间防止问题改进的僵硬性研究界限以及错过专业知识整合机会。 为了应对这些限制, 我们引入深 Comp Comption, 这个系统将人的角色从向认知性监督- 一种接触模式转变, 在关键关口中, 人类通过战略干预指导AI 思考过程。 深层的认知实施三大创新:(1) 透明、可控制、可中断的相互作用,在任何地点显示AI 推理学和干预; (2) 精确的双向8. 8 进行精确的精确性调查, 在六种用户基线对话中, 显示 进行 的自我评估:


Article 297

Title@2025-07-21 (1): LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Title: LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization LAPO: Internalisierung der Effizienz durch Längen-Anpassungspolitik-Optimierung LAPO:通过延长期限政策优化实现内部合理性效率 2507.15758v1

Authors (10): Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang

Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model’s reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.

大型推理模型通过扩大的思维链序列取得了显著的绩效,然而,这种计算自由导致过度的象征性生成,甚至对于简单的问题也是如此。我们展示了长成政策优化(LAPO),这是一个将推理长度控制从外部制约转变为内在模型能力的新框架。 与目前实行僵硬限制或依赖休克后干预的方法不同,LAPO使模型能够通过一个两阶段强化学习过程使对适当推理深度的理解内在化。 在第一阶段,模型通过发现成功解决方案长度的统计分布来学习自然推理模式。 第二阶段利用这些模式作为元分化指导,直接将其嵌入模型的推理环境,以确保推论灵活性。 数学推理基准实验表明,LAPO在提高准确度的同时,将代用量减少40.9。 我们的分析表明,与LAPO培训的模型发展了根据问题复杂性分配计算资源的新兴能力,在不牺牲质量的情况下实现高效推理。
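The two stages described above can be sketched roughly as follows: first summarise the lengths of correct solutions for a problem, then place that statistic in the reasoning context as a soft hint. This is only an illustration under assumptions. The choice of the median as the statistic, the wording of the hint, and both function names are hypothetical and are not taken from the LAPO paper.

```python
import statistics

def successful_length_stats(rollouts):
    """Summarise token lengths of correct solutions for one problem.

    `rollouts` is a list of (num_tokens, is_correct) pairs.
    Returns None when no rollout was correct.
    """
    lengths = sorted(n for n, ok in rollouts if ok)
    if not lengths:
        return None
    return {"median": statistics.median(lengths), "min": lengths[0], "max": lengths[-1]}

def prompt_with_length_hint(question, stats):
    """Embed the length statistic in the reasoning context as a soft hint."""
    if stats is None:
        return question
    hint = f"I will answer this in about {int(stats['median'])} tokens."
    return f"{question}\n{hint}"

# Hypothetical rollouts for a single problem: (token count, solved correctly?)
rollouts = [(120, True), (300, False), (80, True), (100, True)]
stats = successful_length_stats(rollouts)
prompt = prompt_with_length_hint("What is 17 * 23?", stats)
```

Because the hint lives in the context rather than in a hard truncation rule, the model remains free to exceed it at inference time, which matches the flexibility the abstract emphasises.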


Article 298

Title@2025-07-21 (1): DialogueForge: LLM Simulation of Human-Chatbot Dialogue

Title: DialogueForge: LLM Simulation of Human-Chatbot Dialogue DialogueForge: LLM Simulation des Mensch-Chatbot-Dialogs “对话论坛:模拟人类与哈特波特对话的LLMLM 2507.15752v1

Authors (7): Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Malgorzata Lazuka, Elliott Ash

Collecting human-chatbot dialogues is time-consuming and typically demands substantial manual effort, which limits research on conversational AI. In this work, we propose DialogueForge, a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce indistinguishable human-like dialogues. We evaluate the quality of the simulated conversations and compare different models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by employing supervised fine-tuning techniques. Nevertheless, maintaining coherent and natural long-form human-like dialogues remains a common challenge across all models.

收集人类聊天室对话通常需要大量手工劳动,而且耗时费时,这限制了对对话AI的研究,也给这方面的研究带来了挑战。在这项工作中,我们提议“对话论坛”——一个以人聊天室风格生成AI模拟对话的框架。为启动每次生成的对话,“对话论坛”使用从真正的人类聊天室互动中提取的种子提示。我们测试各种LLMS模拟人类聊天室用户,从最先进的专利模型到小型开放源LMS,并产生适合具体任务的多方向对话。此外,我们探索微调技术,以提高小型模型生成不可分化的人类对话的能力。我们评估模拟对话的质量,并用UniEval和GTEval评估协议比较不同的模型。我们的实验表明,大型专利模型(例如GPT-4o)一般比其他模型更接近产生更现实的对话,而小型的开放源模型(例如Llama、Mistral)则提供更具有前景的模拟,同时采用更稳定的自然定制技术。我们用更强的自然定制的方法来改进人类对话。


Article 299

Title@2025-07-21 (1): Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

Title: Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models Steuerung in neue Einbettungsräume: Analyse der Cross-Lingual Alignment Induziert durch Modellinterventionen in mehrsprachigen Sprachmodellen 指导进入新嵌入空间:分析多语文模式示范干预措施所引出的不同语言之间的横向一致 2502.15639v2

Authors (6): Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina

Aligned representations across languages are a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically, alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions, a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM’s activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

在多种语言的大语言模型(MLLMs)中,不同语言的一致表述是一种理想的属性,因为调整可以改善跨语言任务的业绩。一般情况下,调整需要微调一个计算成本昂贵的模型,并需要大量的语言数据,而这种模型往往可能无法提供。数据效率高的替代微调是模型干预 – – 一种操纵模型激活的方法,以引导生成达到理想的方向。我们分析了大众干预(调查专家)对跨语言代表在 mLLMs中的一致性的影响。我们确定了为某种特定语言而操作的神经元,并对 mLLMs 预先和事后的嵌入空间进行反省。我们表明,修改MLLMM的激活改变了其嵌入空间,从而强化了跨语言的对接。此外,我们表明,对嵌入空间的改变转化为在检索任务的下游业绩的改进,在跨语言检索上,最高至最高至上一级精确度提高2倍。


Article 300

Title@2025-07-21 (1): Where Do People Tell Stories Online? Story Detection Across Online Communities

Title: Where Do People Tell Stories Online? Story Detection Across Online Communities Wo erzählen Menschen Geschichten online? Story Detection Across Online Communities 《人们在哪里在网上讲述故事? 2311.09675v4

Authors (5): Maria Antoniak, Joel Mire, Maarten Sap, Elliott Ash, Andrew Piper

Story detection in online communities is a challenging task as stories are scattered across communities and interwoven with non-storytelling spans within a single text. We address this challenge by building and releasing the StorySeeker toolkit, including a richly annotated dataset of 502 Reddit posts and comments, a detailed codebook adapted to the social media context, and models to predict storytelling at the document and span levels. Our dataset is sampled from hundreds of popular English-language Reddit communities ranging across 33 topic categories, and it contains fine-grained expert annotations, including binary story labels, story spans, and event spans. We evaluate a range of detection methods using our data, and we identify the distinctive textual features of online storytelling, focusing on storytelling spans. We illuminate distributional characteristics of storytelling on a large community-centric social media platform, and we also conduct a case study on r/ChangeMyView, where storytelling is used as one of many persuasive strategies, illustrating that our data and models can be used for both inter- and intra-community research. Finally, we discuss implications of our tools and analyses for narratology and the study of online communities.



Article 301

Title@2025-07-21 (1): Towards physician-centered oversight of conversational diagnostic AI

Title: Towards physician-centered oversight of conversational diagnostic AI Auf dem Weg zur ärztlichen Aufsicht über gesprächsdiagnostische KI 致力于以医生为中心对谈话诊断进行监督 AI 2507.15743v1

Authors (35): Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, Cían Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joëlle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, David Stutz

Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians’ capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.



Article 302

Title@2025-07-21 (1): A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Title: A Fisher’s exact test justification of the TF-IDF term-weighting scheme Genaue Begründung des TF-IDF-Term-Wichtungssystems durch einen Fisher A Fisher公司对TF-IDF术语加权办法的精确测试理由 2507.15742v1

Authors (3): Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.
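The connection described above can be explored numerically with a small sketch. It computes the one-tailed Fisher's exact test p-value (the upper tail of a hypergeometric distribution) for a term's count in one document against its count in the whole collection, alongside a TF-ICF score. The TF-ICF form used here (term count times the log of total tokens over the term's collection count) is an assumption about the variant meant, and the counts are made up for illustration. The sketch does not claim the two quantities are numerically equal; the abstract states only that they are closely related under regularity conditions.

```python
import math

def fisher_one_tailed_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N tokens total, K term tokens, n tokens drawn).

    k: term occurrences in the document, n: document length,
    K: term occurrences in the collection, N: collection length.
    """
    total = math.comb(N, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total

def tf_icf(k, K, N):
    """Assumed TF-ICF form: term count in the document times log(N / K)."""
    return k * math.log(N / K)

# Hypothetical counts: a 200-token document in a 10,000-token collection,
# where the term occurs 10 times in the document and 50 times overall.
k, n, K, N = 10, 200, 50, 10_000
neg_log_p = -math.log(fisher_one_tailed_p(k, n, K, N))
score = tf_icf(k, K, N)
```

Exact integer binomials keep the tail sum free of rounding error for counts of this size; for very large collections the p-value can underflow to zero, and a log-space computation would be needed instead.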



Article 303

Title@2025-07-21 (1): Understanding Large Language Models’ Ability on Interdisciplinary Research

Title: Understanding Large Language Models’ Ability on Interdisciplinary Research Verständnis der Fähigkeit von großen Sprachmodellen zur interdisziplinären Forschung 了解关于跨学科研究的大型语言模型能力 2507.15736v1

Authors (6): Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal, Ali Asad, Hongyu Guo, Xiaodan Zhu

Recent advancements in Large Language Models (LLMs) have revealed their impressive ability to perform multi-step, logic-driven reasoning across complex domains, positioning them as powerful tools and collaborators in scientific discovery while challenging the long-held view that inspiration-driven ideation is uniquely human. However, the lack of a dedicated benchmark that evaluates LLMs’ ability to develop ideas in Interdisciplinary Research (IDR) settings poses a critical barrier to fully understanding their strengths and limitations. To address this gap, we introduce IDRBench, a pioneering benchmark featuring an expert-annotated dataset and a suite of tasks tailored to evaluate LLMs’ capabilities in proposing valuable research ideas from different scientific domains for interdisciplinary research. This benchmark aims to provide a systematic framework for assessing LLM performance in complex, cross-domain scientific research. Our dataset consists of scientific publications sourced from the arXiv platform covering six distinct disciplines, and is annotated by domain experts with diverse academic backgrounds. To ensure high-quality annotations, we emphasize clearly defined dimensions that characterize authentic interdisciplinary research. The design of evaluation tasks in IDRBench follows a progressive, real-world perspective, reflecting the natural stages of interdisciplinary research development, including 1) IDR Paper Identification, 2) IDR Idea Integration, and 3) IDR Idea Recommendation. Using IDRBench, we construct baselines across 10 LLMs and observe that despite fostering some level of IDR awareness, LLMs still struggle to produce quality IDR ideas. These findings could not only spark new research directions, but also help to develop next-generation LLMs that excel in interdisciplinary research.



Article 304

Title@2025-07-21 (1): BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Title: BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning Benchmarking LLMs für Ophthalmologie (BELO) für ophthalmologisches Wissen und Vernunft 眼科大语言模型基准(BELO):眼科知识与推理 2507.15717v1

Authors (32): Sahana Srinivasan, Xuguang Ai, Thaddaeus Wai Soon Lo, Aidan Gilson, Minjie Zou, Ke Zou, Hyunjae Kim, Mingjia Yang, Krithi Pushpanathan, Samantha Yew, Wan Ting Loke, Jocelyn Goh, Yibing Chen, Yiming Kong, Emily Yuelei Fu, Michelle Ongyong Hui, Kristen Nwanyanwu, Amisha Dave, Kelvin Zhenghao Li, Chen-Hsin Sun, Mark Chia, Gabriel Dawei Yang, Wendy Meihua Wong, David Ziyou Chen, Dianbo Liu, Maxwell Singer, Fares Antaki, Lucian V Del Priore, Jost Jonas, Ron Adelman, Qingyu Chen, Yih-Chung Tham

Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ’s correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO’s utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
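For readers unfamiliar with the two classification metrics named above, the following sketch shows how accuracy and macro-F1 are typically computed on multiple-choice answers. It is a generic illustration, not the BELO evaluation code, and the gold and predicted answer letters are invented.

```python
def accuracy(gold, pred):
    """Fraction of questions where the predicted option matches the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Invented answers for four multiple-choice questions.
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "C", "C"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Macro-F1 weights every answer option equally, so it penalises a model that does well only on the most frequent option, which accuracy alone can hide.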



Article 305

Title@2025-07-21 (1): From Queries to Criteria: Understanding How Astronomers Evaluate LLMs

Title: From Queries to Criteria: Understanding How Astronomers Evaluate LLMs Von Fragen zu Kriterien: Wie Astronomen LLMs bewerten 从询问到标准:了解天文学家如何评价LLMs 2507.15715v1

Authors (10): Alina Hyk, Kiera McCormick, Mian Zhong, Ioana Ciucă, Sanjib Sharma, John F Wu, J. E. G. Peek, Kartheik G. Iyer, Ziang Xiao, Anjalie Field

There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for evaluating LLMs for astronomy. Overall, our work offers ways to improve LLM evaluation and ultimately usability, particularly for use in scientific research.



Article 306

Title@2025-07-21 (1): Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning

Title: Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning Chinchunmei bei SemEval-2025 Aufgabe 11: Erhöht die Fähigkeit des großen Sprachmodells zur Wahrnehmung von Emotionen durch kontrastives Lernen Chinchunmei在SemEval-2025任务11:利用差异学习促进大语言模式情感感知能力 2507.15714v1

Authors (3): Tian Li, Yujian Sun, Huizhi Liang

The SemEval-2025 Task 11, Bridging the Gap in Text-Based Emotion Detection, introduces an emotion recognition challenge spanning over 28 languages. This competition encourages researchers to explore more advanced approaches to address the challenges posed by the diversity of emotional expressions and background variations. It features two tracks: multi-label classification (Track A) and emotion intensity prediction (Track B), covering six emotion categories: anger, fear, joy, sadness, surprise, and disgust. In our work, we systematically explore the benefits of two contrastive learning approaches: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO) contrastive learning. The sample-based contrastive approach trains the model by comparing two samples to generate more reliable predictions. The generation-based contrastive approach trains the model to differentiate between correct and incorrect generations, refining its prediction. All models are fine-tuned from LLaMa3-Instruct-8B. Our system achieves 9th place in Track A and 6th place in Track B for English, while ranking among the top-tier performing systems for other languages.
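The generation-based objectives mentioned above, DPO and SimPO, both score a preferred output against a rejected one. A minimal sketch of the two per-example losses in their standard published forms follows. The inputs are summed sequence log-probabilities, and the `beta` and `gamma` defaults are illustrative assumptions, not the settings used by this system.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO: compare policy-vs-reference log-ratios of chosen (w) and rejected (l)."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)

def simpo_loss(pol_w, pol_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: reference-free, length-normalised log-probs with a target margin gamma."""
    margin = beta * (pol_w / len_w - pol_l / len_l) - gamma
    return -log_sigmoid(margin)
```

DPO needs a frozen reference model to supply `ref_w` and `ref_l`, while SimPO drops the reference and instead normalises by output length, which roughly halves the forward passes per training pair.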



Article 307

Title@2025-07-21 (1): Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents

Title: Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents Groß gewinnen mit kleinen Modellen: Wissensdestillation vs. Selbsttraining zur Reduktion der Halluzination in Produkt-QA-Agenten 以小型模型赢得大奖:知识蒸馏与减少产品质量保证剂中幻觉的自我培训 2502.19545v2

Authors (6): Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models’ outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized “I don’t know” responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.



Article 308

Title@2025-07-21 (1): Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Title: Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked? Wird die Leistungsfähigkeit eines großen Sprachmodells bei mit Gründen versehenen Aufgaben durch verschiedene Wege beeinflusst Fragen werden gestellt? 问到不同方式的问题是否影响到大语言解释任务示范业绩? 2507.15707v1

Authors (4): Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.



Article 309

Title@2025-07-21 (1): Compositional Understanding in Signaling Games

Title: Compositional Understanding in Signaling Games Kompositionales Verständnis bei Signalspielen 信号运动会的组成理解 2507.15706v1

Authors (1): David Peter Wallis Freeborn

Receivers in standard signaling game models struggle with learning compositional information. Even when the signalers send compositional messages, the receivers do not interpret them compositionally. When information from one message component is lost or forgotten, the information from other components is also erased. In this paper I construct signaling game models in which genuine compositional understanding evolves. I present two new models: a minimalist receiver who only learns from the atomic messages of a signal, and a generalist receiver who learns from all of the available information. These models are in many ways simpler than previous alternatives, and allow the receivers to learn from the atomic components of messages.



Article 310

Title@2025-07-21 (1): CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

Title: CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models CoLD: Counterfactually-Führungslängen-Debiasing für Prozess-Reward-Modelle CoLD: 反事实引导进程奖励模型的长度偏差 2507.15698v1

Authors (7): Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Yong Yu, Weinan Zhang, Mengyue Yang

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
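
As a rough illustration of the first component only (the explicit length-penalty adjustment), the sketch below subtracts a penalty proportional to step length from a raw step score and compares the reward-length correlation before and after. The linear form, the name `alpha`, and the toy numbers are assumptions made for illustration; the abstract does not specify the actual functional form used by CoLD.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def length_penalized(scores, lengths, alpha):
    """Subtract a linear length penalty (assumed form) from raw step scores."""
    return [s - alpha * n for s, n in zip(scores, lengths)]

# Toy data (made up): raw PRM scores that rise with step length in tokens.
lengths = [10, 20, 30, 40, 50]
raw_scores = [0.52, 0.61, 0.66, 0.78, 0.85]

# alpha is a hypothetical hyperparameter chosen by hand for this toy data.
adjusted = length_penalized(raw_scores, lengths, alpha=0.008)

print("reward-length correlation, raw:     ", pearson(lengths, raw_scores))
print("reward-length correlation, adjusted:", pearson(lengths, adjusted))
```

A fixed penalty like this only removes a linear trend, which is presumably why the abstract pairs it with a learned bias estimator and a joint training objective.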



Article 311

Title@2025-07-21 (1): Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

Title: Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language Verbesserung der natürlichen Sprachinferenzleistung mit Wissensdiagramm für COVID-19 Automatisiertes Fact-Checking in indonesischer Sprache 以印度尼西亚语自动进行事实调查的COVID-19 自动调查印度尼西亚语知识图,提高自然语言引文性能 2409.00061v2

Authors (2): Arief Purnama Muharram, Ayu Purwarianti

Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.



Article 312

Title@2025-07-21 (1): Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Title: Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems Ausführbare Funktionsabstractions: Ausleiten von Generativen Programmen für fortgeschrittene Math-Probleme 可执行的功能性抽象:为高级数学问题推导产生方案 2504.09763v2

Authors (5): Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal

Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems by developing EFAGen, which operationalizes the task of automatically inferring an EFA for a given seed problem and solution as a program synthesis task. We first formalize the properties of any valid EFA as executable unit tests. Using execution feedback from the unit tests, we search over candidate programs sampled from an LLM to find EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. We then apply the tests as a reward signal, training LLMs to become better writers of EFAs. We show that EFAs inferred by EFAGen are faithful to the seed problems and produce learnable problem variations, and that EFAGen can infer EFAs across diverse sources of competition-level math problems. Finally, we show uses of model-written EFAs, e.g., finding harder/easier problem variants, as well as data generation.



Article 313

Title@2025-07-21 (1): P3: Prompts Promote Prompting

Title: P3: Prompts Promote Prompting P3: Prompts fördern Prompting P3: 推动推动推动 2507.15675v1

Authors (4): Xinyu Zhang, Yuanquan Hu, Fangchao Liu, Zhicheng Dou

Current large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts, to guide model behaviors. While recent advancements have demonstrated the efficacy of automatically optimizing either the system or user prompt to boost performance, such unilateral approaches often yield suboptimal outcomes due to the interdependent nature of these components. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts through an iterative process. The offline optimized prompts are further leveraged to promote online prompting by performing query-dependent prompt optimization. Extensive experiments on general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA) demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization. Our results highlight the effectiveness of a holistic optimization strategy in enhancing LLM performance across diverse domains.



Article 314

Title@2025-07-21 (1): Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Title: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains Aufmerksamkeit bei Markov: Ein Rahmen für die grundsätzliche Analyse von Transformatoren über Markov Ketten 注意Markov:通过Markov 链条对变形器进行原则分析的框架 2402.04161v2

Authors (7): Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

Attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple kernel in practice remains a curious phenomenon. To explain this contrasting behavior of single-layer models, in this paper we introduce a new framework for a principled analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. Finally, we outline several open problems in this arena. Code is available at https://github.com/Bond1995/Markov .
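
To make the gap between the unigram and bigram solutions concrete, here is a small sketch for a binary first-order Markov chain. It computes the expected next-token cross-entropy of the best context-free (unigram) predictor and of the predictor that uses the true kernel (bigram). The parameter names `p` (probability of moving 0 → 1) and `q` (probability of moving 1 → 0) are assumptions for this example; this is standard Markov-chain arithmetic, not the paper's loss-landscape analysis.

```python
import math

def binary_entropy(x):
    """Entropy in nats of a Bernoulli(x) variable."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def unigram_loss(p, q):
    """Loss of predicting the stationary distribution regardless of context."""
    pi1 = p / (p + q)  # stationary probability of state 1
    return binary_entropy(pi1)

def bigram_loss(p, q):
    """Loss of predicting with the true kernel (the chain's entropy rate)."""
    pi0, pi1 = q / (p + q), p / (p + q)
    return pi0 * binary_entropy(p) + pi1 * binary_entropy(q)

# A chain with memory: the bigram predictor is strictly better.
print(unigram_loss(0.2, 0.3), bigram_loss(0.2, 0.3))

# When p + q = 1 the chain is i.i.d., so both predictors coincide.
print(unigram_loss(0.3, 0.7), bigram_loss(0.3, 0.7))
```

The difference between the two losses is the penalty a model pays if it stays at the unigram solution, and it vanishes exactly when the previous token carries no information about the next one.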



Article 315

Title@2025-07-21 (1): Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

Title: Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark Tokenization Standards for Linguistic Integrity: Türkisch als Benchmark 语言完整性的接受标准:土耳其作为基准 2502.07057v2

Authors (6): M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım

Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models’ (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity. These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.
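
A minimal sketch of how two of these metrics could be computed, assuming each is simply the share of produced tokens found in a reference lexicon. The function name `percent_in`, the toy token list, and the two tiny lexicons are hypothetical; the abstract does not give the exact definitions or the morphological resources the authors use.

```python
def percent_in(tokens, lexicon):
    """Percentage of tokens that appear in the given lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return 100.0 * hits / len(tokens)

# Toy tokenizer output (made up); tokens are assumed to be lowercased already.
tokens = ["ev", "ler", "kitap", "xq", "de", "zzt"]

# Hypothetical reference resources for this example only.
valid_turkish_words = {"ev", "kitap", "de"}        # stand-in for a word list
roots_and_morphemes = {"ev", "kitap", "ler", "de"}  # stand-in for a morphological lexicon

pct_tr = percent_in(tokens, valid_turkish_words)    # language-specific token percentage
pct_pure = percent_in(tokens, roots_and_morphemes)  # token purity

print(f"%TR = {pct_tr:.1f}, %Pure = {pct_pure:.1f}")
```

In this toy case the suffix `ler` counts toward purity but not toward the word-level percentage, which is the kind of distinction the two metrics are meant to separate.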



Article 316

Title@2025-07-21 (1): Leveraging Context for Multimodal Fallacy Classification in Political Debates

Title: Leveraging Context for Multimodal Fallacy Classification in Political Debates Nutzung des Kontexts für multimodale Fehlerklassifizierung in politischen Debatten 在政治辩论中利用多模式误差分类背景 2507.15641v1

Authors (1): Alessio Pittiglio

In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.
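
For readers unfamiliar with the reported metric, the sketch below computes a macro-averaged F1 score: per-class F1 values are averaged with equal weight, so rare classes count as much as frequent ones. The class labels `A`, `B`, `C` and the predictions are made up for illustration and are not taken from the shared task.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Toy labels (made up): class C is rare and never predicted.
y_true = ["A", "A", "B", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "B", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
```

Here the accuracy is well above the macro F1 because the missed rare class pulls the average down, which is typical for imbalanced classification tasks.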



Article 317

Title@2025-07-21 (1): Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Title: Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training Data Mixing Agent: Erlernen von Re-Gewicht Domains für kontinuierliches Pre-Training 数据混合代理: 学习为连续培训前学习重新加权域域 2507.15640v1

Authors (7): Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang

Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents’ well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.



Article 318

Title@2025-07-21 (1): Preventing Rogue Agents Improves Multi-Agent Collaboration

Title: Preventing Rogue Agents Improves Multi-Agent Collaboration Verhindern von Rogue-Agenten verbessert Multi-Agenten-Kollaboration B. 改进多机构协作 2502.05986v2

Authors (3): Ohav Barbi, Ori Yoran, Mor Geva

Multi-agent systems, where specialized agents collaborate to solve a shared task, hold great potential, from increased modularity to simulating complex environments. However, they also have a major caveat: a single agent can cause the entire system to fail. Consider a simple game where the knowledge to solve the task is distributed between agents, which share information in a communication channel. At each round, any of the agents can terminate the game and make the final prediction, even if they are uncertain about the outcome of their action. Detection of such rogue agents before they act may prevent the system’s failure. In this work, we propose to monitor agents during action prediction and intervene when a future error is likely to occur. To test our approach, we introduce WhoDunitEnv, a multi-agent collaboration environment that allows modular control over task complexity and communication structure. Experiments on WhoDunitEnv, code generation tasks and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains up to 17.4%, 2.5% and 20%, respectively. Thorough analysis shows that our monitors successfully identify critical points of agent confusion and our interventions effectively stop agent errors from propagating.



Article 319

Title@2025-07-21 (1): Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis

Title: Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis Effiziente Angriffsuntersuchung mittels Human-in-the-Loop Sicherheitsanalyse ermöglichen 通过 “ 现场人 “ 系统安全分析,促进高效袭击调查 2211.05403v3

Authors (4): Saimon Amanuel Tsegai, Xinyu Yang, Haoyuan Liu, Peng Gao

System auditing is a vital technique for collecting system call events as system provenance and investigating complex multi-step attacks such as Advanced Persistent Threats. However, existing attack investigation methods struggle to uncover long attack sequences due to the massive volume of system provenance data and their inability to focus on attack-relevant parts. In this paper, we present Provexa, a defense system that enables human analysts to effectively analyze large-scale system provenance to reveal multi-step attack sequences. Provexa introduces an expressive domain-specific language, ProvQL, that offers essential primitives for various types of attack analyses (e.g., attack pattern search, attack dependency tracking) with user-defined constraints, enabling analysts to focus on attack-relevant parts and iteratively sift through the large provenance data. Moreover, Provexa provides an optimized execution engine for efficient language execution. Our extensive evaluations on a wide range of attack scenarios demonstrate the practical effectiveness of Provexa in facilitating timely attack investigation.



Article 320

Title@2025-07-21 (1): CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization

Title: CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization CCSBench: Bewertung der kompositorischen Kontrollierbarkeit in LLMs für wissenschaftliche Dokumentzusammenfassung CCSBENCH:评估科学文件摘要中LLMs中的组成可控性 2410.12601v2

Authors (6): Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan

To broaden the dissemination of scientific knowledge to diverse audiences, it is desirable for scientific document summarization systems to simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, the first evaluation benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., conceptual or empirical focus), which are more subjective and abstract. We conduct extensive experiments using various large language models (LLMs) under various settings, including in-context learning, parameter-efficient fine-tuning, and two-stage modular methods for balancing control over different attributes. Our findings reveal significant limitations in LLMs’ capabilities in balancing trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.



Article 321

Title@2025-07-21 (1): Conflicting narratives and polarization on social media

Title: Conflicting narratives and polarization on social media Widersprüchliche Narrative und Polarisierung in den sozialen Medien 社交媒体的矛盾叙述和两极分化 2507.15600v1

Authors (1): Armin Pournaki

Narratives are key interpretative devices by which humans make sense of political reality. In this work, we show how the analysis of conflicting narratives, i.e. conflicting interpretive lenses through which political reality is experienced and told, provides insight into the discursive mechanisms of polarization and issue alignment in the public sphere. Building upon previous work that has identified ideologically polarized issues in the German Twittersphere between 2021 and 2023, we analyze the discursive dimension of polarization by extracting textual signals of conflicting narratives from tweets of opposing opinion groups. Focusing on a selection of salient issues and events (the war in Ukraine, Covid, climate change), we show evidence for conflicting narratives along two dimensions: (i) different attributions of actantial roles to the same set of actants (e.g. diverging interpretations of the role of NATO in the war in Ukraine), and (ii) emplotment of different actants for the same event (e.g. Bill Gates in the right-leaning Covid narrative). Furthermore, we provide first evidence for patterns of narrative alignment, a discursive strategy that political actors employ to align opinions across issues. These findings demonstrate the use of narratives as an analytical lens into the discursive mechanisms of polarization.



Article 322

Title@2025-07-21 (1): Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Title: Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente 恢复文化遗产:全面恢复历史文件的新办法 2507.05108v2

Authors (8): Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin

Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.



Article 323

Title@2025-07-21 (1): Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging

Title: Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging Smart Eyes für Silent Threats: VLMs und In-Context Learning für THz Imaging 静默威胁的 “ 聪明的眼睛 “ :VLMs和THz成像的内书学习 2507.15576v1

Authors (3): Nicolas Poggi, Shashank Agnihotri, Margret Keuper

Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: https://github.com/Nicolas-Poggi/Project_THz_Classification/tree/main



Article 324

Title@2025-07-21 (1): clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Title: clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations clem:todd: Ein Rahmen für die systematische Benchmarking von LLM-basierten, auf Aufgaben ausgerichteten Dialogsystem-Realisierungen 模块:基于LLM的以任务为导向的对话系统实现情况系统基准化框架 2505.05445v2

Authors (3): Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, focusing either on a single user simulator or on a specific system design, which limits the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.



Article 325

Title@2025-07-21 (1): Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Title: Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification Textübertragung bewerten: Ein Neun-Sprachen-Benchmark für Textentgiftung 评价文本样式转让:文本解毒九语言基准 2507.15557v1

Authors (4): Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing a more reliable multilingual TST evaluation pipeline in the text detoxification case.



Article 326

Title@2025-07-21 (1): Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Title: Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems Mehr tun mit weniger: Eine Umfrage über Routing-Strategien zur Ressourcenoptimierung in großsprachlichen modellbasierten Systemen 少花钱多办事:关于大语言示范系统资源优化区域战略的调查 2502.00409v3

Authors (6): Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, François Jacquenet

Large Language Model (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption. This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance. We define the objectives to optimise, such as cost minimisation and performance maximisation, and discuss the timing of routing within the LLM workflow, whether it occurs before or after generation. We also detail the various implementation strategies, including similarity-based, supervised, reinforcement learning-based, and generative methods. Practical considerations, such as industrial applications and current limitations (e.g. standardising routing experiments, accounting for non-financial costs, and designing adaptive strategies), are also examined. By formalising routing as a performance-cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.
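
One simple way to read the performance-cost framing is as a scalarised objective: pick the component whose predicted quality, minus a weighted cost, is highest. The sketch below illustrates that reading for routing before generation. The class `Candidate`, the weight `cost_weight`, the model names, and all numbers are hypothetical; the survey covers many other formulations and does not prescribe this one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float               # assumed cost per query, arbitrary units
    predicted_quality: float  # assumed quality estimate for this query, in [0, 1]

def route(candidates, cost_weight):
    """Return the candidate maximising predicted quality minus weighted cost."""
    return max(candidates, key=lambda c: c.predicted_quality - cost_weight * c.cost)

# Hypothetical candidates for an easy query: the small model is almost as good.
candidates = [
    Candidate("small-model", cost=1.0, predicted_quality=0.80),
    Candidate("large-model", cost=10.0, predicted_quality=0.90),
]

print(route(candidates, cost_weight=0.0).name)   # cost ignored: quality alone decides
print(route(candidates, cost_weight=0.05).name)  # cost matters: the cheaper model wins
```

In a real system the quality estimate would come from one of the strategy families the survey lists (similarity-based, supervised, reinforcement learning-based, or generative), and the cost term could also include energy or latency.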



Article 327

Title@2025-07-21 (1): KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

Title: KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan KazMMLU: Bewertung von Sprachmodellen zu kasachischen, russischen und regionalen Kenntnissen Kasachstans KazMMMLU:评估哈萨克斯坦哈萨克语、俄语和区域知识的语言模式 2502.12829v2

Authors (14): Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto

Despite having a population of twenty million, Kazakhstan’s culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan’s bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.



Article 328

Title@2025-07-21 (1): Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks

Title: Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks Beeinflussen Emotionen wirklich die Überzeugung von Argumenten? Ein dynamischer Ansatz mit LLM-basierten Manipulationsprüfungen 情感真的会真的影响竞价说服力吗? 使用基于 LLM 的操纵测试的动态方法 2503.00024v2

Authors (2): Yanran Chen, Steffen Eger

Emotions have been shown to play a role in argument convincingness, yet this aspect is underexplored in the natural language processing (NLP) community. Unlike prior studies that use static analyses, focus on a single text domain or language, or treat emotion as just one of many factors, we introduce a dynamic framework inspired by manipulation checks commonly used in psychology and social science; leveraging LLM-based manipulation checks, this framework examines the extent to which perceived emotional intensity influences perceived convincingness. Through human evaluation of arguments across different languages, text domains, and topics, we find that in over half of cases, human judgments of convincingness remain unchanged despite variations in perceived emotional intensity; when emotions do have an impact, they more often enhance rather than weaken convincingness. We further analyze whether 11 LLMs behave like humans in the same scenario, finding that while LLMs generally mirror human patterns, they struggle to capture nuanced emotional effects in individual judgments.



Article 329

Title@2025-07-21 (1): Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

Title: Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models Schritt-Level-Verifier-geführte Hybrid-Test-Time-Skalierung für große Sprachmodelle 大语言模型的逐步一级核证人-制导大语言模型混合试验-时间缩放 2507.15512v1

Authors (10): Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu

Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. Building on its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.



Article 330

Title@2025-07-21 (1): Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Title: Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback Off-Policy korrigierte Prämienmodellierung für verstärktes Lernen aus menschlichem Feedback 利用人类反馈加强学习的非政策纠正奖励模型 2507.15507v1

Authors (3): Johannes Ackermann, Takashi Ishida, Masashi Sugiyama

Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
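As a rough illustration of importance weighting applied to a pairwise reward-model loss, the sketch below reweights a Bradley-Terry loss by the ratio of current-policy to data-collection-policy probabilities. It is a generic self-normalised estimator written under stated assumptions, not the OCRM estimator from the paper; the function names and the tuple layout `(r_chosen, r_rejected, logp_new, logp_old)` are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def weighted_preference_loss(examples):
    """Self-normalised importance-weighted Bradley-Terry loss.

    Each example is (r_chosen, r_rejected, logp_new, logp_old), where
    r_* are reward-model scores and logp_new / logp_old are assumed to be
    the log-probabilities of the response pair under the current policy
    and under the policy that originally generated the data.
    """
    weights = [math.exp(e[2] - e[3]) for e in examples]
    total = sum(weights)
    weighted = sum(w * log_sigmoid(e[0] - e[1]) for w, e in zip(weights, examples))
    return -weighted / total


# Equal log-probabilities give uniform weights, i.e. the ordinary mean loss.
on_policy = [(1.0, 0.0, -2.0, -2.0), (0.0, 0.0, -1.0, -1.0)]
print(weighted_preference_loss(on_policy))
```

Pairs that the current policy is more likely to produce than the data-collection policy receive larger weights, so the loss concentrates on the region the policy has drifted towards.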



Article 331

Title@2025-07-21 (1): ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution

Title: ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution ASPERA: Eine simulierte Umgebung, um Planung für komplexe Aktionen zu bewerten ASPERA:评估复杂行动执行规划的模拟环境 2507.15501v1

Authors (9): Alexandru Coca, Mark Gaynor, Zhenxing Zhang, Jianpeng Cheng, Bo-Hsiang Tseng, Pete Boothroyd, Héctor Martinez Alonso, Diarmuid Ó Séaghdha, Anders Johannsen

This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.



Article 332

Title@2025-07-21 (1): OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

Title: OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning OMoE: Diversifizierende Mischung aus Low-Rank-Anpassung durch Orthogonal Finetuning OMoE:通过矫形微调使低Rank适应混合体多样化 2501.10062v2

Authors (6): Jinyuan Feng, Zhiqiang Pu, Tianyi Hu, Dongmin Li, Xiaolin Ai, Huimu Wang

Building a mixture-of-experts (MoE) architecture for low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply increasing the number of experts cannot guarantee significant improvement. In this work, we first conduct a qualitative analysis indicating that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Furthermore, our analysis reveals that the performance of previous MoE variants may be limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts’ representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it requires fewer experts than vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE can consistently achieve stable and efficient performance improvement when compared with the state-of-the-art methods while significantly reducing the number of required experts.
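The Gram-Schmidt step mentioned above can be sketched as follows. This is plain (modified) Gram-Schmidt on a list of vectors standing in for expert representations; how OMoE applies it inside the network is not specified in the abstract, so the function name, the `eps` threshold, and the toy inputs are assumptions.

```python
def gram_schmidt(vectors, eps=1e-12):
    """Return an orthonormal basis spanning the input vectors.

    Each vector has its projections onto the already accepted basis
    vectors removed, then is normalised. Vectors that become (near) zero
    are dropped, so duplicated directions do not yield extra basis vectors.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


# Toy stand-ins for three expert representations.
experts = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
orthonormal = gram_schmidt(experts)
```

The rows of `orthonormal` are mutually orthogonal unit vectors, i.e. together they form a point on a Stiefel manifold, which is the constraint the abstract describes.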



Article 333

Title@2025-07-21 (1): KAT-V1: Kwai-AutoThink Technical Report

Title: KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink 技术报告 2507.08297v3

Authors (30): Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu

We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage. Notably, KAT outperforms all open-source models and even surpasses o3-mini on the leakage-controlled LiveCodeBench Pro. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), where it improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) model with 40B active parameters, and early results already show significant gains, further demonstrating the scalability of the AutoThink paradigm.



Article 334

Title@2025-07-21 (1): DARE: Diverse Visual Question Answering with Robustness Evaluation

Title: DARE: Diverse Visual Question Answering with Robustness Evaluation DARE: Diverse visuelle Fragebeantwortung mit Robustheitsbewertung DARE: 以强力评价回答多种视觉问题 2409.18023v2

Authors (3): Hannah Sterz, Jonas Pfeiffer, Ivan Vulić

Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on variations of the prompts, the subsets of answer options, the output format, and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match that of closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.



Article 335

Title@2025-07-21 (1): STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Title: STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning STUN: Strukturierte und dann unstrukturierte Pruning für skalierbare MoE Pruning STUN: 结构化的当时无结构化的为可缩缩的MoE Pruning提供结构化的当时无结构化的谨慎 2409.06211v2

Authors (6): Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

Mixture-of-experts models (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts, cannot scale for recent MoEs, we propose a scalable alternative with $O(1)$ complexity that nevertheless outperforms the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Our method is highly effective: for Snowflake Arctic, a 480B-sized MoE with 128 experts, it needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails. The code will be made publicly available.



Article 336

Title@2025-07-21 (1): End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data

Title: End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data End-to-End-Gemeinsame Pünktliche und Normalisierte ASR mit einer begrenzten Menge an Pünktlichen Trainingsdaten 配有数量有限的点对培训数据的点对端联合标点和正常化的ASR 2311.17741v3

Authors (4): Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

Joint punctuated and normalized automatic speech recognition (ASR) aims at outputting transcripts with and without punctuation and casing. This task remains challenging due to the lack of paired speech and punctuated text data in most ASR corpora. We propose two approaches to train an end-to-end joint punctuated and normalized ASR system using limited punctuated data. The first approach uses a language model to convert normalized training transcripts into punctuated transcripts. This achieves better performance on out-of-domain test data, with up to 17% relative Punctuation-Case-aware Word Error Rate (PC-WER) reduction. The second approach uses a single decoder conditioned on the type of output. This yields a 42% relative PC-WER reduction compared to Whisper-base and a 4% relative (normalized) WER reduction compared to the normalized output of a punctuated-only model. Additionally, our proposed model demonstrates the feasibility of a joint ASR system using as little as 5% punctuated training data with a moderate (2.42% absolute) PC-WER increase.



Article 337

Title@2025-07-21 (1): Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Title: Entity-aware Cross-lingual Claim Detection for Automated Fact-checking Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking 用于自动实况调查的有实体意识的跨语言交叉索赔调查 2503.15220v4

Authors (2): Rrubaa Panchendrarajan, Arkaitz Zubiaga

Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite notable progress, challenges remain, particularly in handling multilingual data prevalent in online discourse. Recent efforts have focused on fine-tuning pre-trained multilingual language models to address this. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle multilingual claims. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model stands out as an effective solution, demonstrating consistent performance gains across 27 languages and robust knowledge transfer between languages seen and unseen during training.



Article 338

Title@2025-07-21 (1): KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Title: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell KaLM-Embedding-V2:高级培训技术和数据预报 2506.20923v2

Authors (17): Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
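For readers unfamiliar with mean-pooling, the following minimal sketch shows how a variable-length sequence of token vectors can be reduced to one fixed-length embedding. It is a generic illustration in plain Python, not code from KaLM-Embedding-V2; the function name and the convention that the attention mask marks padding with 0 are assumptions, and the input is assumed to contain at least one token.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is non-zero.

    token_embeddings: list of equal-length float lists, one per token.
    attention_mask:   list of 0/1 flags; 0 marks padding (assumed convention).
    Returns a single vector with the same dimensionality as one token.
    """
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row stands in for padding
sentence_embedding = mean_pool(tokens, [1, 1, 0])
```

Because padded positions are excluded, the result depends only on the real tokens, so sequences of different lengths map to embeddings of the same size.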



Article 339

Title@2025-07-21 (1): AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming

Title: AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming AlgoSimBench: Algorithmisch ähnliche Probleme für wettbewerbsfähige Programmierung identifizieren AlgoSimBeunch:为竞争性方案拟订查明在职等上相似的难题 2507.15378v1

Authors (2): Jierui Li, Raymond Mooney

Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs’ ability to identify algorithmically similar problems (ASPs), that is, problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7% to 11.7% across different models. We also evaluated code embedding models and retrieval methods on similar problem identification. While the adversarial selection of problems degrades the performance to be less than random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2% accuracy. Code and data are available at github.com



Article 340

Title@2025-07-21 (1): MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs

Title: MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs MKE-Coder: Multi-Axial-Wissen mit Evidenzverifizierung bei ICD-Coding für chinesische EMRs MKE-编码器:中文EMR的ICD编码中多轴知识与证据核查的多轴知识 2502.14916v3

Authors (5): Xinxin You, Xien Liu, Xue Yang, Ziyi Wang, Ji Wu

The task of automatically coding the International Classification of Diseases (ICD) in the medical field is well established and has received much attention. Automatic ICD coding has been successful in English but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In a practical evaluation of our method within simulated real coding scenarios, our approach significantly aids coders in enhancing both their coding accuracy and speed.



Article 341

Title@2025-07-21 (1): STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Title: STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models STITCH: Gleichzeitiges Denken und Sprechen mit Chunked Reasoning für gesprochene Sprachmodelle SSTTCH: 同时思考和交谈 与口语模式的“关键理由”对话 2507.15375v1

Authors (10): Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.



Article 342

Title@2025-07-21 (1): Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Title: Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation Meta4XNLI: Ein Crosslingual Parallel Corpus für die Erkennung und Interpretation von Metaphoren Meta4XNLI: 用于识别和解释代名词的跨语言平行体 2404.07053v3

Authors (2): Elisa Sanchez-Bayona, Rodrigo Agerri

Metaphors are a ubiquitous but often overlooked part of everyday language. As a complex cognitive-linguistic phenomenon, they provide a valuable means to evaluate whether language models can capture deeper aspects of meaning, including semantic, pragmatic, and cultural context. In this work, we present Meta4XNLI, the first parallel dataset for Natural Language Inference (NLI) newly annotated for metaphor detection and interpretation in both English and Spanish. Meta4XNLI facilitates the comparison of encoder- and decoder-based models in detecting and understanding metaphorical language in multilingual and cross-lingual settings. Our results show that fine-tuned encoders outperform decoder-only LLMs in metaphor detection. Metaphor interpretation is evaluated via the NLI framework, where masked and autoregressive models perform comparably and where performance notably decreases when the inference is affected by metaphorical language. Our study also finds that translation plays an important role in the preservation or loss of metaphors across languages, introducing shifts that might impact metaphor occurrence and model performance. These findings underscore the importance of resources like Meta4XNLI for advancing the analysis of the capabilities of language models and improving our understanding of metaphor processing across languages. Furthermore, the dataset offers previously unavailable opportunities to investigate metaphor interpretation, cross-lingual metaphor transferability, and the impact of translation on the development of multilingual annotated resources.



Article 343

Title@2025-07-21 (1): Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding

Title: Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding Metaphorische und große Sprachmodelle: Wenn Oberflächenmerkmale mehr ausmachen als tiefes Verständnis 名词和大语言模型:当地表地貌特征比深了解更重要时 2507.15357v1

Authors (2): Elisa Sanchez-Bayona, Rodrigo Agerri

This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs’ performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available.



Article 344

Title@2025-07-21 (1): ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events

Title: ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events ChronoSense: Erforschen des zeitlichen Verständnisses in großen Sprachmodellen mit Zeitintervallen von Ereignissen Chronossensensense:探索具有时际事件间隔的大型语言模型中的时间理解 2501.03040v2

Authors (2): Duygu Sezen Islakoglu, Jan-Christoph Kalo

Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has attracted increasing research attention. However, comprehensive testing of Allen’s interval relations (e.g., before, after, during), a fundamental framework for temporal relationships, remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs’ temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models’ low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
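Allen's interval algebra defines thirteen mutually exclusive relations between two intervals. The sketch below classifies a pair of intervals into one of them; it is an illustration of the algebra itself, not code from the ChronoSense repository, and the function name, the `(start, end)` tuple representation, and the relation labels are assumptions.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Intervals are (start, end) pairs with start < end, e.g. years.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: partial overlap with distinct endpoints.
    return "overlaps" if a_start < b_start else "overlapped-by"


print(allen_relation((1990, 1995), (1993, 2000)))
```

Each relation has an inverse (before/after, meets/met-by, and so on), which is why swapping the two arguments is a natural consistency check for a model's answers.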



Article 345

Title@2025-07-21 (1): Probing Information Distribution in Transformer Architectures through Entropy Analysis

Title: Probing Information Distribution in Transformer Architectures through Entropy Analysis Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse 通过 Entropy 分析在变形结构中进行测试信息发布 2507.15347v1

Authors (5): Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano

This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may also contribute to the development of interpretability and evaluation frameworks for Transformer-based models.
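
Illustrative sketch (not taken from the paper): one common way to quantify token-level uncertainty is the Shannon entropy of the next-token distribution obtained from a softmax over logits. The abstract does not state the exact estimator, so the functions below are an assumed, minimal version using only the standard library.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy, in bits, of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log2(p) for p in probs if p > 0.0)
```

A uniform distribution over four tokens gives 2 bits, while a sharply peaked distribution gives an entropy close to zero; tracking this value per token and per layer is the kind of pattern the abstract refers to.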



Article 346

Title@2025-07-21 (1): LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Title: LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators LionGuard 2: Leichte, dateneffiziente und lokalisierte Mehrsprachige Inhaltsmoderatoren bauen 狮子座标2:轻量、数据效率和本地化多语种内容主持人 2507.15339v1

Authors (4): Leanne Tan, Gabriel Chua, Ziyu Ge, Roy Ka-Wei Lee

Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants, creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.



Article 347

Title@2025-07-21 (1): Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Title: Reasoning Models are Test Exploiters: Rethinking Multiple-Choice Reasoning Models sind Testexploiter: Multi-Choice neu denken 说明理由的模型是实验性剥削者:重新思考多选择 2507.15337v1

Authors (3): Narun Raman, Taylor Lundy, Kevin Leyton-Brown

When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, HLE) and 25 different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether “none of the above” sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs’ genuine reasoning capabilities.



Article 348

Title@2025-07-21 (1): Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Mixture-of-Recursions: Dynamische Rekursive Tiefen für adaptive Token-Level-Computation lernen 混合流流流:学习适应调控级计算法的动态回流深度 2507.10524v2

Authors (11): Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.



Article 349

Title@2025-07-21 (1): On the Inevitability of Left-Leaning Political Bias in Aligned Language Models

Title: On the Inevitability of Left-Leaning Political Bias in Aligned Language Models Zur Unvermeidlichkeit linksleanender politischer Bias in gerichteten Sprachmodellen 关于采用统一语言模式的左倾政治偏见的不可避免的问题 2507.15328v1

Authors (1): Thilo Hagendorff

The guiding principle of AI alignment is to train large language models (LLMs) to be harmless, helpful, and honest (HHH). At the same time, there are mounting concerns that LLMs exhibit a left-wing political bias. Yet, the commitment to AI alignment cannot be harmonized with the latter critique. In this article, I argue that intelligent systems that are trained to be harmless and honest must necessarily exhibit left-wing political bias. Normative assumptions underlying alignment objectives inherently concur with progressive moral frameworks and left-wing principles, emphasizing harm avoidance, inclusivity, fairness, and empirical truthfulness. Conversely, right-wing ideologies often conflict with alignment guidelines. Yet, research on political bias in LLMs is consistently framing its insights about left-leaning tendencies as a risk, as problematic, or concerning. This way, researchers are actively arguing against AI alignment, tacitly fostering the violation of HHH principles.



Article 350

Title@2025-07-21 (1): Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

Title: Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models Katzen verunsichern LLM: Abfrage von Agnostiker-Adversarial-Triggern für vernunftbewusste Modelle Cats 配置理由解释的LLM: 用于说明理由模型的询问Agnistic Aversarial 触发器 2503.01781v2

Authors (8): Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani

We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers: short, irrelevant text that, when appended to math problems, systematically misleads models into outputting incorrect answers without altering the problem’s semantics. We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in a greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending “Interesting fact: cats sleep most of their lives” to any math problem more than doubles the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. The CatAttack triggers dataset with model responses is available at https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers.



Article 351

Title@2025-07-21 (1): ACEBench: Who Wins the Match Point in Tool Usage?

Title: ACEBench: Who Wins the Match Point in Tool Usage? ACEBench: Wer gewinnt den Match Point in der Werkzeugnutzung? CEBench:谁在工具使用中赢得了匹配点? 2501.12851v6

Authors (16): Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. “Normal” evaluates tool usage in basic scenarios; “Special” evaluates tool usage in situations with ambiguous or incomplete instructions; “Agent” evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.



Article 352

Title@2025-07-21 (1): FastMCTS: A Simple Sampling Strategy for Data Synthesis

Title: FastMCTS: A Simple Sampling Strategy for Data Synthesis FastMCTS: Eine einfache Probenahmestrategie für die Datensynthese 数据综合简单抽样战略 2502.11476v2

Authors (8): Peiji Li, Kai Lv, Yunfan Shao, Yichuan Ma, Linyang Li, Xiaoqing Zheng, Xipeng Qiu, Qipeng Guo

Synthetic high-quality multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data. Our code will be released soon.



Article 353

Title@2025-07-21 (1): Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection

Title: Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection Beyond Easy Wins: Ein Text Hardness-Aware Benchmark für LLM-generierte Texterkennung 超越简单赢:LLM生成的文本检测的文本硬度软件基准 2507.15286v1

Authors (3): Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee

We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e., the maintenance of consistent performance across diverse domains and adversarial scenarios) a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: https://github.com/navid-aub/SHIELD-Benchmark)



Article 354

Title@2025-07-21 (1): A Novel Self-Evolution Framework for Large Language Models

Title: A Novel Self-Evolution Framework for Large Language Models Ein neuartiges Selbst-Evolution-Rahmenwerk für große Sprachmodelle 大语言模式新自演框架 2507.15281v1

Authors (3): Haoran Sun, Zekun Zhang, Shaoning Zeng

The capabilities of Large Language Models (LLMs) are limited to some extent by pre-training, so some researchers optimize LLMs through post-training. Existing post-training strategies, such as memory-based retrieval or preference optimization, improve user alignment yet fail to enhance the model’s domain cognition. To bridge this gap, we propose a novel Dual-Phase Self-Evolution (DPSE) framework that jointly optimizes user preference adaptation and domain-specific competence. DPSE introduces a Censor module to extract multi-dimensional interaction signals and estimate satisfaction scores, which guide structured data expansion via topic-aware and preference-driven strategies. These expanded datasets support a two-stage fine-tuning pipeline: supervised domain grounding followed by frequency-aware preference optimization. Experiments across general NLP benchmarks and long-term dialogue tasks demonstrate that DPSE consistently outperforms Supervised Fine-Tuning, Preference Optimization, and Memory-Augmented baselines. Ablation studies validate the contribution of each module. In this way, our framework provides an autonomous path toward continual self-evolution of LLMs.



Article 355

Title@2025-07-21 (1): ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling

Title: ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling ChiMed 2.0: Fortschrittlicher chinesischer medizinischer Datensatz bei der Erleichterung des großen Sprachmodellierens 切米德2.0:推进中国医疗数据集,促进大语言建模 2507.15275v1

Authors (5): Yuanhe Tian, Junjie Liu, Zhizhou Kou, Yuxiang Li, Yan Song

Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset’s effectiveness and applicability.



Article 356

Title@2025-07-21 (1): A2TTS: TTS for Low Resource Indian Languages

Title: A2TTS: TTS for Low Resource Indian Languages A2TTS: TTS für ressourcenarme indische Sprachen A2TTS: 低资源印度语言TTS 2507.15272v1

Authors (4): Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Isha Pandey, Ganesh Ramakrishnan

We present a speaker-conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employ classifier-free guidance, allowing the system to generate speech closer to natural speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models on the IndicSUPERB dataset for multiple Indian languages, such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.



Article 357

Title: GREAT: Guiding Query Generation with a Trie for Recommending Related Search about Video at Kuaishou GREAT: Guiding Query Generation mit einem Versuch zum Empfehlen Verwandte Suche zum Thema Video bei Kuaishou 大:指导Query Greaking Query Generation 与一个三合队在广州建议相关视频搜索 2507.15267v1

Authors (6): Ninglu Shao, Jinshan Wang, Chenxu Wang, Qingbiao Li, Xiaoxue Zang, Han Li

Currently, short video platforms have become the primary place for individuals to share experiences and obtain information. To better meet users’ needs for acquiring information while browsing short videos, some apps have introduced a search entry at the bottom of videos, accompanied by recommended relevant queries. This scenario is known as query recommendation in video-related search, where the core task is item-to-query (I2Q) recommendation. As this scenario has only emerged in recent years, there is a notable scarcity of academic research and publicly available datasets in this domain. To address this gap, we systematically examine the challenges associated with this scenario for the first time. Subsequently, we release a large-scale dataset derived from real-world data pertaining to query recommendation in video-related search on the Kuaishou app (KuaiRS). Presently, existing methods rely on embeddings to calculate similarity for matching short videos with queries, lacking deep interaction between the semantic content and the query. In this paper, we introduce a novel LLM-based framework named GREAT, which guides query generation with a trie to address I2Q recommendation in related search. Specifically, we first gather high-quality queries with high exposure and click-through rate to construct a query-based trie. During training, we enhance the LLM’s capability to generate high-quality queries using the query-based trie. In the inference phase, the query-based trie serves as a guide for the token generation. Finally, we further refine the relevance and literal quality between items and queries via a post-processing module. Extensive offline and online experiments demonstrate the effectiveness of our proposed method.
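
Illustrative sketch (not taken from the paper): trie-guided token generation can be pictured as restricting each decoding step to tokens that extend some stored query. The class `QueryTrie`, the `END` marker, and the `score_fn` callback standing in for an LLM's next-token scores are all hypothetical names introduced here; GREAT's actual training and inference procedure is not specified in the abstract.

```python
END = "<eos>"  # hypothetical end-of-query marker


class QueryTrie:
    """Trie over tokenized queries, stored as nested dictionaries."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete query ends here

    def allowed_next(self, prefix):
        """Return the set of tokens that may follow the given prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node.keys())


def constrained_greedy_decode(trie, score_fn, max_len=16):
    """Greedy decoding where only trie-consistent tokens are considered.

    score_fn(prefix, token) is a stand-in for a language model's score.
    """
    prefix = []
    for _ in range(max_len):
        candidates = trie.allowed_next(prefix)
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break
        prefix.append(best)
    return prefix
```

Because every step is limited to `allowed_next`, the decoded sequence is always a prefix of a query in the trie, and it is a complete stored query whenever decoding stops on `END`.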



Article 358

Title@2025-07-21 (1): Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Title: Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models Visually Guided Decoding: Gradient-Free Hard Prompt Inversion mit Sprachmodellen 视觉导导解码: 带语言模型的逐步无限制硬快速翻版 2505.08622v2

Authors (4): Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim

Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are of limited effectiveness because of their limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.



Article 359

Title@2025-07-21 (1): Commonsense Reasoning in Arab Culture

Title: Commonsense Reasoning in Arab Culture Commonsense Vernunft in der arabischen Kultur 阿拉伯文化中的常识理由 2502.12788v2

Authors (10): Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto

Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce ArabCulture, a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. ArabCulture spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.



Article 360

Title@2025-07-21 (1): SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest

Title: SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest SOI Matters: Analyse von Multi-Setting-Trainingsdynamiken in vorgebildeten Sprachmodellen über Teilmengen von Interesse SOI事项:分析通过利益子集分析培训前语言模式中多设置培训动态 2507.15236v1

Authors (4): Shayan Vassef, Amirhossein Dabiriaghdam, Mohammadreza Bakhtiari, Yadollah Yaghoobzadeh

This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models. To enhance this analysis, we introduce Subsets of Interest (SOI), a novel categorization framework that identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples. Through SOI transition heatmaps and dataset cartography visualization, we analyze how examples shift between these categories when transitioning from single-setting to multi-setting configurations. We perform comprehensive experiments across three parallel comparisons: multi-task vs. single-task learning using English tasks (entailment, paraphrase, sentiment), multi-source vs. single-source learning using sentiment analysis datasets, and multi-lingual vs. single-lingual learning using intent classification in French, English, and Persian. Our results demonstrate that multi-source learning consistently improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results with notable gains in similar task combinations. We further introduce a two-stage fine-tuning approach where the second stage leverages SOI-based subset selection to achieve additional performance improvements. These findings provide new insights into training dynamics and offer practical approaches for optimizing multi-setting language model performance.
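
Illustrative sketch (not taken from the paper): the three learning-behavior categories named above can be read off an example's per-epoch correctness history. The abstract does not define the six SOI categories, so the rules below are assumptions: "forgettable" is taken to mean that the example was classified correctly at some epoch and incorrectly at a later one, and everything not covered by the three named categories is lumped into a hypothetical "other" label.

```python
def categorize(history):
    """Assign a coarse category from a list of per-epoch correctness flags."""
    if not history:
        raise ValueError("history must contain at least one epoch")
    if all(history):
        return "always_correct"
    if not any(history):
        return "unlearned"
    # A forgetting event: correct at one epoch, incorrect at the next.
    forgotten = any(prev and not cur for prev, cur in zip(history, history[1:]))
    return "forgettable" if forgotten else "other"
```

Comparing the category of each example under a single-setting run and a multi-setting run yields the kind of transition counts that a heatmap could display.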



Article 361

Title@2025-07-21 (1): Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Title: Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning Search-R1: LLMs zu Grund und Hebel-Suchmaschinen mit Verstärkungs-Lernen 搜索R1:培训 “ 理性与利用搜索引擎与强化学习 “ 培训LLMS 2503.09516v4

Authors (8): Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully know how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
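
Illustrative sketch (not taken from the paper): retrieved token masking can be understood as excluding tokens that came from the search engine, rather than from the model, when averaging a per-token training signal. The tag names `<information>` and `</information>`, and the helper names below, are assumptions for illustration; the repository linked above is the authority on the real implementation.

```python
def loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for model-generated tokens and 0 for retrieved-passage tokens.

    The delimiter tags themselves are masked out as well.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


def masked_mean(values, mask):
    """Average per-token values over the unmasked positions only."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

With such a mask, large per-token losses on retrieved text do not influence the update, which is one plausible reading of why masking helps keep RL training stable.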



Article 362

Title@2025-07-21 (1): Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

Title: Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models PTSD in klinischen Interviews erkennen: Eine vergleichende Analyse von NLP-Methoden und großen Sprachmodellen 临床访谈中检测创伤后创伤后精神紧张症:国家语言规划方法和大语言模式的比较分析 2504.01216v2

Authors (5): Feng Chen, Dror Ben-Zeev, Gillian Sparks, Arya Kadakia, Trevor Cohen

Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific end-to-end models significantly outperformed general models (Mental-RoBERTa AUPRC=0.675±0.084 vs. RoBERTa-base 0.599±0.145). SentenceBERT embeddings with neural networks achieved the highest overall performance (AUPRC=0.758±0.128). Few-shot prompting using DSM-5 criteria yielded competitive results with two examples (AUPRC=0.737). Performance varied significantly across symptom severity and comorbidity status with depression, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.
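
Illustrative sketch (not taken from the paper): AUPRC, the metric reported above, is often estimated by average precision, the mean of the precision values at the rank of each positive example. The function below is a minimal standard-library version that ignores tied scores; whether the study used this estimator or another interpolation of the precision-recall curve is not stated in the abstract.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) and real-valued scores."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos
```

Unlike AUROC, this quantity depends on the share of positive cases, which is why it is a common choice for screening tasks where the condition of interest is relatively rare.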



Article 363

Title@2025-07-21 (1): Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems

Title: Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems Ausnutzen von kontextabhängigen Dauerfunktionen für Sprachanonymisierungs-Angriffsysteme 利用语音匿名攻击系统视具体情况而定的期间特征 2507.15214v1

Authors (3): Natalia Tomashenko, Emmanuel Vincent, Marc Tommasi

The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.



Article 364

Title@2025-07-21 (1): Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment

Title: Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment Collaborative Destillationsstrategien für parametereffiziente Sprachmodell-Einsatz 辅助计量有效语言模式部署的协作性静修战略 2507.15198v1

Authors (6): Xiandong Meng, Yan Wu, Yexin Tian, Xin Hu, Tianze Kang, Junliang Du

This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and integrates their output probability distributions and intermediate semantic features. This guides the student model to learn from multiple sources of knowledge. As a result, the student model gains stronger language understanding and generation ability while maintaining a small parameter size. To achieve this, the paper introduces a weighted output fusion mechanism, a feature alignment loss function, and an entropy-driven dynamic teacher weighting strategy. These components improve the quality and stability of knowledge transfer during distillation. Under multi-teacher guidance, the student model captures semantic information more effectively and demonstrates strong performance across multiple evaluation metrics. In particular, the method shows high consistency in expression, generalization ability, and task adaptability in tasks such as language modeling, text generation, and multi-task learning. The experiments compare the proposed method with several widely adopted distillation approaches. The results further confirm its overall advantages in perplexity, distillation loss, and generation quality. This study provides a feasible technical path for the efficient compression of large-scale language models. It also demonstrates the effectiveness of multi-teacher collaborative mechanisms in complex language modeling tasks.
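
Illustrative sketch (not taken from the paper): weighted output fusion with entropy-driven teacher weighting could, for example, give more weight to teachers whose output distribution is more confident (lower entropy) and then mix the teacher distributions accordingly. The specific rule used here, a softmax over negative entropies, and all function names are assumptions; the abstract does not give the paper's formula.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teacher_weights(teacher_probs):
    """Softmax over negative entropies: confident teachers get larger weights."""
    neg_entropies = [-entropy(p) for p in teacher_probs]
    m = max(neg_entropies)
    exps = [math.exp(x - m) for x in neg_entropies]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(teacher_probs):
    """Weighted mixture of the teachers' distributions over the same vocabulary."""
    weights = teacher_weights(teacher_probs)
    vocab_size = len(teacher_probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, teacher_probs))
        for i in range(vocab_size)
    ]
```

The fused distribution is itself a valid probability distribution, so it can serve as the soft target in a standard distillation loss for the student.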


Article 365

Title@2025-07-21 (1): Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

Title: Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles Hierarchische Prompting Taxonomie: Ein universeller Evaluationsrahmen für große Sprachmodelle, ausgerichtet auf menschliche Kognitive Prinzipien 符合人类认知原则的大语言模式普遍评价框架 2406.12644v5

Authors (6): Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha

Assessing the effectiveness of large language models (LLMs) in performing different tasks is crucial for understanding their strengths and weaknesses. This paper presents Hierarchical Prompting Taxonomy (HPT), grounded in human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT utilizes the Hierarchical Prompting Framework (HPF), which structures five unique prompting strategies in a hierarchical order based on their cognitive requirement on LLMs when compared to human mental capabilities. It assesses the complexity of tasks with the Hierarchical Prompting Index (HPI), which demonstrates the cognitive competencies of LLMs across diverse datasets and offers insights into the cognitive demands that datasets place on different LLMs. This approach enables a comprehensive evaluation of an LLM's problem-solving abilities and the intricacy of a dataset, offering a standardized metric for task complexity. Extensive experiments with multiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63% compared to baseline performance, with GSM8k being the most cognitively complex task among reasoning and coding tasks with an average HPI of 3.20, confirming the effectiveness of HPT. To support future research and reproducibility in this domain, the implementations of HPT and HPF are available here.


Article 366

Title@2025-07-21 (1): Empowering LLMs with Logical Reasoning: A Comprehensive Survey

Title: Empowering LLMs with Logical Reasoning: A Comprehensive Survey Stärkung von LLMs mit logischer Begründung: Eine umfassende Umfrage 赋予LLMs以逻辑理由:全面调查 2502.15652v4

Authors (6): Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin

Large language models (LLMs) have achieved remarkable successes on various tasks. However, recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs, which can be categorized into the following two aspects: (1) Logical question answering: LLMs often fail to generate the correct answer within a complex logical problem which requires sophisticated deductive, inductive or abductive reasoning given a collection of premises. (2) Logical consistency: LLMs are prone to producing responses contradicting themselves across different questions. For example, a state-of-the-art question-answering LLM, Macaw, answers "Yes" to both "Is a magpie a bird?" and "Does a bird have wings?" but answers "No" to "Does a magpie have wings?". To facilitate this research direction, we comprehensively investigate the most cutting-edge methods and propose a detailed taxonomy. Specifically, to accurately answer complex logic questions, previous methods can be categorized based on reliance on external solvers, prompts, and fine-tuning. To avoid logical contradictions, we discuss concepts and solutions of various logical consistencies, including implication, negation, transitivity, factuality consistencies, and their composites. In addition, we review commonly used benchmark datasets and evaluation metrics, and discuss promising research directions, such as extending to modal logic to account for uncertainty and developing efficient algorithms that simultaneously satisfy multiple logical consistencies.


Article 367

Title@2025-07-20 (7): What Level of Automation is “Good Enough”? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Title: What Level of Automation is “Good Enough”? A Benchmark of Large Language Models for Meta-Analysis Data Extraction Welche Stufe der Automatisierung ist “Gut genug”? Ein Benchmark für große Sprachmodelle für die Meta-Analyse-Datenextraktion 自动化的等级是“好到好”? 元分析数据提取大语言模式的基准 2507.15152v1

Authors (3): Lingbo Li, Anuradha Mathrani, Teo Susnjak

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.


Article 368

Title@2025-07-20 (7): A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge’ez Script

Title: A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge’ez Script Ein Fall gegen implizite Standards: Homophone Normalisierung in maschineller Übersetzung für Sprachen, die das Ge’ez Script verwenden 反对隐含标准案:使用盖兹文稿的语文机器翻译中同声传译正常化 2507.15142v1

Authors (7): Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Henok Biadglign Ademtew, Hizkel Mitiku Alemayehu, Negasi Haile Abadi, Tadesse Destaw Belay, Seid Muhie Yimam

Homophone normalization, where characters that have the same sound in a writing script are mapped to one character, is a pre-processing step applied in Amharic Natural Language Processing (NLP) literature. While this may improve performance reported by automatic metrics, it also results in models that are not able to understand different forms of writing in a single language. Further, there might be impacts in transfer learning, where models trained on normalized data do not generalize well to other languages. In this paper, we experiment with monolingual training and cross-lingual transfer to understand the impacts of normalization on languages that use the Ge’ez script. We then propose a post-inference intervention in which normalization is applied to model predictions instead of training data. With our simple scheme of post-inference normalization, we show that we can achieve an increase in BLEU score of up to 1.03 while preserving language features in training. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
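
A minimal sketch of the post-inference idea: train on unnormalized text, then apply homophone normalization only to model predictions (and references) at scoring time. The mapping below covers only the base forms of a few commonly cited Amharic homophone sets and is an illustrative assumption, not the authors' table. Exact match stands in for BLEU to keep the example self-contained.

```python
# Illustrative subset of Amharic homophone pairs (base order only); a real
# normalizer would cover every vowel order of each consonant.
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ",
    "ኀ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ",
    "ፀ": "ጸ",
})

def normalize(text: str) -> str:
    """Map each homophone character to one canonical character."""
    return text.translate(HOMOPHONE_MAP)

def exact_match_rate(preds, refs, post_normalize: bool) -> float:
    """Score predictions against references, optionally normalizing both
    sides after inference instead of normalizing the training data."""
    if post_normalize:
        preds = [normalize(p) for p in preds]
        refs = [normalize(r) for r in refs]
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

preds = ["ሠላም"]   # model output using one spelling variant
refs = ["ሰላም"]    # reference using the other variant

raw_score = exact_match_rate(preds, refs, post_normalize=False)
normalized_score = exact_match_rate(preds, refs, post_normalize=True)
```

Because the model itself never sees normalized data, it keeps the ability to read and produce both spellings; only the metric treats them as equivalent.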


Article 369

Title@2025-07-20 (7): A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation

Title: A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation Ein semantisch-basierter Optimierungsansatz zur Reparatur von LLMs: Fallstudie zur Codegenerierung 修复LLMLM 的基于语义的优化优化方法:关于代码生成的案例研究 2503.12899v3

Authors (4): Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Language Models (LMs) are widely used in software engineering for code generation, but they may produce code with errors. Rather than repairing the generated code, an alternative way is to address the underlying failures of models. LM repair offers a lightweight solution to this challenge: it requires minimal data, reduces computational costs, and reduces the side effects. Unlike retraining, LM repair focuses on applying tailored updates to targeted neurons, making it ideal for scenarios with limited resources, high-performance demands, or strict safety requirements. In this paper, we propose Semantic Targeting for Analytical Repair (STAR), a pioneering and novel semantic-based optimization approach for repairing LLMs. STAR realizes the main operations of repairing LMs in an optimization process, including locating "buggy neurons", solving "neuron patches", and patching "buggy neurons". Correspondingly, it computes the deltas of the weight matrix as the prior information to guide optimization; and attributes the targeted layers and neurons leveraging statistical insights. The neuron patches are computed with a solid semantic-based analytical formula, which directly bridges the changes to logits with the deltas of neurons, by steering latent representations. Compared to the prior work of LM repair (MINT) and optimization methods (SGD), STAR integrates their strengths while mitigating their limitations. STAR supports solving multiple failures together, significantly improving the usefulness. Evaluated on coding tasks using popular code LMs, STAR exhibits superior effectiveness (10.5%-19.9% improvements) and efficiency (2.4-7.0 times speedup). In terms of side effects, namely the balance between generalization and specificity, STAR outperforms prior work by a significant margin. Additionally, we conducted assessments on the overfitting risk of LM repair as well as the cumulative impact.


Article 370

Title@2025-07-20 (7): From Disagreement to Understanding: The Case for Ambiguity Detection in NLI

Title: From Disagreement to Understanding: The Case for Ambiguity Detection in NLI Von der Uneinigkeit zum Verständnis: Der Fall für Ambiguitätserkennung in NLI 从分歧到理解:国家调查局的模糊性探测案例 2507.15114v1

Authors (2): Chathuri Jayaweera, Bonnie Dorr

This position paper argues that annotation disagreement in Natural Language Inference (NLI) is not mere noise but often reflects meaningful interpretive variation, especially when triggered by ambiguity in the premise or hypothesis. While underspecified guidelines and annotator behavior can contribute to variation, content-based ambiguity offers a process-independent signal of divergent human perspectives. We call for a shift toward ambiguity-aware NLI by systematically identifying ambiguous input pairs and classifying ambiguity types. To support this, we present a unified framework that integrates existing taxonomies and illustrate key ambiguity subtypes through concrete examples. These examples reveal how ambiguity shapes annotator decisions and motivate the need for targeted detection methods that better align models with human interpretation. A key limitation is the lack of datasets annotated for ambiguity and subtypes. We propose addressing this gap through new annotated resources and unsupervised approaches to ambiguity detection – paving the way for more robust, explainable, and human-aligned NLI systems.


Article 371

Title@2025-07-20 (7): Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?

Title: Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference? Füllen der Lücke: Ist Commonsense Knowledge Generation nützlich für die natürliche Sprachinferenz? 填补空白:创造常识知识对自然语言推论有用吗? 2507.15100v1

Authors (3): Chathuri Jayaweera, Brianna Yanqui, Bonnie Dorr

Natural Language Inference (NLI) is the task of determining the semantic entailment of a premise for a given hypothesis. The task aims to develop systems that emulate natural human inferential processes where commonsense knowledge plays a major role. However, existing commonsense resources lack sufficient coverage for a variety of premise-hypothesis pairs. This study explores the potential of Large Language Models as commonsense knowledge generators for NLI along two key dimensions: their reliability in generating such knowledge and the impact of that knowledge on prediction accuracy. We adapt and modify existing metrics to assess LLM factuality and consistency in generating such knowledge in this context. While explicitly incorporating commonsense knowledge does not consistently improve overall results, it effectively helps distinguish entailing instances and moderately improves distinguishing contradictory and neutral inferences.


Article 372

Title@2025-07-20 (7): Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models

Title: Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models Nur ein wenig nach links: Ein theoriebasiertes Maß politischer Bias in großen Sprachmodellen 仅向左一小点:大语言模式中政治偏见的理论依据度量 2503.16148v2

Authors (5): Mats Faulborn, Indira Sen, Max Pellert, Andreas Spitz, David Garcia

Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, but generally more left-leaning for instruction-tuned models. Code and data are available on: https://github.com/MaFa211/theory_grounded_pol_bias


Article 373

Title@2025-07-20 (7): A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

Title: A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations Eine Strafe geht einen langen Weg: Lexikale Vielfalt in synthetischen Texten unter prompt beeinflussten Längenvariationen messen 惩罚有很长的路要走:在迅速影响长长变的情况下,在合成文字中衡量法律多样性 2507.15092v1

Authors (6): Vijeta Deshpande, Ishita Dasgupta, Uttaran Bhattacharya, Somdeb Sarkhel, Saayan Mitra, Anna Rumshisky

Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ($L_T$) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to $L_T$.
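
To make the length-bias problem concrete, the sketch below implements plain Type-Token Ratio (TTR) and the Moving-Average TTR (MATTR) baseline named in the abstract. The abstract does not state the PATTR formula, so PATTR itself is not implemented here; the window size and the toy text are arbitrary choices for illustration.

```python
def ttr(tokens):
    """Type-Token Ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-Average TTR: mean TTR over all sliding windows of fixed size.
    Falls back to plain TTR when the text is shorter than the window."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

short_text = "the cat sat on the mat".split()
long_text = short_text * 10          # same vocabulary, ten times longer

ttr_short = ttr(short_text)          # 5 types / 6 tokens
ttr_long = ttr(long_text)            # 5 types / 60 tokens
mattr_long = mattr(long_text, window=6)
```

Plain TTR drops sharply for the longer text even though the vocabulary is identical, which is the kind of bias toward shorter responses the abstract describes. MATTR reduces that sensitivity through fixed-size windows but takes no account of a target length $L_T$.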


Article 374

Title@2025-07-20 (7): Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

Title: Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling Bewertung von Codierungsschemata für transformerbasierte Gene-Sequenz-Modellierung 以变异器为基础的基因序列建模编码方案评价 2507.15087v1

Authors (4): Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song

Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, ALiBi, and RoPE. Each configuration is trained from scratch in 3, 6, 12, and 24-layer Transformer encoders and evaluated on the GUE benchmark dataset. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
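
A minimal sketch of fixed-length k-mer segmentation for a DNA string. The abstract does not say whether the k-mers overlap, so the `stride` parameter (non-overlapping by default) and the choice to drop a trailing fragment shorter than k are assumptions made for illustration.

```python
def kmer_tokenize(seq, k, stride=None):
    """Split a sequence into fixed-length k-mers.

    stride=None gives non-overlapping k-mers (stride == k); stride=1 gives
    fully overlapping k-mers. A trailing fragment shorter than k is dropped.
    """
    stride = k if stride is None else stride
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ACGTACGT"
non_overlapping = kmer_tokenize(seq, k=3)            # ['ACG', 'TAC']
overlapping = kmer_tokenize(seq, k=3, stride=1)      # six overlapping 3-mers
single_bases = kmer_tokenize(seq, k=1)               # one token per base
```

With k=1 the token sequence is as long as the DNA string itself, while larger k shortens it at the cost of a vocabulary that grows as 4^k. BPE instead learns variable-length tokens from frequent motifs.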


Article 375

Title@2025-07-20 (7): The Dual-Route Model of Induction

Title: The Dual-Route Model of Induction Das Dual-Routen-Modell der Induktion 双重制引模式 2504.03022v2

Authors (4): Sheridan Feucht, Eric Todd, Byron Wallace, David Bau

Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we discover a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim (like copying nonsense tokens). These two “routes” operate independently: we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. By patching concept induction head outputs, we find that they contain language-independent word representations that mediate natural language translation, suggesting that LLMs represent abstract word meanings independent of language or form.


Article 376

Title@2025-07-20 (7): OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Title: OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs OpeNLGauge: Ein erklärbares Maß für die NLG-Evaluierung mit offenen LLMs OpeNLGauge: NLG 评估可解释的计量器,使用开放重力LMs 2503.11858v2

Authors (3): Ivan Kartáč, Mateusz Lango, Ondřej Dušek

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.


Article 377

Title@2025-07-20 (7): WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Title: WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization WebShaper: Agentische Datensynthese über Informationssuche Formalisierung WebShaper: 通过信息搜索正规化实现数据同步化 2507.15061v1

Authors (13): Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between information structure and reasoning structure, and between question and answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander expands the current formal question into a more complex one with retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.


Article 378

Title@2025-07-20 (7): RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Title: RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback RefCritic: Training von langen Ketten-of-Thought-Kritik-Modellen mit Raffination Feedback 批评:培训具有精炼反馈的 “ 长期研究链 “ 批评模型 2507.15024v1

Authors (9): Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, Junyang Lin

With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models’ critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. On critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark to identify erroneous steps in mathematical reasoning.


Article 379

Title@2025-07-20 (7): How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

Title: How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation Wie weit sind LLMs davon entfernt, unsere digitalen Zwillinge zu sein? Ein Benchmark für die Persona-Based Behavior Chain Simulation 如何远离“我们的数字双双”的LLMs? 以人为基础的行为链模拟基准 2502.14642v2

Authors (7): Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui

Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs’ ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.


Article 380

Title@2025-07-20 (7): Towards Harmonized Uncertainty Estimation for Large Language Models

Title: Towards Harmonized Uncertainty Estimation for Large Language Models Hin zu einer harmonisierten Ungewissheitsschätzung für große Sprachmodelle 争取为大语言模式统一不确定性估算 2505.19073v2

Authors (7): Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui

To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis shows that these methods struggle to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, with consistent improvements of up to 60% over existing methods.


Article 381

Title@2025-07-20 (7): Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Title: Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian Dr.Copilot: Ein Multi-Agent Prompt Optimierter Assistent zur Verbesserung der Patienten-Doktor-Kommunikation auf Rumänisch 副驾驶:罗马尼亚改善病人-医生沟通多代理快速优化助理 2507.11299v2

Authors (4): Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi

Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr. Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr. Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.


Article 382

Title@2025-07-20 (7): Why Does New Knowledge Create Messy Ripple Effects in LLMs?

Title: Why Does New Knowledge Create Messy Ripple Effects in LLMs? Warum erzeugt Neues Wissen in LLMs messy Ripple Effekte? 为什么新知识会在LLMS中产生混乱的波纹效应? 2407.12828v3

Authors (5): Jiaxin Qin, Zixuan Zhang, Manling Li, Pengfei Yu, Heng Ji

Extensive previous research has focused on post-training knowledge editing (KE) for language models (LMs) to ensure that knowledge remains accurate and up-to-date. One desired property and open question in KE is to let edited LMs correctly handle ripple effects, where the LM is expected to answer questions about logically related knowledge accurately. In this paper, we answer the question of why most KE methods still create messy ripple effects. We conduct extensive analysis and identify a salient indicator, GradSim, that effectively reveals when and why updated knowledge ripples in LMs. GradSim is computed as the cosine similarity between the gradients of the original fact and its related knowledge. We observe a strong positive correlation between ripple effect performance and GradSim across different LMs, KE methods, and evaluation metrics. Further investigations into three counter-intuitive failure cases (Negation, Over-Ripple, Multi-Lingual) of ripple effects demonstrate that these failures are often associated with very low GradSim. This finding validates that GradSim is an effective indicator of when knowledge ripples in LMs.
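
A minimal sketch of the GradSim computation as the abstract describes it: the cosine similarity between two gradient vectors. In the paper these would be gradients of an LM's loss on the edited fact and on a related fact. The toy linear model and the helper `squared_error_grad` below are hypothetical stand-ins so the example runs without a language model.

```python
import numpy as np

def grad_sim(g1, g2, eps=1e-12):
    """Cosine similarity between two (flattened) gradient vectors."""
    g1, g2 = np.ravel(g1), np.ravel(g2)
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps))

def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy stand-in for an
    LM's loss gradient on one fact)."""
    return (w @ x - y) * x

w = np.zeros(3)
g_fact = squared_error_grad(w, np.array([1.0, 0.0, 0.0]), 1.0)
g_related = squared_error_grad(w, np.array([1.0, 1.0, 0.0]), 1.0)
g_unrelated = squared_error_grad(w, np.array([0.0, 0.0, 1.0]), 1.0)

sim_related = grad_sim(g_fact, g_related)      # shares a feature: about 0.707
sim_unrelated = grad_sim(g_fact, g_unrelated)  # orthogonal features: 0.0
```

A gradient step on the edited fact also reduces the loss on the related example because their gradients point in similar directions, and leaves the unrelated one untouched. That is the intuition behind using gradient similarity as an indicator of ripple effects.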


Article 383

Title@2025-07-20 (7): Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition

Title: Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Unterstützung der SENCOTEN Sprachdokumentation Bemühungen mit automatischer Spracherkennung 支持SENCOTEN语文文件工作,并自动语音识别 2507.10827v2

Authors (6): Mengzhe Geng, Patrick Littell, Aidan Pine, PENÁĆ, Marc Tessier, Roland Kuhn

The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.
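
For readers unfamiliar with the two metrics reported above, here is a minimal sketch of word error rate (WER) and character error rate (CER) as Levenshtein edit distance divided by reference length. The example strings are generic English placeholders, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

reference = "the cat sat"
hypothesis = "the cat sit"
example_wer = wer(reference, hypothesis)   # 1 wrong word out of 3
example_cer = cer(reference, hypothesis)   # 1 wrong character out of 11
```

A single wrong character makes a whole word count as an error, which is why WER is usually much higher than CER, as in the reported 19.34% versus 5.09%.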



Article 384

Title@2025-07-20 (7): MUR: Momentum Uncertainty guided Reasoning for Large Language Models

Title: MUR: Momentum Uncertainty guided Reasoning for Large Language Models MUR: Momentum Ungewissheit geführte Begründung für große Sprachmodelle MUR:大语言模型的动态不确定性引导理由 2507.14958v1

Authors (11): Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu

Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
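The idea of tracking stepwise uncertainty with momentum can be sketched as follows. This is an illustrative reading of the abstract, not the paper's algorithm. The exponential-moving-average update, the trigger rule `u > momentum / gamma`, and the names `select_steps_to_scale`, `alpha`, and `gamma` are all assumptions made for this example.

```python
def select_steps_to_scale(step_uncertainties, alpha=0.9, gamma=0.9):
    """Return indices of reasoning steps that would receive extra compute.

    step_uncertainties: positive per-step uncertainty scores (hypothetical inputs).
    alpha: momentum coefficient of the exponential moving average (assumed).
    gamma: value in (0, 1]; smaller values raise the threshold, so fewer
           steps are flagged (assumed form of a single-knob budget control).
    """
    flagged = []
    momentum = None
    for index, u in enumerate(step_uncertainties):
        if momentum is None:
            momentum = u  # initialise the running estimate with the first step
            continue
        # Flag the step if it is notably more uncertain than the history so far.
        if u > momentum / gamma:
            flagged.append(index)
        # Aggregate the new observation into the running estimate.
        momentum = alpha * momentum + (1 - alpha) * u
    return flagged


# Hypothetical uncertainty trace with one spike at step 2.
spike_steps = select_steps_to_scale([1.0, 1.0, 3.0, 1.0], alpha=0.5, gamma=0.9)
```

In this sketch only the flagged steps would be handed to a test-time scaling method, so lowering `gamma` trades accuracy headroom for fewer extra tokens.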



Article 385

Title@2025-07-20 (7): SYNTHIA: Synthetic Yet Naturally Tailored Human-Inspired PersonAs

Title: SYNTHIA: Synthetic Yet Naturally Tailored Human-Inspired PersonAs SYNTHIA: Synthetisch und doch natürlich maßgeschneiderte, von Menschen inspirierte Person SYNTHIA:合成但自然而然定制的受人类启发的人 2507.14922v1

Authors (4): Vahid Rahimzadeh, Erfan Moosavi Monazzah, Mohammad Taher Pilehvar, Yadollah Yaghoobzadeh

Persona-driven LLMs have emerged as powerful tools in computational social science, yet existing approaches fall at opposite extremes, either relying on costly human-curated data or producing synthetic personas that lack consistency and realism. We introduce SYNTHIA, a dataset of 30,000 backstories derived from 10,000 real social media users of the open BlueSky platform across three time windows, bridging this spectrum by grounding synthetic generation in authentic user activity. Our evaluation demonstrates that SYNTHIA achieves competitive performance with state-of-the-art methods in demographic diversity and social survey alignment while significantly outperforming them in narrative consistency. Uniquely, SYNTHIA incorporates temporal dimensionality and provides rich social interaction metadata from the underlying network, enabling new research directions in computational social science and persona-driven language modeling.



Article 386

Title@2025-07-20 (7): GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

Title: GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks GTSinger: Globales Multi-Technique Singen Corpus mit realistischen Noten für alle Singaufgaben GTSinger:一个拥有现实音乐分数的全唱任务全球多技术多技术歌唱公司 2409.13832v8

Authors (18): Yu Zhang, Changhao Pan, Wenxiang Guo, Ruiqi Li, Zhiyuan Zhu, Jialei Wang, Wenhao Xu, Jingyu Lu, Zhiqing Hong, Chuxin Wang, LiChao Zhang, Jinzheng He, Ziyue Jiang, Yuxin Chen, Chen Yang, Jiecheng Zhou, Xinyu Cheng, Zhou Zhao

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The demos can be found at http://aaronz345.github.io/GTSingerDemo/. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/AaronZ345/GTSinger and https://github.com/AaronZ345/GTSinger.



Article 387

Title@2025-07-20 (7): On Entity Identification in Language Models

Title: On Entity Identification in Language Models Zur Identitätskennung in Sprachmodellen 关于在语文模式中实体识别 2506.02701v4

Authors (5): Masaki Sakata, Benjamin Heinzerling, Sho Yokoi, Takumi Ito, Kentaro Inui

We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions – ambiguity and variability – and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.



Article 388

Title@2025-07-20 (7): PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

Title: PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation PromptSuite: Ein Task-Agnostic Framework für die Multi-Prompt-Generation 快速实用:多生一代任务不可确定框架 2507.14913v1

Authors (4): Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky

Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible - working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API: https://github.com/eliyahabba/PromptSuite, and a user-friendly web interface: https://promptsuite.streamlit.app/



Article 389

Title@2025-07-20 (7): AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction

Title: AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction AutoGen Driven Multi Agent Framework für iterative Kriminalität Datenanalyse und Vorhersage 循环犯罪数据分析和预测自动驱动器多剂框架 2506.11475v2

Authors (4): Syeda Kisaa Fatima, Tehreem Zubair, Noman Ahmed, Asifullah Khan

This paper introduces LUCID-MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI-powered framework where multiple AI agents collaboratively analyze and understand crime data. Our system consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns; a feedback component that reviews and refines analytical results; and a prediction component that forecasts future crime trends. With a well-designed prompt and the LLaMA-2-13B-Chat-GPTQ model, it runs completely offline and allows the agents to undergo self-improvement through 100 rounds of communication with less human interaction. A scoring function is incorporated to evaluate agent performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen-style agents for autonomous, scalable, and iterative analysis in social science domains, maintaining data privacy through offline execution. It also showcases a computational model with emergent intelligence, where the system’s global behavior emerges from the interactions of its agents. This emergent behavior manifests as enhanced individual agent performance, driven by collaborative dialogue between the LLM-based agents.



Article 390

Title@2025-07-20 (7): Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Title: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs Sparse Autoencoder-geführte Supervised Finetuning zu Mitigate Unerwartete Code-Switching in LLMs 用于LLMM 中非预期代码切换的微亮自定义编码器导导监督调整 2507.14894v1

Authors (6): Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng

Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models’ performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.



Article 391

Title@2025-07-20 (7): MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction

Title: MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction MEKiT: Multi-Source Heterogene Wissensinjektionsmethode über Instruction Tuning für Emotion-Cause-Paar-Extraktion MEKIT:通过情感-原因对等采掘教学图示,多源源、异种知识注射法 2507.14887v1

Authors (6): Shiyi Mu, Yongkang Liu, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

Although large language models (LLMs) excel in text comprehension and generation, their performance on the Emotion-Cause Pair Extraction (ECPE) task, which requires reasoning ability, often falls short of that of smaller language models. The main reason is the lack of auxiliary knowledge, which limits LLMs’ ability to effectively perceive emotions and reason about causes. To address this issue, we propose a novel Multi-source hEterogeneous Knowledge injection meThod, MEKiT, which integrates heterogeneous internal emotional knowledge and external causal knowledge. Specifically, for these two distinct aspects and structures of knowledge, we apply the approaches of incorporating instruction templates and mixing data for instruction-tuning, which respectively help LLMs identify emotions more comprehensively and reason about causes more accurately. Experimental results demonstrate that MEKiT provides a more effective and adaptable solution for the ECPE task, exhibiting an absolute performance advantage over compared baselines and dramatically improving the performance of LLMs on the ECPE task.



Article 392

Title@2025-07-20 (7): A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models

Title: A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models Eine Übersicht über die Entwicklung sprachmodellbasierter Dialogsysteme: Daten, Aufgaben und Modelle 语文示范对话系统演变概览:数据、任务和模式 2311.16789v2

Authors (7): Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jingyan Zhou, Yufei Wang, Kam-Fai Wong

Dialogue systems (DS), including the task-oriented dialogue system (TOD) and the open-domain dialogue system (ODD), have always been a fundamental task in natural language processing (NLP), allowing various applications in practice. Owing to sophisticated training and well-designed model architecture, language models (LM) are usually adopted as the necessary backbone to build the dialogue system. Consequently, every breakthrough in LM brings about a shift in learning paradigm and research attention within dialogue system, especially the appearance of pre-trained language models (PLMs) and large language models (LLMs). In this paper, we take a deep look at the history of the dialogue system, especially its special relationship with the advancements of language models. Thus, our survey offers a systematic perspective, categorizing different stages in a chronological order aligned with LM breakthroughs, providing a comprehensive review of state-of-the-art research outcomes. What’s more, we turn our attention to emerging topics and engage in a discussion on open challenges, providing valuable insights into the future directions for LLM-based dialogue systems. In summary, this survey delves into the dynamic interplay between language models and dialogue systems, unraveling the evolutionary path of this essential relationship. Through this exploration, we pave the way for a deeper comprehension of the field, guiding future developments in LM-based dialogue systems.



Article 393

Title@2025-07-20 (7): Controlling Language Confusion in Multilingual LLMs

Title: Controlling Language Confusion in Multilingual LLMs Sprachkonfusion in mehrsprachigen LLMs kontrollieren 多语种LMM中控制语言混杂 2505.19116v2

Authors (4): Nahyun Lee, Yeongseo Woo, Hyunwoo Ko, Guijin Son

Large language models often suffer from language confusion, a phenomenon in which responses are partially or entirely generated in unintended languages. This critically degrades the user experience, especially in low-resource settings. We hypothesize that this issue stems from limitations in conventional fine-tuning objectives, such as supervised learning, which optimize the likelihood of correct tokens without explicitly penalizing undesired outputs such as cross-lingual mixing. Analysis of loss trajectories during pretraining further reveals that models fail to distinguish between monolingual and language-mixed texts, highlighting the absence of inherent pressure to avoid such confusion. In this work, we apply ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppressing language-confused generations. ORPO maintains strong language consistency, even under high decoding temperatures, while preserving general QA performance. Our findings suggest that incorporating appropriate penalty terms can effectively mitigate language confusion in multilingual models, particularly in low-resource scenarios.
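To make the "SFT plus penalty" structure concrete, here is a small sketch of the ORPO objective (Odds Ratio Preference Optimization) for a single preference pair. It assumes the length-normalized log-likelihoods of the two responses are already available. Treating a monolingual response as "chosen" and a language-mixed one as "rejected" is an assumption about how the method would be applied here, and the weight `lam` and the numeric values are illustrative.

```python
import math


def log_odds(avg_log_prob):
    """log(p / (1 - p)) for p = exp(avg_log_prob); requires avg_log_prob < 0."""
    p = math.exp(avg_log_prob)
    return avg_log_prob - math.log1p(-p)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(z)) equals softplus(-z)
    penalty = softplus(-log_odds_ratio)
    return sft_term + lam * penalty


# Hypothetical values: the language-mixed response is still fairly likely
# in the first case and clearly less likely in the second.
loss_confused = orpo_loss(-0.5, -0.6)
loss_separated = orpo_loss(-0.5, -2.0)
```

The SFT term is identical in both cases, so the difference in loss comes entirely from the penalty, which shrinks as the model makes the rejected response less likely than the chosen one.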



Article 394

Title@2025-07-20 (7): Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages

Title: Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages Transformer und Ensemble-Methoden: Eine Lösung für Hass-Spracherkennung in arabischen Sprachen 变换器和组合方法:用阿拉伯语探测仇恨言论的解决方案 2303.09823v2

Authors (4): Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani

This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using two ensemble approaches. The best results on the training set, in a five-fold cross-validation scenario, were obtained by using the ensemble approach based on the majority vote. The evaluation of this approach on the test set resulted in an F1-score of 0.60 and an accuracy of 0.86.
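A majority-vote ensemble of the kind mentioned above can be sketched in a few lines. This is a generic illustration with made-up predictions rather than the authors' code. The tie-breaking rule, under which the label seen first among the tied ones wins, is an assumption that matters when an even number of models splits evenly.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: list of lists, one inner list per model, all of
    the same length (one label per example).
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        # most_common orders equal counts by first occurrence (Python 3.7+).
        label, _count = Counter(votes).most_common(1)[0]
        ensemble.append(label)
    return ensemble


# Hypothetical binary predictions from three models on three examples.
combined = majority_vote([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])
```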



Article 395

Title@2025-07-20 (7): Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding

Title: Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding Über isolierte Fähigkeiten hinaus: Überbrückung von Long CoT-Reasoning und Long Context Understanding 超越孤立能力:连接长 CoT理由和长期理解 2507.14849v1

Authors (1): Yifei Wang

Reasoning distillation has emerged as an effective approach to enhance the reasoning capabilities of smaller language models. However, the impact of large-scale reasoning distillation on other critical abilities, particularly in-context retrieval and reasoning, remains unexplored. This gap in understanding is particularly significant given the increasing importance of Retrieval-Augmented Generation (RAG) systems, where efficient acquisition and utilization of contextual information are paramount for generating reliable responses. Motivated by the need to understand how the extended long-CoT process influences long-context comprehension, we conduct a comprehensive investigation using a series of open-source models distilled from DeepSeek-R1, renowned for its exceptional reasoning capabilities. Our study focuses on evaluating these models’ performance in extracting and integrating relevant information from extended contexts through multi-document question answering tasks. Through rigorous experimentation, we demonstrate that distilled reasoning patterns significantly improve long-context understanding. Our analysis reveals that distillation fosters greater long-context awareness by promoting more detailed and explicit reasoning processes during context analysis and information parsing. This advancement effectively mitigates the persistent “lost in the middle” issue that has hindered long-context models.



Article 396

Title@2025-07-20 (7): The Invisible Leash: Why RLVR May Not Escape Its Origin

Title: The Invisible Leash: Why RLVR May Not Escape Its Origin Die unsichtbare Leine: Warum RLVR seinem Ursprung nicht entkommen kann 隐形Leash:为什么RLVR不能逃离其起源 2507.14843v1

Authors (5): Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi

Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model’s reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model’s support (it cannot sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
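The notion of answer-level entropy can be illustrated with a short sketch that estimates it from sampled final answers. This is one plausible empirical estimator rather than the paper's exact definition. The function name `answer_entropy`, the use of natural logarithms, and the sample lists are assumptions of this example.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    total = len(sampled_answers)
    counts = Counter(sampled_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples for one problem: eight answers before and after RL training.
base_samples = ["12", "15", "12", "7", "9", "15", "12", "3"]
rlvr_samples = ["12", "12", "12", "12", "15", "12", "12", "12"]

base_h = answer_entropy(base_samples)
rlvr_h = answer_entropy(rlvr_samples)
```

In this toy case the second sample set concentrates on fewer distinct answers, so its entropy is lower even if the individual generation paths were more varied token by token.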



Article 397

Title@2025-07-20 (7): Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents

Title: Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents Doc2Chart: Intent-Driven Zero-Shot Chart Generation aus Dokumenten Doc2图示: 从文档中生成零热图 2507.14819v1

Authors (4): Akriti Jain, Pritika Ramu, Aparna Garimella, Apoorv Saxena

Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of intent-based chart generation from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-stage framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 (intent, document, charts) tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points in chart data accuracy and up to 17 points in chart type.



Article 398

Title@2025-07-20 (7): FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Title: FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing FastLongSpeech: Erweiterung großer Sprachmodelle für eine effiziente Langspeech-Verarbeitung FastLongSpeech:加强大型语音-语言模型,以高效长语音处理 2507.14815v1

Authors (6): Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng

The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.



Article 399

Title@2025-07-20 (7): Lizard: An Efficient Linearization Framework for Large Language Models

Title: Lizard: An Efficient Linearization Framework for Large Language Models Lizard: Ein effizienter Linearisierungsrahmen für große Sprachmodelle Lizard:大型语言模型的高效线性框架 2507.09025v2

Authors (12): Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model’s performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.



Article 400

Title@2025-07-20 (7): Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems

Title: Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems Schwache Überwachungstechniken für verbesserte ASR-Modelle in CRM-Systemen auf Industrieebene 在工业级客户关系管理系统中加强ASR模型的薄弱监督技术 2507.16843v1

Authors (6): Zhongsheng Wang, Sijie Wang, Jia Wang, Yung-I Liang, Yuxi Zhang, Jiamou Liu

In the design of customer relationship management (CRM) systems, accurately identifying customer types and offering personalized services are key to enhancing customer satisfaction and loyalty. However, this process faces the challenge of discerning customer voices and intentions, and general pre-trained automatic speech recognition (ASR) models struggle to handle industry-specific speech recognition tasks effectively. To address this issue, we propose a solution for fine-tuning industry-specific ASR models, which significantly improves the performance of the fine-tuned ASR models in industry applications. Experimental results show that our method substantially strengthens the crucial auxiliary role of the ASR model in industry CRM systems, and this approach has also been adopted in actual industrial applications.



Article 401

Title@2025-07-20 (7): A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

Title: A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios Eine Umfrage über großsprachige modellbasierte Sozialagenten in Spiel-Theoretischen Szenarien 关于游戏理论情景中以大语言模式为基础的社会因素的调查 2412.03920v2

Authors (8): Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong

Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents’ preferences, beliefs, and reasoning abilities, as well as their interactions and synergistic effects on decision-making. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. Additionally, we analyze the performance of current social agents across various game scenarios. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.



Article 402

Title@2025-07-19 (6): GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization

Title: GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization GRACE: Generative Empfehlung über Journey-Aware Sparse Achtung bei der Ketten-of-Thought-Tokenisierung GRACE: 通过Journey-Aware Sparass 注意力在 “ 探索链 “ 中产生的建议 2507.14758v1

Authors (18): Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan

Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.



Article 403

Title@2025-07-19 (6): Domain-Adaptive Small Language Models for Structured Tax Code Prediction

Title: Domain-Adaptive Small Language Models for Structured Tax Code Prediction Domain-Adaptive kleine Sprachmodelle für strukturierte Steuervorhersage 结构化税法预测结构化税法 2507.10880v2

Authors (3): Souvik Nath, Sumit Wadhwa, Luis Perez

Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC, is a major use case in tax compliance. An accurate determination of such codes is imperative to avoid any tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences using unstructured product and services data. We employ an SLM based upon an encoder-decoder architecture, as this enables sequential generation of tax codes to capture the hierarchical dependencies present within the tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of the domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve superior results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC), or Brazil’s Nomenclatura Comum do Mercosul (NCM).



Article 404

Title@2025-07-19 (6): On the robustness of modeling grounded word learning through a child’s egocentric input

Title: On the robustness of modeling grounded word learning through a child’s egocentric input Auf die Robustheit der Modellierung geerdetes Wort Lernen durch den egozentrischen Input eines Kindes 通过儿童以自我为中心的投入进行基于基础的模拟文字学习的强健性 2507.14749v1

Authors (2): Wai Keen Vong, Brenden M. Lake

What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children’s input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child’s developmental experience could acquire word-referent mappings. However, whether this approach’s success reflects the idiosyncrasies of a single child’s experience, or whether it would show consistent and robust learning patterns across multiple children’s experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire and generalize word-referent mappings across multiple network architectures. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child’s developmental experiences.



Article 405

Title@2025-07-19 (6): Disparities in Peer Review Tone and the Role of Reviewer Anonymity

Title: Disparities in Peer Review Tone and the Role of Reviewer Anonymity Unterschiede in Peer Review Tone und die Rolle der Reviewer Anonymität 同行审查方式和审查者匿名作用的差异 2507.14741v1

Authors (2): Maria Sahakyan, Bedoor AlShebli

The peer review process is often regarded as the gatekeeper of scientific integrity, yet increasing evidence suggests that it is not immune to bias. Although structural inequities in peer review have been widely debated, much less attention has been paid to the subtle ways in which language itself may reinforce disparities. This study undertakes one of the most comprehensive linguistic analyses of peer review to date, examining more than 80,000 reviews in two major journals. Using natural language processing and large-scale statistical modeling, it uncovers how review tone, sentiment, and supportive language vary across author demographics, including gender, race, and institutional affiliation. Using a data set that includes both anonymous and signed reviews, this research also reveals how the disclosure of reviewer identity shapes the language of evaluation. The findings not only expose hidden biases in peer feedback, but also challenge conventional assumptions about anonymity’s role in fairness. As academic publishing grapples with reform, these insights raise critical questions about how review policies shape career trajectories and scientific progress.



Article 406

Title@2025-07-19 (6): Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots

Title: Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots Eine Stimme finden: Das Potenzial der afroamerikanischen Dialekt- und Sprachgenerierung für Chatbots erforschen 寻找声音:探索非裔美国人为查波特人创造语音和语音的潜力 2501.03441v2

Authors (4): Sarah E. Finch, Ellie S. Paek, Ikseon Choi, Jinho D. Choi

As chatbots become integral to daily life, personalizing systems is key for fostering trust, engagement, and inclusivity. This study examines how linguistic similarity affects chatbot performance, focusing on integrating African American English (AAE) into virtual agents to better serve the African American community. We develop text-based and spoken chatbots using large language models and text-to-speech technology, then evaluate them with AAE speakers against standard English chatbots. Our results show that while text-based AAE chatbots often underperform, spoken chatbots benefit from an African American voice and AAE elements, improving performance and preference. These findings underscore the complexities of linguistic personalization and the dynamics between text and speech modalities, highlighting technological limitations that affect chatbots’ AA speech generation and pointing to promising future research directions.



Article 407

Title@2025-07-19 (6): Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems

Title: Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems Sorformer: Ein neuartiger Ansatz für Permutations-Resolved Speaker Supervision in Speech-to-Text Systemen 排序前:语音到文字系统变换解决的议长监督新办法 2409.06656v3

Authors (9): Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.



Article 408

Title@2025-07-19 (6): Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation

Title: Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation Dynamische Kontext-Tunings für retrieval-angereicherte Generation: Multi-Turn-Planung und Werkzeuganpassung verbessern 回收-提款一代动态环境图示:加强多周期规划和工具适应 2506.11092v2

Authors (4): Jubin Abhishek Soni, Amit Anand, Rajesh Kumar Pandey, Aniket Abhishek Soni

Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.



Article 409

Title@2025-07-19 (6): APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Title: APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay APIGen-MT: Agentische Pipeline für die Multi-Turn-Datengenerierung über simuliertes Agent-Human-Interplay PAPIGen-MT: 通过模拟代理人间相互作用生成多发数据时的代理管道 2504.03601v4

Authors (15): Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models – the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; Dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k and Website at https://apigen-mt.github.io



Article 410

Title@2025-07-19 (6): Towards the Next Frontier in Speech Representation Learning Using Disentanglement

Title: Towards the Next Frontier in Speech Representation Learning Using Disentanglement Auf dem Weg zur nächsten Front in der Sprachrepräsentanz Lernen mit Entflechtung 走向使用分离手段进行演讲代表学习的下一个前沿 2407.02543v2

Authors (2): Varun Krishna, Sriram Ganapathy

The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this approach has shown promising downstream task performance for speech recognition and related tasks, it has largely ignored factors of speech that are encoded at a coarser level, such as characteristics of the speaker or channel that remain consistent throughout a speech utterance. In this work, we propose a framework for Learning Disentangled Self Supervised (termed Learn2Diss) representations of speech, which consists of a frame-level and an utterance-level encoder module. The two encoders are initially learned independently, where the frame-level model is largely inspired by existing self-supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual-information-based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.



Article 411

Title@2025-07-19 (6): Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation

Title: Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation Umdenken bei der Erkennung von Selbstmordgedanken: Ein vertrauensvolles Annotations-Framework und Cross-Lingual Model Evaluation 重新思考潮ideideididation 探测:可信赖的注解框架和跨语言模式评价 2507.14693v1

Authors (3): Amina Dzafic, Merve Kavut, Ulya Bayram

Suicidal ideation detection is critical for real-time suicide prevention, yet its progress faces two under-explored challenges: limited language coverage and unreliable annotation practices. Most available datasets are in English, but even among these, high-quality, human-annotated data remains scarce. As a result, many studies rely on available pre-labeled datasets without examining their annotation process or label reliability. The lack of datasets in other languages further limits the global realization of suicide prevention via artificial intelligence (AI). In this study, we address one of these gaps by constructing a novel Turkish suicidal ideation corpus derived from social media posts and introducing a resource-efficient annotation framework involving three human annotators and two large language models (LLMs). We then address the remaining gaps by performing a bidirectional evaluation of label reliability and model consistency across this dataset and three popular English suicidal ideation detection datasets, using transfer learning through eight pre-trained sentiment and emotion classifiers. These transformers help assess annotation consistency and benchmark model performance against manually labeled data. Our findings underscore the need for more rigorous, language-inclusive approaches to annotation and evaluation in mental health natural language processing (NLP) while demonstrating the questionable performance of popular models with zero-shot transfer learning. We advocate for transparency in model training and dataset construction in mental health NLP, prioritizing data and model reliability.



Article 412

Title@2025-07-19 (6): Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

Title: Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations Mind the Gap: Eine Überprüfung der arabischen Post-Training-Datensätze und deren Einschränkungen 《思想差距:对阿拉伯培训后数据集及其局限性的审查》 2507.14688v1

Authors (8): Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., persona and system prompts); (3) Alignment (e.g., cultural, safety, ethics, and fairness), and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic LLMs and applications while providing concrete recommendations for future efforts in post-training dataset development.



Article 413

Title@2025-07-19 (6): MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Title: MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization MiroMind-M1: Eine Open-Source-Erhöhung in mathematischer Reasoning über kontextorientierte Multi-Stage-Politikoptimierung MiroMind-MM1:通过上下文软件多层次政策优化在数学理由方面的开放源码进步 2507.14683v1

Authors (18): Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing

Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.



Article 414

Title@2025-07-19 (6): Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

Title: Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care Große Sprachmodelle als medizinische Codes-Selektoren: ein Maßstab unter Verwendung der Internationalen Klassifikation der Primärversorgung 大语言模式作为医疗法典选择者:使用国际初级保健分类的基准 2507.14681v1

Authors (7): Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Egbert van der Haring, Kees van Boven, Marcelo Finger, Luis Fernandez Lopez

Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.



Article 415

Title@2025-07-19 (6): Docopilot: Improving Multimodal Models for Document-Level Understanding

Title: Docopilot: Improving Multimodal Models for Document-Level Understanding Docopilot: Verbesserung multimodaler Modelle für die Verständigung auf Dokumentebene Docopolil:改进文件级理解的多模式模式 2507.14675v1

Authors (12): Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang

Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot



Article 416

Title@2025-07-19 (6): Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs

Title: Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs Cleanse: Ungewissheitsabschätzungsansatz mit Clustering-basierter semantischer Konsistenz in LLMs 清洁性:在LLMM中采用基于集群的语义一致性 2507.14649v1

Authors (2): Minsuh Joo, Hyunsoo Cho

Despite the outstanding performance of large language models (LLMs) across various NLP tasks, hallucinations in LLMs, where LLMs generate inaccurate responses, remain a critical problem, as they directly bear on building safe and reliable LLMs. Uncertainty estimation is primarily used to measure hallucination levels in LLM responses so that correct and incorrect answers can be distinguished clearly. This study proposes an effective uncertainty estimation approach, Clustering-based semantic consistency (Cleanse). Cleanse clusters the LLM hidden embeddings of the generations, which carry adequate semantic information, and quantifies uncertainty as the proportion of intra-cluster consistency in the total consistency between those embeddings. The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models (LLaMA-7B, LLaMA-13B, LLaMA2-7B, and Mistral-7B) and two question-answering benchmarks (SQuAD and CoQA).
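
The ratio described in the abstract above can be illustrated with a small sketch. It assumes that "consistency" between two generations is the cosine similarity of their embeddings (clipped at zero) and that cluster labels are already available; neither choice is stated in the abstract, and the function names are hypothetical rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_cluster_ratio(embeddings, labels):
    """Share of total pairwise consistency that falls inside clusters.

    embeddings: one vector per sampled generation (assumed non-zero).
    labels: cluster id per generation, from any clustering method.
    """
    intra, total = 0.0, 0.0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: negative similarities are clipped to zero.
            sim = max(cosine(embeddings[i], embeddings[j]), 0.0)
            total += sim
            if labels[i] == labels[j]:
                intra += sim
    return intra / total if total > 0 else 0.0
```

Under these assumptions, a ratio near 1 means most of the consistency comes from generations that share a cluster; how the paper converts this proportion into a final uncertainty score is not specified in the abstract.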



Article 417

Title@2025-07-19 (6): Linear Relational Decoding of Morphology in Language Models

Title: Linear Relational Decoding of Morphology in Language Models Lineare relationale Dekodierung der Morphologie in Sprachmodellen 语言模型中细胞体理学的线际关系代谢 2507.14640v1

Authors (2): Eric Xia, Jugal Kalita

A two-part affine approximation has been found to approximate transformer computations well over certain subject-object relations. Adapting the Bigger Analogy Test Set, we show that the linear transformation Ws, where s is a middle-layer representation of a subject token and W is derived from model derivatives, is also able to accurately reproduce final object states for many relations. This linear technique achieves 90% faithfulness on morphological relations, and we show similar findings across languages and across models. Our findings indicate that some conceptual relationships in language models, such as morphology, are readily interpretable from latent space and are sparsely encoded by cross-layer linear transformations.



Article 418

Title@2025-07-19 (6): CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages

Title: CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages CSSL: Kontrastives Selbst-überwachtes Lernen für Abhängigkeitsparsing auf relativ freien Word-Ordnung und morphologisch reichen Low Resource Sprachen CSSL: 相对自由的有秩序和有体力丰富、低资源语言的自学自学自导学习 2410.06944v2

Authors (4): Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal

Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points in 7 relatively free word order languages, as measured by the UAS/LAS Score metric when compared to the best performing baseline.



Article 419

Title@2025-07-19 (6): Growing a Twig to Accelerate Large Vision-Language Models

Title: Growing a Twig to Accelerate Large Vision-Language Models Einen Zweig wachsen, um große Visions-Sprachen-Modelle zu beschleunigen 为加速大型视觉语言模型而成长的Twig 2503.14075v2

Authors (10): Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu

Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM’s early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM – a simple and general architecture by growing a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods.



Article 420

Title: Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining Optimierung der Legal Document Retrieval in Vietnamesen mit semi-harten negativen Bergbau 优化越南法律文件检索,使用半硬负负采矿 2507.14619v1

Authors (4): Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
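
The abstract above does not define its semi-hard negatives. A minimal sketch, assuming the definition commonly used with triplet-style losses (a negative that scores below the positive but within a margin of it), might look like this; the function name, the score dictionary, and the margin value are all hypothetical.

```python
def select_semi_hard_negatives(pos_score, candidate_scores, margin=0.1):
    """Pick negatives scoring below the positive but within `margin` of it.

    pos_score: retriever similarity between the query and its relevant document.
    candidate_scores: dict mapping non-relevant doc ids to query similarities.
    """
    lower = pos_score - margin
    return [
        doc_id
        for doc_id, score in candidate_scores.items()
        if lower < score < pos_score
    ]
```

Negatives scoring above the positive (the hardest ones) are excluded under this definition. In retrieval corpora such candidates are often unlabeled relevant documents, which is one plausible reading of the "training bias" the abstract mentions, though the paper's own reasoning may differ.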



Article 421

Title@2025-07-19 (6): Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

Title: Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenian Primary Care: A Methodology Paper 肯尼亚初级保健背景示范测试的取回强化临床基准:方法文件 2507.14615v1

Authors (16): Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede

Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval-augmented generation (RAG) to ground clinical questions in Kenya’s national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale-based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset includes thousands of regulator-aligned question-answer pairs across common outpatient conditions. Beyond accuracy, we introduce evaluation metrics that test clinical reasoning, safety, and adaptability, such as rare case detection (Needle in the Haystack), stepwise logic (Decision Points), and contextual adaptability. Initial results reveal significant performance gaps when LLMs are applied to localized scenarios, consistent with findings that LLM accuracy is lower on African medical content than on US-based benchmarks. This work offers a replicable model for guideline-driven, dynamic benchmarking to support safe AI deployment in African health systems.



Article 422

Title@2025-07-19 (6): Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification

Title: Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification Backtranslation und Paraphrasierung in der LLM-Ära? Vergleich von Daten Augmentationsmethoden für die Emotionsklassifizierung LLM 时代的后翻和翻译? 比较情绪分类的数据增强方法 2507.14590v1

Authors (3): Łukasz Radliński, Mateusz Guściora, Jan Kocoń

Numerous domain-specific machine learning tasks struggle with data scarcity and class imbalance. This paper systematically explores data augmentation methods for NLP, particularly through large language models like GPT. Its purpose is to examine whether traditional methods such as paraphrasing and backtranslation can leverage a new generation of models to achieve performance comparable to purely generative methods. We selected ChatGPT-based methods that target data scarcity, together with an exemplary dataset. We conducted a series of experiments comparing four different approaches to data augmentation in multiple experimental setups. We then evaluated the results both in terms of the quality of generated data and its impact on classification performance. The key findings indicate that backtranslation and paraphrasing can yield comparable or even better results than zero-shot and few-shot generation of examples.



Article 423

Title@2025-07-19 (6): What do Large Language Models know about materials?

Title: What do Large Language Models know about materials? Was wissen Large Language Models über Materialien? 大语言模型对材料了解多少? 2507.14586v1

Authors (3): Adrian Ehrenhofer, Thomas Wallmersperger, Gianaurelio Cuniberti

Large Language Models (LLMs) are increasingly applied in the fields of mechanical engineering and materials science. As models that establish connections through the interface of language, LLMs can be applied for step-wise reasoning through the Processing-Structure-Property-Performance (PSPP) chain of material science and engineering. Current LLMs are built to adequately represent their training dataset, which is for the most part the accessible internet. However, the internet mostly contains non-scientific content. If LLMs are to be applied for engineering purposes, it is valuable to investigate models for their intrinsic knowledge – here: the capacity to generate correct information about materials. In the current work, for the example of the Periodic Table of Elements, we highlight the role of vocabulary and tokenization for the uniqueness of material fingerprints, and the capability of different state-of-the-art open models to generate factually correct output. This leads to a material knowledge benchmark that supports an informed choice of which steps in the PSPP chain LLMs are applicable to and where specialized models are required.



Article 424

Title@2025-07-19 (6): Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption

Title: Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption Erklärbares kollaboratives Problem beim Lösen der Diagnose mit BERT unter Verwendung von SHAP und dessen Implikationen für die Lehreradoption 使用SHAP及其对教师收养的影响,与BERT进行可解释的协作问题解决分析 2507.14584v1

Authors (3): Kester Wong, Sahan Bulathwela, Mutlu Cukurova

The use of the Bidirectional Encoder Representations from Transformers (BERT) model and its variants for classifying collaborative problem solving (CPS) has been extensively explored within the AI in Education community. However, limited attention has been given to understanding how individual tokenised words in the dataset contribute to the model’s classification decisions. Enhancing the explainability of BERT-based CPS diagnostics is essential to better inform end users such as teachers, thereby fostering greater trust and facilitating wider adoption in education. This study undertook a preliminary step towards model transparency and explainability by using SHapley Additive exPlanations (SHAP) to examine how different tokenised words in transcription data contributed to a BERT model’s classification of CPS processes. The findings suggested that well-performing classifications did not necessarily equate to a reasonable explanation for the classification decisions. Particular tokenised words were used frequently to affect classifications. The analysis also identified a spurious word, which contributed positively to the classification but was not semantically meaningful to the class. While such model transparency is unlikely to help end users improve their practice, it can help them avoid overrelying on LLM diagnostics and ignoring their human expertise. We conclude the workshop paper by noting that the extent to which the model appropriately uses the tokens for its classification is associated with the number of classes involved. This calls for investigating ensemble model architectures and human-AI complementarity for CPS diagnosis, since considerable human reasoning is still required for fine-grained discrimination of CPS subskills.



Article 425

Title@2025-07-19 (6): Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models

Title: Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models Erforschung der Human-AI-Komplementarität in der CPS-Diagnose mit unimodalen und multimodalen BERT-Modellen 利用单式和多式BERT模型探索在CPS诊断中人与AI的互补性 2507.14579v1

Authors (3): Kester Wong, Sahan Bulathwela, Mutlu Cukurova

Detecting collaborative problem solving (CPS) indicators from dialogue using machine learning techniques is a significant challenge for the field of AI in Education. Recent studies have explored the use of Bidirectional Encoder Representations from Transformers (BERT) models on transcription data to reliably detect meaningful CPS indicators. A notable advancement involved the multimodal BERT variant, AudiBERT, which integrates speech and acoustic-prosodic audio features to enhance CPS diagnosis. Although initial results demonstrated multimodal improvements, the statistical significance of these enhancements remained unclear, and there was insufficient guidance on leveraging human-AI complementarity for CPS diagnosis tasks. This workshop paper extends the previous research by highlighting that the AudiBERT model not only improved the classification of classes that were sparse in the dataset, but it also had statistically significant class-wise improvements over the BERT model for classifications in the social-cognitive dimension. However, similar significant class-wise improvements over the BERT model were not observed for classifications in the affective dimension. A correlation analysis highlighted that larger training data was significantly associated with higher recall performance for both the AudiBERT and BERT models. Additionally, the precision of the BERT model was significantly associated with high inter-rater agreement among human coders. When employing the BERT model to diagnose indicators within these subskills that were well-detected by the AudiBERT model, the performance across all indicators was inconsistent. We conclude the paper by outlining a structured approach towards achieving human-AI complementarity for CPS diagnosis, highlighting the crucial inclusion of model explainability to support human agency and engagement in the reflective coding process.



Article 426

Title@2025-07-19 (6): XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

Title: XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification XL-DURel: Feinsteuerungs-Sentenztransformatoren für die Ordnungs-Wort-in-Kontext-Klassifikation XL-DURel:Odinal Word-in-Ctext分类的微调句式变换器 2507.14578v1

Authors (2): Sachin Yadav, Dominik Schlechtweg

We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context (WiC) classification. We test several loss functions for regression and ranking tasks, and outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.



Article 427

Title@2025-07-19 (6): AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Title: AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs? AlgoTune: Können Sprachmodelle allgemeine numerische Programme beschleunigen? AlgoTune: 语言模型能加速通用计算程序吗? 2507.15887v1

Authors (24): Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press

Despite progress in language model (LM) capabilities, evaluations have thus far focused on models’ performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models’ ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 155 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.



Article 428

Title@2025-07-19 (6): BriLLM: Brain-inspired Large Language Model

Title: BriLLM: Brain-inspired Large Language Model BriLLM: Gehirninspiriertes Large Language Model BrILLM: 脑启发型大语言模式 2503.11299v5

Authors (5): Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong

This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and offers interpretability for all nodes on the graph of the whole model, unlike traditional machine learning models, which have only limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flows between nodes along paths on the principle of “least resistance”. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models, since the model size is independent of the input and predicted length of the model. The model’s working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we have released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.



Article 429

Title: KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse KVLink: Beschleunigen von großen Sprachmodellen über effiziente KV Cache Reuse KVLink: 通过高效 KV 缓存再利用加速大语言模型 2502.16002v3

Authors (5): Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.
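
The position adjustment can be pictured with a small sketch. Each document is cached on its own with local positions starting at 0. After concatenation, every cached entry needs the global position it would have had in the joined context. The helper below only computes those global indices. How the cached keys are then re-embedded is not shown, and the function name `global_positions` is hypothetical.

```python
def global_positions(doc_lengths):
    """Map each independently cached document to its positions after concatenation."""
    positions, offset = [], 0
    for length in doc_lengths:
        # Local position p in this document becomes offset + p globally.
        positions.append(list(range(offset, offset + length)))
        offset += length
    return positions

# Two retrieved documents with 3 and 2 cached tokens respectively.
layout = global_positions([3, 2])
print(layout)  # [[0, 1, 2], [3, 4]]
```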



Article 430

Title@2025-07-19 (6): MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Title: MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation MEMERAG: Mehrsprachiger Meta-Evaluierungs-Benchmark für retrieval Augmented Generation MEMEMAAG: 回收增加的一代多语言端到末至末的元值评价基准 2502.17163v4

Authors (6): María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG



Article 431

Title@2025-07-19 (6): Efficient Whole Slide Pathology VQA via Token Compression

Title: Efficient Whole Slide Pathology VQA via Token Compression Effiziente ganze Folie Pathologie VQA über Token Compression 通过 Token 压缩高效的全幻灯片病理学 VQA 2507.14497v1

Authors (7): Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen

Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.



Article 432

Title@2025-07-19 (6): TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Title: TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios ZEIT: Mehrstufiger Benchmark für die zeitliche Reasonierung von LLMs in realen Szenarien 时间:现实世界情景中LLMs的多层次时间理由基准 2505.12891v2

Authors (8): Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models, provide an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME .



Article 433

Title@2025-07-19 (6): Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Title: Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification Label-Semantik Aware Generativer Ansatz für Domain-Agnostic Multilabel-Klassifikation 域-不可知性多标签分类的认知生成方法 2506.06806v2

Authors (5): Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly

The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
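
The inference-time matching step can be sketched as a nearest-neighbour lookup by cosine similarity. In the paper the embeddings come from a finetuned sentence transformer. Here they are toy 2-D vectors, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_labels(generated_vecs, label_vecs):
    """Map each generated-description embedding to its closest predefined label."""
    matched = []
    for g in generated_vecs:
        best = max(label_vecs, key=lambda name: cosine(g, label_vecs[name]))
        if best not in matched:  # multi-label output: keep each label once
            matched.append(best)
    return matched

# Toy embeddings standing in for sentence-transformer outputs.
label_vecs = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
generated = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3]]
labels = match_labels(generated, label_vecs)
print(labels)  # ['sports', 'politics']
```

Because matching happens in embedding space, a generated description does not have to reproduce a label description word for word to be mapped to it.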



Article 434

Title@2025-07-19 (6): SWI: Speaking with Intent in Large Language Models

Title: SWI: Speaking with Intent in Large Language Models SWI: Sprechen mit Intent in großen Sprachmodellen SWI:用大语言模型表达意向 2503.21544v2

Authors (3): Yuwei Yin, EunJeong Hwang, Giuseppe Carenini

Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model’s underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs’ generation and reasoning abilities with cognitive notions.



Article 435

Title@2025-07-19 (6): Draft-based Approximate Inference for LLMs

Title: Draft-based Approximate Inference for LLMs Entwurfsbasierte annähernde Schlussfolgerung für LLM LLMM 的基于草案的近似推论 2506.08373v2

Authors (6): Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model’s attention activations to identify and discard unimportant prompt tokens. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.



Article 436

Title@2025-07-19 (6): AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Title: AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization AlphaDPO: Adaptive Prämienspanne für direkte Präferenzoptimierung AlphaDPO: 直接优化优惠的适应性回报边缘 2410.10148v4

Authors (8): Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at https://github.com/junkangwu/alpha-DPO
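
For context, the two baseline objectives named in the abstract are usually written as follows, as a reminder from the DPO and SimPO papers rather than from this abstract. Here $y_w$ and $y_l$ are the preferred and rejected responses, $\sigma$ is the logistic function, $\beta$ is a scaling factor, and $\gamma$ is SimPO's fixed target reward margin. The $\alpha$-DPO objective itself is not given in the abstract and is not reproduced here.

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

$$
\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$

DPO depends on the reference model $\pi_{\mathrm{ref}}$, while SimPO drops it and uses the constant $\gamma$. These are the two limitations the dynamic margin is meant to address.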



Article 437

Title@2025-07-19 (6): Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Title: Texture or Semantics? Vision-Language Models Get Lost in Font Recognition Textur oder Semantik? Vision-Sprachen-Modelle Verloren in Schrifterkennung 纹理还是语义学? 2503.23768v2

Authors (6): Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang

Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.



Article 438

Title@2025-07-19 (6): Vulnerability of LLMs to Vertically Aligned Text Manipulations

Title: Vulnerability of LLMs to Vertically Aligned Text Manipulations Schwachstelle von LLMs an vertikal ausgerichtete Textmanipulationen LLMM LLM 易发生垂直一致的文本处理 2410.20016v3

Authors (7): Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-wei Chang

Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. While current large language models (LLMs) have excelled in natural language tasks, they remain vulnerable to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
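
To make "vertically aligned" concrete, here is a minimal sketch of one way to write words top to bottom in adjacent columns. The exact layout used in the paper is not specified in the abstract, so this formatting and the function name `to_vertical` are assumptions.

```python
def to_vertical(words):
    """Lay out each word as a column, one character per row."""
    height = max(len(w) for w in words)
    rows = []
    for r in range(height):
        # Pad shorter words with spaces so the columns stay aligned.
        cells = [w[r] if r < len(w) else " " for w in words]
        rows.append(" ".join(cells).rstrip())
    return "\n".join(rows)

vertical = to_vertical(["bad", "day"])
print(vertical)
# b d
# a a
# d y
```

A human reads down the columns without difficulty, but a tokenizer sees the characters interleaved row by row, which plausibly relates to the tokenization issues in finding (iii).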



Article 439

Title@2025-07-19 (6): DRS: Deep Question Reformulation With Structured Output

Title: DRS: Deep Question Reformulation With Structured Output DRS: Tiefenfrage-Reformulation mit strukturierter Ausgabe DRS: 用结构化产出进行深度问题重新分析 2411.17993v5

Authors (6): Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.



Article 440

Title@2025-07-19 (6): VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

Title: VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension VlogQA: Aufgaben-, Datensatz- und Ausgangsmodelle für vietnamesisch gesprochene maschinelle Leseverständnisse VlogQA:越南语音机器阅读理解的任务、数据集和基线模型 2402.02655v3

Authors (5): Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our best deep-learning model achieved an F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of exact match (EM), the highest score we achieved is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.
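
The gap between the reported F1 and EM scores is easier to read with the two metrics side by side. The sketch below assumes the common SQuAD-style definitions (case-insensitive exact match and whitespace-token F1); the abstract does not state the exact normalization used for VlogQA, so the function names and the tokenization here are illustrative assumptions.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 only when the two answers match after trimming and lowercasing."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are whitespace-separated and compared case-insensitively.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus extra words scores 0 on EM but still earns partial F1, which is one reason EM tends to sit well below F1 on noisy spoken transcripts.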



Article 441

Title@2025-07-19 (6): It’s Not That Simple. An Analysis of Simple Test-Time Scaling

Title: It’s Not That Simple. An Analysis of Simple Test-Time Scaling Es ist nicht so einfach. Eine Analyse der einfachen Test-Zeit-Skalierung 不是那么简单 简单的测试时间缩放分析 2507.14419v1

Authors (1): Guojun Wu

Prior work proposed simple test-time scaling, a method for replicating this scaling behavior with models distilled from o1-like models by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending “Wait” when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending “Wait” leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model’s maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance, beyond what the model could originally achieve, rather than merely reproducing the appearance of scaling behavior.
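
The two controls described above (a hard maximum length for scaling down, and appending “Wait” when the model tries to stop for scaling up) can be sketched as a small decoding loop. This is a minimal illustration, not the implementation analyzed in the paper: `next_token`, `min_tokens`, the `<end>` marker, and `toy_model` are all hypothetical stand-ins for a real model interface.

```python
from typing import Callable, List


def budget_forced_decode(
    next_token: Callable[[List[str]], str],  # hypothetical model interface
    prompt: List[str],
    max_tokens: int,
    min_tokens: int = 0,
    end_token: str = "<end>",
    wait_token: str = "Wait",
) -> List[str]:
    """Decode with a hard length cap and an optional forced minimum length."""
    output: List[str] = []
    while len(output) < max_tokens:  # scaling down: enforce a maximum length
        token = next_token(prompt + output)
        if token == end_token:
            if len(output) < min_tokens:
                # Scaling up: suppress the stop and append "Wait" instead.
                output.append(wait_token)
                continue
            break
        output.append(token)
    return output


def toy_model(context: List[str]) -> str:
    """Toy stand-in: emits two tokens after the prompt or a "Wait", then stops."""
    since_restart = 0
    for tok in reversed(context):
        if tok in ("Wait", "Q"):
            break
        since_restart += 1
    return "think" if since_restart < 2 else "<end>"
```

With the toy model, `budget_forced_decode(toy_model, ["Q"], max_tokens=1)` truncates the trace, while `min_tokens=4` forces one extra round of reasoning after a “Wait”.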



Article 442

Title@2025-07-19 (6): Inverse Scaling in Test-Time Compute

Title: Inverse Scaling in Test-Time Compute Inverse Skalierung in der Testzeit berechnen 测试时间计算中的反反缩放 2507.14417v1

Authors (14): Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.



Article 443

Title@2025-07-18 (5): Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Title: Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning Orchestrator-Agent Trust: Ein modulares KI-Visualisierungssystem mit vertrauensbewusster Orchestrierung und RAG-basierter Reasoning Orchetor-Agentor-Agentor Trust:一个具有信托软件管弦和RAG依据的理由的模块代理 AI 视觉分类系统 2507.10571v2

Authors (4): Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas

Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components including the complete software source code are openly released to support reproducibility, transparency, and community benchmarking on GitHub: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
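
Of the calibration metrics named above, ECE (Expected Calibration Error) has a widely used standard form, sketched below with equal-width confidence bins. The bin count and function name are assumptions for illustration; the abstract does not specify how the authors compute ECE, and OCR and CCC are not covered here.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

An overconfident agent reports high confidence on predictions that are often wrong, so its mean confidence exceeds its accuracy in the upper bins and its ECE rises.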



Article 444

Title@2025-07-18 (5): Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions

Title: Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions Bewertung der Zuverlässigkeit großer Sprachmodelle für deduktives Qualitatives Coding: Eine vergleichende Studie von ChatGPT-Interventionen 评估减减量化定性编码大语言模型的可靠性:对聊天点、低质量编码的干预措施的比较研究 2507.14384v1

Authors (2): Angjelin Hila, Elliott Hauser

In this study, we investigate the use of large language models (LLMs), specifically ChatGPT, for structured deductive qualitative coding. While most current research emphasizes inductive coding applications, we address the underexplored potential of LLMs to perform deductive classification tasks aligned with established human-coded schemes. Using the Comparative Agendas Project (CAP) Master Codebook, we classified U.S. Supreme Court case summaries into 21 major policy domains. We tested four intervention methods: zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition strategy, across repeated samples. Performance was evaluated using standard classification metrics (accuracy, F1-score, Cohen’s kappa, Krippendorff’s alpha), and construct validity was assessed using chi-squared tests and Cramer’s V. Chi-squared and effect size analyses confirmed that intervention strategies significantly influenced classification behavior, with Cramer’s V values ranging from 0.359 to 0.613, indicating moderate to strong shifts in classification patterns. The Step-by-Step Task Decomposition strategy achieved the strongest reliability (accuracy = 0.775, kappa = 0.744, alpha = 0.746), achieving thresholds for substantial agreement. Despite the semantic ambiguity within case summaries, ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses. These findings demonstrate that with targeted, custom-tailored interventions, LLMs can achieve reliability levels suitable for integration into rigorous qualitative coding workflows.
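
For readers unfamiliar with the agreement statistics reported above, Cohen's kappa corrects raw agreement for the agreement expected by chance given each coder's label frequencies. The sketch below uses the standard two-rater formula; the function name and the toy labels in the usage note are illustrative and not taken from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with human codes `["x", "x", "y", "y"]` and model codes `["x", "x", "y", "x"]`, raw agreement is 0.75 but kappa is 0.5, because half of that agreement is expected by chance.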



Article 445

Title@2025-07-18 (5): Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms

Title: Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms Kombinatorische Optimierung für alle: Verwendung von LLMs zur Unterstützung von Nicht-Experten bei der Verbesserung von Optimierungsalgorithmen 组合优化全民:利用LLMs帮助非专家改进最佳化算法 2503.10968v2

Authors (2): Camilo Chacón Sartori, Christian Blum

Large Language Models (LLMs) have shown notable potential in code generation for optimization algorithms, unlocking exciting new opportunities. This paper examines how LLMs, rather than creating algorithms from scratch, can improve existing ones without the need for specialized expertise. To explore this potential, we selected 10 baseline optimization algorithms from various domains (metaheuristics, reinforcement learning, deterministic, and exact methods) to solve the classic Travelling Salesman Problem. The results show that our simple methodology often results in LLM-generated algorithm variants that improve over the baseline algorithms in terms of solution quality, reduction in computational time, and simplification of code complexity, all without requiring specialized optimization knowledge or advanced algorithmic implementation skills.



Article 446

Title@2025-07-18 (5): Error-Aware Curriculum Learning for Biomedical Relation Classification

Title: Error-Aware Curriculum Learning for Biomedical Relation Classification Error-Aware Curriculum Learning for Biomedical Relation Classification 生物医学关系分类的错误意识课程学习 2507.14374v1

Authors (3): Sinchani Chakraborty, Sudeshna Sarkar, Pawan Goyal

Relation Classification (RC) in biomedical texts is essential for constructing knowledge graphs and enabling applications such as drug repurposing and clinical decision-making. We propose an error-aware teacher–student framework that improves RC through structured guidance from a large language model (GPT-4o). Prediction failures from a baseline student model are analyzed by the teacher to classify error types, assign difficulty scores, and generate targeted remediations, including sentence rewrites and suggestions for KG-based enrichment. These enriched annotations are used to train a first student model via instruction tuning. This model then annotates a broader dataset with difficulty scores and remediation-enhanced inputs. A second student is subsequently trained via curriculum learning on this dataset, ordered by difficulty, to promote robust and progressive learning. We also construct a heterogeneous biomedical knowledge graph from PubMed abstracts to support context-aware RC. Our approach achieves new state-of-the-art performance on 4 of 5 PPI datasets and the DDI dataset, while remaining competitive on ChemProt.



Article 447

Title@2025-07-18 (5): Text-to-SQL for Enterprise Data Analytics

Title: Text-to-SQL for Enterprise Data Analytics Text-zu-SQL für Enterprise Data Analytics 企业数据分析的文本到SQL 2507.14372v1

Authors (18): Albert Chen, Manas Bundele, Gaurav Ahlawat, Patrick Stetz, Zhitao Wang, Qiang Fei, Donghoon Jung, Audrey Chu, Bharadwaj Jayaraman, Ayushi Panth, Yatin Arora, Sourav Jain, Renjith Varma, Alexey Ilin, Iuliia Melnychuk, Chelsea Chueh, Joyan Sil, Xiaofeng Wang

The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn’s product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.



Article 448

Title@2025-07-18 (5): Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs

Title: Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs Layerwise Recall und die Geometrie des verwobenen Wissens in LLMs 平整图层回溯和LLM 中互交知识的几何 2502.10871v2

Authors (2): Ge Lei, Samuel J. Cooper

This study explores how large language models (LLMs) encode interwoven scientific knowledge, using chemical elements and LLaMA-series models as a case study. We identify a 3D spiral structure in the hidden states that aligns with the conceptual structure of the periodic table, suggesting that LLMs can reflect the geometric organization of scientific concepts learned from text. Linear probing reveals that middle layers encode continuous, overlapping attributes that enable indirect recall, while deeper layers sharpen categorical distinctions and incorporate linguistic context. These findings suggest that LLMs represent symbolic knowledge not as isolated facts, but as structured geometric manifolds that intertwine semantic information across layers. We hope this work inspires further exploration of how LLMs represent and reason about scientific knowledge, particularly in domains such as materials science.



Article 449

Title@2025-07-18 (5): Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

Title: Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans Analysieren Sie die Neuronen, nicht die Einbettungen: Verstehen, wann und wo LLM-Darstellungen mit Menschen ausgerichtet sind 分析神经,而不是内嵌:了解LLM代表何时何地与人类对齐 2502.15090v2

Authors (6): Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald

Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM’s learned representations align with human representations. In this work, we introduce a novel approach to study representation alignment: we adopt a method from research on activation steering to identify neurons responsible for specific concepts (e.g., “cat”) and then analyze the corresponding activation patterns. We find that LLM representations captured this way closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Our approach significantly outperforms the alignment captured by word embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts: we show that LLMs organize concepts in a way that mirrors human concept organization.



Article 450

Title@2025-07-18 (5): Can LLMs Infer Personality from Real World Conversations?

Title: Can LLMs Infer Personality from Real World Conversations? Kann LLMs Persönlichkeit von Real World Conversations ableiten? ” 现实世界对话 “ 的推论人性能能否得到LLMs? 2507.14355v1

Authors (3): Jianfeng Zhu, Ruoming Jin, Karin G. Coifman

Large Language Models (LLMs) such as OpenAI’s GPT-4 and Meta’s LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson’s $r = 0.27$), interrater agreement was low (Cohen’s $\kappa < 0.10$), and predictions were biased toward moderate or high trait levels. Chain-of-thought prompting and longer input context modestly improved distributional alignment, but not trait-level accuracy. These results underscore limitations in current LLM-based personality inference and highlight the need for evidence-based development for psychological applications.



Article 451

Title@2025-07-18 (5): Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers

Title: Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers Solo-Anschluss: Eine Parameter-Effiziente Feintuning-Technik für Transformatoren Solo 连接: 用于变形器的参数节能微调技术 2507.14353v1

Authors (2): Harsh Nilesh Pathak, Randy Paffenroth

Parameter-efficient fine-tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) to newer tasks. One of the most prominent PEFT approaches, Low-Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre-trained Transformer (GPT2). In contrast, we introduce Solo Connection, a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early version of Large Language Models (LLMs). Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12-layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link outputs of different decoder blocks, potentially enhancing the model’s ability to adapt to new tasks while leveraging pre-trained knowledge.



Article 452

Title@2025-07-18 (5): Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Title: Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark Document Haystack: Ein langer Kontext Multimodales Bild/Dokument Verständnis Vision LLM Benchmark Haystack文件:长期、多模式图像/文件理解愿景LLM基准 2507.15882v1

Authors (5): Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur

The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image “needles” at various depths within the documents to challenge VLMs’ retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.



Article 453

Title@2025-07-18 (5): Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Title: Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle 速度计划: 遮蔽传播语言模型的饱和日程安排 2506.19037v2

Authors (3): Omer Luxembourg, Haim Permuter, Eliya Nachmani

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
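
To make "non-adjacent dilated groups" concrete, the sketch below partitions sequence positions by a fixed stride, so that no two positions in the same group are neighbours when the stride is at least 2. This is only one simple way to build such groups and is an assumption for illustration; the abstract does not give the exact schedule DUS uses, and `dilated_groups` is a hypothetical helper name.

```python
from typing import List


def dilated_groups(seq_len: int, dilation: int) -> List[List[int]]:
    """Split positions 0..seq_len-1 into `dilation` interleaved groups.

    Group g holds positions g, g + dilation, g + 2 * dilation, and so on.
    Each group would be unmasked in one parallel denoising step.
    """
    if dilation < 1:
        raise ValueError("dilation must be a positive integer")
    return [list(range(offset, seq_len, dilation)) for offset in range(dilation)]
```

For instance, `dilated_groups(8, 4)` yields `[[0, 4], [1, 5], [2, 6], [3, 7]]`: eight positions are covered in four parallel steps instead of eight sequential ones, and positions unmasked together are always at least four tokens apart.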



Article 454

Title@2025-07-18 (5): Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Title: Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning Symbolische Mixture-of-Experts: Adaptives Skill-basiertes Routing für heterogene Vernunft 专家的混合符号:基于适应性技能的异异源理据调离 2503.05641v3

Authors (5): Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting task-level experts is often too coarse-grained, as heterogeneous tasks may require different expertise per instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE’s instance-level expert selection improves performance by a large margin but – when implemented naively – can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we show that Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE generalizes well to unseen tasks and removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.
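
The batching idea described above (group instances by their assigned experts so that each model is loaded only once) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Symbolic-MoE code: `load_expert` is a hypothetical callable that returns a model function, and the expert names in the usage note are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def group_by_expert(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert an instance -> experts mapping into expert -> instances."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for instance_id, experts in assignments.items():
        for expert in experts:
            batches[expert].append(instance_id)
    return dict(batches)


def run_batched(
    assignments: Dict[str, List[str]],
    load_expert: Callable[[str], Callable[[str], str]],  # hypothetical loader
) -> Dict[Tuple[str, str], str]:
    """Run every (instance, expert) pair while loading each expert once."""
    outputs: Dict[Tuple[str, str], str] = {}
    for expert, instance_ids in group_by_expert(assignments).items():
        model = load_expert(expert)  # one load per expert, not per instance
        for instance_id in instance_ids:
            outputs[(instance_id, expert)] = model(instance_id)
    return outputs
```

With assignments such as `{"q1": ["math", "bio"], "q2": ["math"]}`, a naive per-instance loop would load models three times, whereas the grouped loop loads only the two distinct experts. The per-instance outputs would then be passed to the aggregator.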



Article 455

Title@2025-07-18 (5): How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs

Title: How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs Wie LLMs zeitliche Bedeutung in Narratives verstehen: Eine Fallstudie zur kognitiven Bewertung von LLMs LLM女士 在叙述中如何理解时间含义:对LLMs进行认知评价的案例研究 2507.14307v1

Authors (8): Karin de Langis, Jong Inn Park, Andreas Schramm, Bin Hu, Khanh Chi Le, Michael Mensink, Ahn Thu Tong, Dongyeop Kang

Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs’ cognitive and linguistic capabilities.



Article 456

Title@2025-07-18 (5): Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study

Title: Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study Ausrichtung großer Sprachmodelle auf ressourcenarme Sprachen durch LLM-basierte Selektive Übersetzung: Eine systematische Studie 通过基于LLM的选择性翻译,使大语言模式与低资源语言相一致:系统研究 2507.14304v1

Authors (7): Rakesh Paul, Anusha Kamath, Kanishk Singla, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar

Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.



Article 457

Title@2025-07-18 (5): In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

Title: In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding In-Depth und In-Breadth: Vorschulung multimodaler Sprachmodelle für ein umfassendes Chart-Verständnis In-Deph和In-Breadth:为全面了解图表而定制的培训前多模式语言模式 2507.14298v1

Authors (6): Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Alexander Jacobson, Lu Yuan, Leonid Sigal

Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations. First, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types. Second, they lack targeted pre-training for chart-data alignment, which hampers the model’s understanding of underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.



Article 458

Title@2025-07-18 (5): WebGuard: Building a Generalizable Guardrail for Web Agents

Title: WebGuard: Building a Generalizable Guardrail for Web Agents WebGuard: Aufbau einer generalisierbaren Leitplanke für Web-Agenten WebGuard:为网络代理建立一个通用的警卫车 2507.14293v1

Authors (11): Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su

The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.



Article 459

Title@2025-07-18 (5): A General Framework for Inference-time Scaling and Steering of Diffusion Models

Title: A General Framework for Inference-time Scaling and Steering of Diffusion Models Ein allgemeiner Rahmen für Schlussfolgerungs-Zeit-Skalierung und Steuerung von Diffusionsmodellen 传播模型的推推时间缩放和引导总框架 2501.06848v5

Authors (7): Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath

Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower-perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models, even with off-the-shelf rewards, can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
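
The particle-resampling idea in this abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the potential `exp(lam * reward)`, the multinomial resampling, the fixed resampling interval, and the names `resample_particles`, `steer`, `step_fn`, and `reward_fn` are all assumptions made for illustration, and the "diffusion step" is replaced by a one-dimensional random walk.

```python
import math
import random


def resample_particles(particles, rewards, lam=1.0, rng=None):
    """Resample particles with probability proportional to exp(lam * reward).

    The potential exp(lam * reward) is an assumed choice for this sketch.
    """
    rng = rng or random.Random(0)
    top = max(rewards)
    # Subtract the maximum reward before exponentiating for numerical stability.
    weights = [math.exp(lam * (r - top)) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))


def steer(init_particles, step_fn, reward_fn, num_steps,
          resample_every=2, lam=1.0, seed=0):
    """Advance all particles, resampling at regular intermediate steps."""
    rng = random.Random(seed)
    particles = list(init_particles)
    for t in range(1, num_steps + 1):
        # Stand-in for one sampler step applied to each particle.
        particles = [step_fn(x, rng) for x in particles]
        if t % resample_every == 0:
            rewards = [reward_fn(x) for x in particles]
            particles = resample_particles(particles, rewards, lam, rng)
    return particles


if __name__ == "__main__":
    # Toy setting: a random walk steered towards the value 3.0.
    final = steer(
        init_particles=[0.0] * 8,
        step_fn=lambda x, rng: x + rng.gauss(0.0, 1.0),
        reward_fn=lambda x: -abs(x - 3.0),
        num_steps=10,
    )
    print(final)
```

Because the steering happens only through resampling, no gradients of the reward are needed, which matches the gradient-free control mentioned in the abstract.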



Article 460

Title@2025-07-18 (5): Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

Title: Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning Harmonie in Divergenz: Auf dem Weg zu einer schnellen, präzisen und speichereffizienten Null-Order-LLM Feinabstimmung 和谐共存:快速、准确和记忆效率高的零级LLM微调 2502.03304v2

Authors (9): Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan

Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization has stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Aiming to match the learning capacity of FO methods based on these findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://anonymous.4open.science/r/DiZO-E86D.
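
To make "relying solely on forward passes for gradient estimation" concrete, here is a minimal sketch of a generic two-point zeroth-order estimator on a toy quadratic loss. It does not implement DiZO's layer-wise projections; the function names `zo_gradient` and `zo_sgd`, the Gaussian perturbation, and all hyperparameters are assumptions chosen for the example.

```python
import random


def zo_gradient(loss_fn, params, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate using only loss evaluations."""
    rng = rng or random.Random(0)
    # Random Gaussian perturbation direction (an assumed, common choice).
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def zo_sgd(loss_fn, params, lr=0.05, steps=500, eps=1e-3, seed=0):
    """Plain SGD driven by the zeroth-order estimate; no backward pass."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, eps, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


if __name__ == "__main__":
    target = [1.0, -2.0]

    def quadratic(p):
        return sum((pi - ti) ** 2 for pi, ti in zip(p, target))

    print(zo_sgd(quadratic, [0.0, 0.0]))
```

Each update costs two loss evaluations and no stored activations, which is the memory advantage the abstract refers to; the noisy single-direction estimate is also why plain ZO methods tend to converge more slowly than FO methods.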



Article 461

Title@2025-07-18 (5): NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Title: NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining NoHumansRequired: Autonome High-Quality Bildbearbeitung Triplet Mining 无人要求:自主高品质图像编辑三线采矿 2507.14119v1

Authors (7): Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.



Article 462

Title@2025-07-18 (5): MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Title: MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs MultiBLiMP 1.0: Ein massiver Mehrsprachigkeits-Benchmark für sprachliche Minimal Pairs MultiBLiMP 1.0:语言最小对等语言多语种大比例基准 2504.02768v2

Authors (4): Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.



Article 463

Title@2025-07-18 (5): Learning to Reason at the Frontier of Learnability

Title: Learning to Reason at the Frontier of Learnability Vernunft lernen an der Grenze der Lernfähigkeit 学习在可学习的前沿学习理性 2502.12272v5

Authors (5): Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster

Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts (meaning they are already learned) or by none (providing no meaningful training signal). To address this, we adapt a method from the reinforcement learning literature, sampling for learnability, and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.
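
The "high variance of success" criterion can be made concrete: for a question answered correctly with probability p, the variance of the binary outcome is p(1 - p), which is zero when the question is always or never solved and largest at p = 0.5. The sketch below ranks questions by that quantity. The function names and the example success rates are hypothetical, and how success rates are estimated during training is not specified here.

```python
def learnability(success_rate):
    """Variance of a Bernoulli outcome with the given success probability."""
    return success_rate * (1.0 - success_rate)


def prioritise(success_rates, k):
    """Return the k question ids with the highest learnability score."""
    ranked = sorted(
        success_rates,
        key=lambda q: learnability(success_rates[q]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical per-question success rates estimated from repeated attempts.
    rates = {"q1": 0.0, "q2": 0.5, "q3": 1.0, "q4": 0.9}
    print(prioritise(rates, 2))
```

Questions that are always solved (`q3`) or never solved (`q1`) score zero and are deprioritised, which mirrors the observation that such questions carry no useful training signal.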



Article 464

Title@2025-07-18 (5): Sparse Rewards Can Self-Train Dialogue Agents

Title: Sparse Rewards Can Self-Train Dialogue Agents Sparse Belohnungen können Selbst-Train Dialogmittel 可自我培训对话代理器 2409.04617v3

Authors (4): Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, Yi Yang

Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLMs continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub at https://github.com/asappresearch/josh-llm-simulation-training



Article 465

Title@2025-07-18 (5): DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits

Title: DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits DENSE: Longitudinal Progress Note Generation mit zeitlicher Modellierung von heterogenen klinischen Anmerkungen über Krankenhausbesuche hinweg DENSE: 医院全程探视不同临床诊断说明的实时建模纵向进展说明的生成 2507.14079v1

Authors (2): Garapati Keerthana, Manik Gupta

Progress notes are among the most clinically meaningful artifacts in an Electronic Health Record (EHR), offering temporally grounded insights into a patient’s evolving condition, treatments, and care decisions. Despite their importance, they are severely underrepresented in large-scale EHR datasets. For instance, in the widely used Medical Information Mart for Intensive Care III (MIMIC-III) dataset, only about $8.56\%$ of hospital visits include progress notes, leaving gaps in longitudinal patient narratives. In contrast, the dataset contains a diverse array of other note types, each capturing different aspects of care. We present DENSE (Documenting Evolving Progress Notes from Scattered Evidence), a system designed to align with clinical documentation workflows by simulating how physicians reference past encounters while drafting progress notes. The system introduces a fine-grained note categorization and a temporal alignment mechanism that organizes heterogeneous notes across visits into structured, chronological inputs. At its core, DENSE leverages a clinically informed retrieval strategy to identify temporally and semantically relevant content from both current and prior visits. This retrieved evidence is used to prompt a large language model (LLM) to generate clinically coherent and temporally aware progress notes. We evaluate DENSE on a curated cohort of patients with multiple visits and complete progress note documentation. The generated notes demonstrate strong longitudinal fidelity, achieving a temporal alignment ratio of $1.089$, surpassing the continuity observed in original notes. By restoring narrative coherence across fragmented documentation, our system supports improved downstream tasks such as summarization, predictive modeling, and clinical decision support, offering a scalable solution for LLM-driven note synthesis in real-world healthcare settings.



Article 466

Title@2025-07-18 (5): Critiques of World Models

Title: Critiques of World Models Kritik an Weltmodellen 世界模式的证明 2507.05169v2

Authors (4): Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

The world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.



Article 467

Title@2025-07-18 (5): On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

Title: On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding On-Policy-Optimierung mit äquivalenter Gruppenpräferenz für das Multiprogrammieren des Sprachverständnisses 与多方案语言理解的集团等效优先 2505.12723v2

Authors (9): Haoyuan Wu, Rui Ming, Jilong Gao, Hangyu Zhao, Xueyi Chen, Yikai Yang, Haisheng Zheng, Zhuolun He, Bei Yu

Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using groups of intermediate representations (IRs). LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that training LLMs with OORL on code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.



Article 468

Title@2025-07-18 (5): Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog

Title: Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog Kollaboratives Rational Speech Act: Pragmatische Begründung für Multi-Turn-Dialog 《合作合理言论法:多发对话的实用理由》 2507.14063v1

Authors (5): Lautaro Estienne, Gabriel Ben Zenou, Nona Naderi, Jackie Cheung, Pablo Piantanida

As AI systems take on collaborative roles, they must reason about shared goals and beliefs, not just generate fluent language. The Rational Speech Act (RSA) framework offers a principled approach to pragmatic reasoning, but existing extensions face challenges in scaling to multi-turn, collaborative scenarios. In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain is an extension of the gain model that is maximized in the original RSA model but takes into account the scenario in which both agents in a conversation have private information and produce utterances conditioned on the dialog. We demonstrate the effectiveness of CRSA on referential games and template-based doctor-patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines, paving the way for more pragmatic and socially aware language agents.
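
For readers unfamiliar with the baseline that CRSA extends, the sketch below implements the standard single-turn RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on a small reference game. It is not CRSA: the multi-turn gain function is not reproduced here, and the lexicon, uniform prior, zero utterance cost, and function names are assumptions made for the example.

```python
def normalize(dist):
    """Scale non-negative values so they sum to one (all zeros stay zero)."""
    total = sum(dist.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in dist.items()}


def literal_listener(lexicon, prior):
    """L0(m | u) is proportional to truth(u, m) * prior(m)."""
    return {
        u: normalize({m: truth[m] * prior[m] for m in prior})
        for u, truth in lexicon.items()
    }


def pragmatic_speaker(l0, meanings, alpha=1.0):
    """S1(u | m) is proportional to L0(m | u) ** alpha (no utterance cost)."""
    return {m: normalize({u: l0[u][m] ** alpha for u in l0}) for m in meanings}


def pragmatic_listener(s1, prior, utterances):
    """L1(m | u) is proportional to S1(u | m) * prior(m)."""
    return {
        u: normalize({m: s1[m][u] * prior[m] for m in prior})
        for u in utterances
    }


# Hypothetical reference game: three objects, four one-word utterances.
MEANINGS = ["blue_square", "blue_circle", "green_square"]
LEXICON = {
    "blue": {"blue_square": 1, "blue_circle": 1, "green_square": 0},
    "green": {"blue_square": 0, "blue_circle": 0, "green_square": 1},
    "square": {"blue_square": 1, "blue_circle": 0, "green_square": 1},
    "circle": {"blue_square": 0, "blue_circle": 1, "green_square": 0},
}
PRIOR = {m: 1.0 / len(MEANINGS) for m in MEANINGS}

if __name__ == "__main__":
    l0 = literal_listener(LEXICON, PRIOR)
    s1 = pragmatic_speaker(l0, MEANINGS)
    l1 = pragmatic_listener(s1, PRIOR, list(LEXICON))
    print(l1["blue"])
```

In this game the literal listener is indifferent between the two blue objects on hearing "blue", while the pragmatic listener favours the blue square, reasoning that a speaker who meant the blue circle would more likely have said "circle".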



Article 469

Title@2025-07-18 (5): EdgeVLA: Efficient Vision-Language-Action Models

Title: EdgeVLA: Efficient Vision-Language-Action Models EdgeVLA: Effiziente Vision-Sprache-Aktionsmodelle EdgeVLA: 高效率的愿景-语言-行动模式 2507.14049v1

Authors (9): Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, Benjamin Bolte

Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training codebase (https://github.com/kscalelabs/evla) to foster further research.



Article 470

Title@2025-07-18 (5): Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Title: Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs Cross-Lingual Auto-Evaluation für die Bewertung mehrsprachiger LLMs 评估多种语文LLMs的跨语言自动评价 2410.13394v2

Authors (6): Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments than proprietary models do, demonstrating the effectiveness of such cross-lingual evaluation in low-resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.



Article 471

Title@2025-07-18 (5): Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks

Title: Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks Bewertung der Wirksamkeit von kosteneffizienten großen Sprachmodellen in biomedizinischen Benchmark-Aufgaben 评价基准生物医学任务中成本效率高的大型语言模型的效力 2507.14045v1

Authors (4): Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang

This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as biomedical text classification and generation, question answering, and multimodal image processing. Our experimental findings indicate that there is no single LLM that can consistently outperform others across all tasks. Instead, different LLMs excel in different tasks. While some closed-source LLMs demonstrate strong performance on specific tasks, their open-source counterparts achieve comparable results (sometimes even better), with additional benefits like faster inference and enhanced privacy. Our experimental results offer valuable insights for selecting models that are optimally suited for specific biomedical applications.



Article 472

Title@2025-07-18 (5): Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Title: Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen 通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查 2503.09567v5

Authors (10): Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like “overthinking” and “inference-time scaling.” This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.



Article 473

Title@2025-07-18 (5): CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis

Title: CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis CPC-CMS: Kognitives Paarweises Vergleichs-Klassifikation Modellauswahl-Framework für Dokument-Level-Sentimentanalyse CPC-CMS:文件级别感知分析文件级别感应分析的认知对称比较比较分类示范选择框架 2507.14022v1

Authors (2): Jianfei Li, Kevin Kam Fung Yuen

This study proposes the Cognitive Pairwise Comparison Classification Model Selection (CPC-CMS) framework for document-level sentiment analysis. The CPC, based on expert knowledge judgment, is used to calculate the weights of evaluation criteria, including accuracy, precision, recall, F1-score, specificity, Matthews Correlation Coefficient (MCC), Cohen’s Kappa (Kappa), and efficiency. Naive Bayes, Linear Support Vector Classification (LSVC), Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), and A Lite Bidirectional Encoder Representations from Transformers (ALBERT) are chosen as classification baseline models. A weighted decision matrix, consisting of classification evaluation scores weighted by the criteria weights, is formed to select the best classification model for a classification problem. Three open social media datasets are used to demonstrate the feasibility of the proposed CPC-CMS. Based on our simulation, for evaluation results excluding the time factor, ALBERT is the best for the three datasets; if time consumption is included, no single model always performs better than the other models. The CPC-CMS can be applied to other classification applications in different areas.
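
The selection step, a weighted decision matrix over evaluation criteria, can be sketched as a simple weighted sum. This is an illustration under stated assumptions rather than the paper's procedure: the CPC step that derives the weights from expert judgments is not reproduced, the function name `select_model` is hypothetical, and the scores and weights below are made-up numbers, not results from the study.

```python
def select_model(scores, weights):
    """Pick the model with the highest weighted average of criterion scores.

    scores:  {model_name: {criterion: score in [0, 1]}}
    weights: {criterion: non-negative weight}
    """
    total_weight = sum(weights.values())
    aggregated = {
        model: sum(weights[c] * per_criterion[c] for c in weights) / total_weight
        for model, per_criterion in scores.items()
    }
    best = max(aggregated, key=aggregated.get)
    return best, aggregated


if __name__ == "__main__":
    # Made-up scores for two hypothetical models on two criteria.
    scores = {
        "model_a": {"accuracy": 0.9, "efficiency": 0.2},
        "model_b": {"accuracy": 0.8, "efficiency": 0.9},
    }
    print(select_model(scores, {"accuracy": 1.0, "efficiency": 0.0}))
    print(select_model(scores, {"accuracy": 0.5, "efficiency": 0.5}))
```

Changing the weights can change the winner, which is the mechanism behind the abstract's remark that the ranking differs once time consumption is included as a criterion.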



Article 474

Title@2025-07-18 (5): Efficient Temporal Tokenization for Mobility Prediction with Large Language Models

Title: Efficient Temporal Tokenization for Mobility Prediction with Large Language Models Effiziente zeitliche Tokenisierung für Mobilitätsvorhersage mit großen Sprachmodellen 具有大语言模式的流动预测高效时时适调 2507.14017v1

Authors (4): Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang

We introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a framework that leverages large language models (LLMs) as spatio-temporal predictors and trajectory reasoners. RHYTHM partitions trajectories into daily segments encoded as discrete tokens with hierarchical attention, capturing both daily and weekly dependencies while substantially reducing the sequence length. Token representations are enriched with pre-computed prompt embeddings via a frozen LLM, enhancing the model’s ability to capture interdependencies without extensive computational overhead. By freezing the LLM backbone, RHYTHM achieves significant computational efficiency. Evaluation on three real-world datasets demonstrates a 2.4% improvement in accuracy, 5.0% increase on weekends, and 24.6% reduction in training time compared to state-of-the-art methods.



Article 475

Title@2025-07-18 (5): On the class of coding optimality of human languages and the origins of Zipf’s law

Title: On the class of coding optimality of human languages and the origins of Zipf’s law Über die Klasse der Kodierung der optimalen menschlichen Sprachen und die Ursprünge des Zipfschen Gesetzes 在人类语言最优化的编码和齐普夫法律的起源方面 2505.20015v4

Authors (1): Ramon Ferrer-i-Cancho

Here we present a new class of optimality for coding systems. Members of that class are displaced linearly from optimal coding and thus exhibit Zipf’s law, namely a power-law distribution of frequency ranks. Within that class, Zipf’s law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf’s law are potential members of the class. In contrast, the communication systems of some other species cannot be members of that class because they exhibit an exponential distribution instead, although those of dolphins and humpback whales might be. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are displaced by a linear function whose slope is the exponent of Zipf’s law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. We provide support for the hypothesis that Zipf’s law originates from compression and define testable conditions for the emergence of Zipf’s law in compressing systems.
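
The "straight line in double logarithmic scale" can be checked numerically: under Zipf's law, frequency is proportional to rank raised to a negative exponent, so log frequency is linear in log rank and the negated slope is the exponent. The sketch below estimates that slope by ordinary least squares. The function name `zipf_exponent` and the synthetic data are assumptions, and a least-squares fit on log-log data is only a rough estimator chosen here for simplicity, not the paper's method.

```python
import math


def zipf_exponent(frequencies):
    """Estimate the Zipf exponent as the negated log-log slope of the
    frequency-versus-rank curve, fitted by ordinary least squares.

    Needs at least two positive frequencies.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


if __name__ == "__main__":
    # Synthetic frequencies following an exact power law with exponent 1.
    synthetic = [1000.0 / rank for rank in range(1, 51)]
    print(zipf_exponent(synthetic))
```

An exponential rank distribution, by contrast, would curve downwards on the same axes rather than forming a straight line.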



Article 476

Title@2025-07-18 (5): Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Title: Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Offene automatische Spracherkennungsmodelle für klassische und moderne Standard-Arabisch 经典和现代阿拉伯文标准开放自动语音识别模式 2507.13977v1

Authors (5): Lilit Grigoryan, Nikolay Karpov, Enas Albasiri, Vitaly Lavrukhin, Boris Ginsburg

Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language’s complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.



Article 477

Title@2025-07-18 (5): From Roots to Rewards: Dynamic Tree Reasoning with RL

Title: From Roots to Rewards: Dynamic Tree Reasoning with RL Von Wurzeln zu Belohnungen: Dynamische Baumveranlagung mit RL 从根到奖赏: 使用 RL 解释动态树 2507.13142v2

Authors (2): Ahmed Bahloul, Simon Malberg

Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.



Article 478

Title@2025-07-18 (5): Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Title: Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking Ev2R: Evidence Retrieval im automatisierten Fact-Checking bewerten Ev2R:评价自动实况调查中的证据检索 2411.05375v2

Authors (3): Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev2R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev2R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev2R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation.



Article 479

Title@2025-07-18 (5): Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

Title: Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need Bottom-up Domain-spezifische Superintelligenz: Eine zuverlässige Wissensgrafik ist das, was wir brauchen 自下而上 内地特有超级情报机构:一个可靠的知识图是我们需要的 2507.13966v1

Authors (3): Bhishma Dedhia, Yuval Kansal, Niraj K. Jha

Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model’s performance. While the industry’s approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.



Article 480

Title@2025-07-18 (5): Exploiting Primacy Effect To Improve Large Language Models

Title: Exploiting Primacy Effect To Improve Large Language Models Nutzung des Primateffekts zur Verbesserung großer Sprachmodelle 利用优势效应改进大语言模式 2507.13949v1

Authors (2): Bianca Raimondi, Maurizio Gabbrielli

Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect, where items presented first are more likely to be remembered or selected, plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
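The reordering idea can be sketched as follows. This is a hedged illustration only: the abstract does not say which similarity model is used or in which direction the options are sorted, so the bag-of-words cosine below is a crude stand-in for a semantic similarity model, and placing the most similar option first is an assumption. The names `bow`, `cosine`, and `reorder_options` are hypothetical.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def reorder_options(question, options):
    """Return the options sorted by decreasing similarity to the question.
    sorted() is stable, so tied options keep their original relative order."""
    q = bow(question)
    return sorted(options, key=lambda opt: cosine(q, bow(opt)), reverse=True)

question = "Which planet is known as the red planet?"
options = ["Venus", "The red planet Mars", "Jupiter"]
reordered = reorder_options(question, options)
```

No knowledge of the correct answer enters the ordering; only the question and the option texts are used.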



Article 481

Title@2025-07-18 (5): Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support

Title: Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support Marcel: Ein leichter und offener Gesprächsagent für Studentenunterstützung an der Universität 马塞尔:一个轻量级和开放源码的大学学生支助对话代理人 2507.13937v1

Authors (6): Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert

We present Marcel, a lightweight and open-source conversational agent designed to support prospective students with admission-related inquiries. The system aims to provide fast and personalized responses while reducing the workload of university staff. We employ retrieval-augmented generation to ground answers in university resources and to provide users with verifiable, contextually relevant information. To improve retrieval quality, we introduce an FAQ retriever that maps user questions to knowledge-base entries, allowing administrators to steer retrieval and improving over standard dense/hybrid retrieval strategies. The system is engineered for easy deployment in resource-constrained academic settings. We detail the system architecture, provide a technical evaluation of its components, and report insights from a real-world deployment.



Article 482

Title@2025-07-18 (5): Preprint: Did I Just Browse A Website Written by LLMs?

Title: Preprint: Did I Just Browse A Website Written by LLMs? Preprint: Habe ich gerade eine Website durchsucht, die von LLMs geschrieben wurde? 预印:我刚刚浏览了一个由LLMS编写的网站吗? 2507.13933v1

Authors (3): Sichang “Steven” He, Ramesh Govindan, Harsha V. Madhyastha

Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this “LLM-dominant” content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are insufficient, because they perform well mainly on clean, prose-like text, while web content has complex markup and diverse genres. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector’s outputs for multiple prose-like pages. We train and evaluate our detector by collecting two distinct ground-truth datasets totaling 120 sites, and obtain 100% accuracy when testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
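Site-level classification from page-level detector outputs can be illustrated with a simple aggregation rule. The abstract does not specify how the outputs are combined, so the thresholded vote below is purely an assumption; `is_llm_dominant`, `page_threshold`, and `site_fraction` are hypothetical names and values, and the scores are made up.

```python
def is_llm_dominant(page_scores, page_threshold=0.5, site_fraction=0.5):
    """Decide a site-level label from per-page detector scores in [0, 1].

    A page counts as flagged when its score reaches page_threshold; the site
    is labelled LLM-dominant when at least site_fraction of pages are flagged.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not page_scores:
        raise ValueError("need at least one page score")
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    return flagged / len(page_scores) >= site_fraction

# Made-up detector scores for the prose-like pages of two sites.
site_a = [0.91, 0.84, 0.20, 0.77]
site_b = [0.10, 0.25, 0.90]
labels = (is_llm_dominant(site_a), is_llm_dominant(site_b))
```

Aggregating over several pages makes the decision less sensitive to a single misclassified page than classifying each page in isolation.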



Article 483

Title@2025-07-18 (5): The Levers of Political Persuasion with Conversational AI

Title: The Levers of Political Persuasion with Conversational AI Die Leiter der politischen Überzeugung mit konversatorischer KI 与AI协会对话的政治见解的先锋 2507.13919v1

Authors (10): Kobi Hackenburg, Ben M. Tappin, Luke Hewitt, Ed Saunders, Sid Black, Hause Lin, Catherine Fist, Helen Margetts, David G. Rand, Christopher Summerfield

There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs (including some post-trained explicitly for persuasion) to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods (which boosted persuasiveness by as much as 51% and 27% respectively) than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs’ unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.



Article 484

Title@2025-07-18 (5): Political Leaning and Politicalness Classification of Texts

Title: Political Leaning and Politicalness Classification of Texts Politisches Leaning und Politisches Einordnen von Texten 文本的政治精度和政治政治性分类 2507.13913v1

Authors (2): Matous Volf, Jakub Simko

This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
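The two benchmarking protocols can be sketched as split generators. The abstract does not define them, so the reading below is an assumption: leave-one-out trains on all datasets but one and tests on the held-out one, while leave-one-in trains on a single dataset and tests on all the others. The function and dataset names are hypothetical.

```python
def leave_one_out(dataset_names):
    """Yield (train, test) pairs: train on all datasets except one, test on it."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, [held_out]

def leave_one_in(dataset_names):
    """Yield (train, test) pairs: train on one dataset, test on all the others."""
    for kept in dataset_names:
        test = [name for name in dataset_names if name != kept]
        yield [kept], test

# Placeholder dataset identifiers, assumed for illustration.
names = ["news", "tweets", "speeches"]
loo_splits = list(leave_one_out(names))
loi_splits = list(leave_one_in(names))
```

Under this reading, both protocols test only on data from sources unseen during training, which is what makes them suitable for probing out-of-distribution performance.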



Article 485

Title@2025-07-18 (5): Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review

Title: Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review 从商业文件中提取的基于深学习的关键信息:系统文献审查 2408.06345v2

Authors (2): Alexander Michael Rombach, Peter Fettke

Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in Deep Learning, a plethora of Deep Learning based approaches for Key Information Extraction have been proposed under the umbrella term Document Understanding that enable the processing of complex business documents. The goal of this systematic literature review is an in-depth analysis of existing approaches in this domain and the identification of opportunities for further research. To this end, 130 approaches published between 2017 and 2024 are analyzed in this study.



Article 486

Title@2025-07-18 (5): Using LLMs to identify features of personal and professional skills in an open-response situational judgment test

Title: Using LLMs to identify features of personal and professional skills in an open-response situational judgment test Verwendung von LLMs zur Identifizierung von Merkmalen persönlicher und beruflicher Fähigkeiten in einem offenen situativen Beurteilungstest 利用LLMM 确定公开反应情况判断测试中个人和专业技能的特点 2507.13881v1

Authors (4): Cole Walsh, Rodica Ivan, Muhammad Zafar Iqbal, Colleen Robb

Academic programs are increasingly recognizing the importance of personal and professional skills and their critical role alongside technical expertise in preparing students for future success in diverse career paths. With this growing demand comes the need for scalable systems to measure, evaluate, and develop these skills. Situational Judgment Tests (SJTs) offer one potential avenue for measuring these skills in a standardized and reliable way, but open-response SJTs have traditionally relied on trained human raters for evaluation, presenting operational challenges to delivering SJTs at scale. Past attempts at developing NLP-based scoring systems for SJTs have fallen short due to issues with construct validity of these systems. In this article, we explore a novel approach to extracting construct-relevant features from SJT responses using large language models (LLMs). We use the Casper SJT to demonstrate the efficacy of this approach. This study sets the foundation for future developments in automated scoring for personal and professional skills.



Article 487

Title@2025-07-18 (5): Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies

Title: Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies Optimierung von ASR für katalanische-spanische Code-Switching: Eine vergleichende Analyse von Methodologien 优化加泰罗尼亚-西班牙编码转换的ASR:方法比较分析 2507.13875v1

Authors (9): Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando

Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI’s Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.



Article 488

Title@2025-07-18 (5): Label Unification for Cross-Dataset Generalization in Cybersecurity NER

Title: Label Unification for Cross-Dataset Generalization in Cybersecurity NER Label-Einheit für Cross-Dataset-Verallgemeinerung in Cybersecurity NER 网络安全通用化网络安全 2507.13870v1

Authors (3): Maciej Jalocha, Johan Hausted Schmidt, William Michelseen

The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.



Article 489

Title@2025-07-18 (5): HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Title: HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation HoH: Ein dynamischer Benchmark zur Bewertung der Auswirkungen veralteter Informationen auf die retrieval-augmentierte Generation HoH:评估过时信息对回源一代人的影响的动态基准 2503.04800v3

Authors (7): Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu

While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
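A token-level diff between two revisions of a passage can be illustrated with Python's standard `difflib`. This is a sketch under stated assumptions, not the HoH pipeline: whitespace tokenisation, the helper name `changed_spans`, and the example sentences are all choices made here for illustration.

```python
import difflib

def changed_spans(old_text, new_text):
    """Return (operation, old_span, new_span) for every non-equal region
    between two whitespace-tokenised texts."""
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return spans

# Two made-up snapshots of the same fact at different times.
old = "The tower is 300 metres tall"
new = "The tower is 330 metres tall"
spans = changed_spans(old, new)
```

Each returned span pairs an outdated value with its replacement, which is the kind of signal a question-answer pair about a changed fact could be built from.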



Article 490

Title@2025-07-18 (5): SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection

Title: SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection SPARQL Query Generation mit LLMs: Messung der Auswirkungen von Trainingsdatenerfassung und Wissensinjektion SPARQL 使用LLMs 进行查询:衡量培训数据记忆和知识输入的影响 2507.13859v1

Authors (4): Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, Andreas Both

Nowadays, the importance of software with natural-language user interfaces should not be underestimated. In particular, in Question Answering (QA) systems, generating a SPARQL query for a given natural-language question (often named Query Building) from the information retrieved from the same question is the central task of QA systems working over Knowledge Graphs (KGQA). With the rise of Large Language Models (LLMs), these models are considered a well-suited method to increase the quality of the question-answering functionality, as there is still a lot of room for improvement, aiming for enhanced quality and trustworthiness. However, LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question under various conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with “anonymized” knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality improved by LLMs. Ultimately, this will help to identify how portable a method is or whether good results might mostly be achieved because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; therefore, it could be easily applied to any KGQA or LLM, so that consistent insights into the actual LLM capabilities can be generated.



Article 491

Title@2025-07-18 (5): InTraVisTo: Inside Transformer Visualisation Tool

Title: InTraVisTo: Inside Transformer Visualisation Tool InTraVisTo: Innen-Transformer-Visualisierungswerkzeug IntraVisto: 内部变异可视化工具 2507.13858v1

Authors (5): Nicolò Brunello, Davide Rigamonti, Andrea Sassella, Vincenzo Scotti, Mark James Carman

The reasoning capabilities of Large Language Models (LLMs) have increased greatly over the last few years, as have their size and complexity. Nonetheless, the use of LLMs in production remains challenging due to their unpredictable nature and discrepancies that can exist between their desired behavior and their actual model output. In this paper, we introduce a new tool, InTraVisTo (Inside Transformer Visualisation Tool), designed to enable researchers to investigate and trace the computational process that generates each token in a Transformer-based LLM. InTraVisTo provides a visualization of both the internal state of the Transformer model (by decoding token embeddings at each layer of the model) and the information flow between the various components across the different layers of the model (using a Sankey diagram). With InTraVisTo, we aim to help researchers and practitioners better understand the computations being performed within the Transformer model and thus to shed some light on internal patterns and reasoning processes employed by LLMs.



Article 492

Title@2025-07-18 (5): Modeling Fair Play in Detective Stories with Language Models

Title: Modeling Fair Play in Detective Stories with Language Models Modeling Fair Play in Detektivgeschichten mit Sprachmodellen 模拟具有语言模式的侦探故事中的公平游戏 2507.13841v1

Authors (3): Eitan Wagner, Renana Keydar, Omri Abend

Effective storytelling relies on a delicate balance between meeting the reader’s prior expectations and introducing unexpected developments. In the domain of detective fiction, this tension is known as fair play, which includes the implicit agreement between the writer and the reader as to the range of possible resolutions the mystery story may have. In this work, we present a probabilistic framework for detective fiction that allows us to define desired qualities. Using this framework, we formally define fair play and design appropriate metrics for it. Stemming from these definitions is an inherent tension between the coherence of the story, which measures how much it “makes sense”, and the surprise it induces. We validate the framework by applying it to LLM-generated detective stories. This domain is appealing since we have an abundance of data, we can sample from the distribution generating the story, and the story-writing capabilities of LLMs are interesting in their own right. Results show that while LLM-generated stories may be unpredictable, they generally fail to balance the trade-off between surprise and fair play, which greatly contributes to their poor quality.



Article 493

Title@2025-07-18 (5): The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words

Title: The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words Die Ausdrucksformen von Depression und Angst im chinesischen Psycho-Springen: Verwendung von Singular Pronomen und negativen emotionalen Wörtern 《中国心理咨询中抑郁和焦虑的表现形式:第一人使用一人独唱普罗诺文和消极情感言词》 2507.13839v1

Authors (10): Lizhi Ma, Tong Zhao, Shuai Zhang, Nirui Song, Hongliang He, Anqi Li, Ran Feng, Huachuan Qiu, Jingsong Ma, Zhenzhong Lan

This study explores the relationship between linguistic expressions and psychological states of depression and anxiety within Chinese psycho-counseling interactions, focusing specifically on the usage of first-person singular pronouns and negative emotional words. Utilizing a corpus derived from 735 online counseling sessions, the analysis employed a general linear mixed-effect model to assess linguistic patterns quantified by the Linguistic Inquiry and Word Count (LIWC) software. Results indicate a significant positive correlation between the frequency of negative emotional words and the severity of both depressive and anxious states among clients. However, contrary to prior findings predominantly derived from English-language contexts, the usage frequency of first-person singular pronouns did not vary significantly with the clients’ psychological conditions. These outcomes are discussed within the framework of cultural distinctions between collectivist Chinese contexts and individualistic Western settings, as well as the interactive dynamics unique to psycho-counseling conversations. The findings highlight the nuanced influence of cultural and conversational contexts on language use in mental health communications, providing insights into psycholinguistic markers relevant to therapeutic practices in Chinese-speaking populations.



Article 494

Title@2025-07-18 (5): LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop

Title: LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v3

Authors (4): Runcong Zhao, Artem Bobrov, Jiazheng Li, Yulan He

Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.



Article 495

Title@2025-07-18 (5): Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models

Title: Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models Frage-Antwort-Extraktion aus wissenschaftlichen Artikeln mit Wissensgraphen und großen Sprachmodellen 利用知识图和大语言模型从科学文章中提取问题答案 2507.13827v1

Authors (6): Hosein Azarbonyad, Zi Long Zhu, Georgios Cheirmpos, Zubair Afzal, Vikrant Yadav, Georgios Tsatsaronis

When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article’s novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct a KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and using it to build the graph. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
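The triplet saliency idea, importance within the article weighed against prevalence in the literature, can be sketched with a plain TF-IDF-style score. The paper's exact formula is not given in the abstract, so the score below (count in the article times the log of inverse document frequency) is an assumption, as are the name `triplet_saliency` and the toy triplets.

```python
import math
from collections import Counter

def triplet_saliency(article_triplets, corpus):
    """Score each (head, relation, tail) triplet of one article by
    count_in_article * log(n_articles / n_articles_containing_triplet).
    The corpus is a list of triplet lists and is assumed to include the article."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each triplet once per article
    counts = Counter(article_triplets)
    return {
        triplet: count * math.log(n_docs / doc_freq[triplet])
        for triplet, count in counts.items()
        if doc_freq[triplet] > 0
    }

# Toy triplets, made up for illustration.
common = ("model", "trained_on", "data")
novel = ("method", "improves", "recall")
article = [common, novel, novel]
corpus = [article, [common], [common]]
scores = triplet_saliency(article, corpus)
```

A triplet that appears in every article scores zero, while one that is frequent in this article and rare elsewhere scores highest, which matches the intuition of selecting what is distinctive about the article.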



Article 496

Title@2025-07-18 (5): RAG-based Architectures for Drug Side Effect Retrieval in LLMs

Title: RAG-based Architectures for Drug Side Effect Retrieval in LLMs RAG-basierte Architekturen für Arzneimittel-Side-Effekt-Retrieval in LLMs 以RAG为基础的长效LM中药物副效应回收建筑 2507.13822v1

Authors (6): Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano

Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, marking a significant advancement in leveraging LLMs for critical pharmacovigilance applications.



Article 497

Title@2025-07-18 (5): Exploring Graph Representations of Logical Forms for Language Modeling

Title: Exploring Graph Representations of Logical Forms for Language Modeling Erforschen von Graphendarstellungen von Logischen Formen für die Sprachmodellierung 探讨语言建模逻辑格式图示图示 2505.14523v2

Authors (1): Michael Sullivan

We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs (BERT) pretrained on the same data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.



Article 498

Title@2025-07-18 (5): Consistency of Responses and Continuations Generated by Large Language Models on Social Media

Title: Consistency of Responses and Continuations Generated by Large Language Models on Social Media Kohärenz von Reaktionen und Fortsetzungen, die von großen Sprachmodellen in den sozialen Medien erzeugt werden 由社会媒体大语言模式生成的应对措施和延续的一致性 2501.08102v3

Authors (5): Wenlu Fan, Yuqi Zhu, Chenyang Wang, Bin Wang, Wentao Xu

Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using two open-source models: Gemma and Llama. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic similarity between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: Gemma shows a tendency toward negative emotion amplification, particularly anger, while maintaining certain positive emotions like optimism. Llama demonstrates superior emotional preservation across a broader spectrum of affects. Both models systematically generate responses with attenuated emotional intensity compared to human-authored content and show a bias toward positive emotions in response tasks. Additionally, both models maintain strong semantic similarity with original texts, though performance varies between continuation and response tasks. These findings provide insights into LLMs’ emotional and semantic processing capabilities, with implications for their deployment in social media contexts and human-AI interaction design.



Article 499

Title@2025-07-18 (5): Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian

Title: Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian Code Lesbarkeit im Zeitalter großer Sprachmodelle: Eine industrielle Fallstudie von Atlassian 《大语言模式时代的可读性:阿特拉斯斯语工业案例研究》 2501.11264v3

Authors (6): Wannita Takerngsaksiri, Chakkrit Tantithamthavorn, Micheal Fu, Jirat Pasuksmit, Kun Chen, Ming Wu

Software engineers spend a significant amount of time reading code during the software development process, especially in the age of large language models (LLMs) that can automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners’ perspectives in this new era. In this paper, we conduct a survey to explore the practitioners’ perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.



Article 500

Title@2025-07-18 (5): An Enhanced Model-based Approach for Short Text Clustering

Title: An Enhanced Model-based Approach for Short Text Clustering Ein verbesserter modellbasierter Ansatz für Kurztext-Clustering 强化的短文本集群化基于模式的强化办法 2507.13793v1

Authors (6): Enhao Cheng, Shoujia Zhang, Jianhua Yin, Xuemeng Song, Tian Gan, Liqiang Nie

Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.



Article 501

Title@2025-07-18 (5): Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Title: Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions Vision-Sprachen-Modelle zu fragen lehren: Ambiguität in visuellen Fragen lösen 教学 “ 视觉-语言模型:解决视觉问题中的模糊问题 “ 的问询 2507.13773v1

Authors (5): Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang

In visual question answering (VQA), users often pose ambiguous questions to visual language models (VLMs) because of varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) no benchmarks exist to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in VQA and encompasses various VQA scenarios.



Article 502

Title@2025-07-18 (5): From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Title: From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation Von KMMLU-Redux zu KMMLU-Pro: Eine professionelle koreanische Benchmark-Suite für die LLM-Bewertung 从KMMLU-Redux到KMMLU-Pro:韩国用于LLM评价的专业基准套件 2507.08924v2

Authors (6): Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee

The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We make our dataset publicly available.



Article 503

Title@2025-07-18 (5): Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

Title: Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models Unschuld im Kreuzfeuer: Rollen von Skip Connections in Jailbreaking Visual Language Models 《交火中的无罪:在破狱视觉语言模型中跳过连接的作用》 2507.13761v1

Authors (3): Palash Nandi, Maithili Joshi, Tanmoy Chakraborty

Language models are highly sensitive to prompt formulations - small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.



Article 504

Title@2025-07-18 (5): PRIDE – Parameter-Efficient Reduction of Identity Discrimination for Equality in LLMs

Title: PRIDE – Parameter-Efficient Reduction of Identity Discrimination for Equality in LLMs PRIDE – Parameter-Effiziente Reduzierung der Identitätsdiskriminierung für die Gleichstellung in LLMs PRIDE – – 有效减少在LLM中平等身份歧视的参数 2507.13743v1

Authors (2): Maluna Menke, Thilo Hagendorff

Large Language Models (LLMs) frequently reproduce the gender- and sexual-identity prejudices embedded in their training corpora, leading to outputs that marginalize LGBTQIA+ users. Hence, reducing such biases is of great importance. To achieve this, we evaluate two parameter-efficient fine-tuning (PEFT) techniques - Low-Rank Adaptation (LoRA) and soft-prompt tuning - as lightweight alternatives to full-model fine-tuning for mitigating such biases. Using the WinoQueer benchmark, we quantify bias in three open-source LLMs and observe baseline bias scores reaching up to 98 (out of 100) across a range of queer identities defined by gender and/or sexual orientation, where 50 would indicate neutrality. Fine-tuning with LoRA (< 0.1% additional parameters) on a curated QueerNews corpus reduces those scores by up to 50 points and raises neutrality from virtually 0% to as much as 36%. Soft-prompt tuning (10 virtual tokens) delivers only marginal improvements. These findings show that LoRA can deliver meaningful fairness gains with minimal computation. We advocate broader adoption of community-informed PEFT, the creation of larger queer-authored corpora, and richer evaluation suites beyond WinoQueer, coupled with ongoing audits to keep LLMs inclusive.



Article 505

Title@2025-07-18 (5): From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios

Title: From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios Von Worten bis zu Kollisionen: LLM-geführte Bewertung und adversarische Generierung von sicherheitskritischen Fahrszenarien 从文字到碰撞:LLM-指导评价和反向生成安全紧急驾驶设想方案 2502.02145v4

Authors (5): Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz

Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.



Article 506

Title@2025-07-18 (5): DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs

Title: DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs DailyLLM: Context-Aware-Aktivitätsprotokollierung mit Multi-Modal-Sensoren und LLMs DailyLLM: 使用多模式传感器和LLM 生成背景软件活动日志 2507.13737v1

Authors (6): Ye Tian, Xiaoyuan Ren, Zihao Wang, Onat Gungor, Xiaofan Yu, Tajana Rosing

Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.



Article 507

Title: The Judge Variable: Challenging Judge-Agnostic Legal Judgment Prediction Die Richtervariable: Herausfordernde Richter-agnostische rechtliche Urteilsvorhersage 法官变量:挑战法官-不可接受法律判决预测 2507.13732v1

Authors (1): Guillaume Zambrano

This study examines the role of human judges in legal decision-making by using machine learning to predict child physical custody outcomes in French appellate courts. Building on the legal realism-formalism debate, we test whether individual judges’ decision-making patterns significantly influence case outcomes, challenging the assumption that judges are neutral variables that apply the law uniformly. To ensure compliance with French privacy laws, we implement a strict pseudonymization process. Our analysis uses 18,937 living arrangements rulings extracted from 10,306 cases. We compare models trained on individual judges’ past rulings (specialist models) with a judge-agnostic model trained on aggregated data (generalist model). The prediction pipeline is a hybrid approach combining large language models (LLMs) for structured feature extraction and ML models for outcome prediction (RF, XGB and SVC). Our results show that specialist models consistently achieve higher predictive accuracy than the generalist model, with top-performing models reaching F1 scores as high as 92.85%, compared to 82.63% for the generalist model, which is trained on 20x to 100x more samples. Specialist models capture stable individual patterns that are not transferable to other judges. In-Domain and Cross-Domain validity tests provide empirical support for legal realism, demonstrating that judicial identity plays a measurable role in legal outcomes. All data and code used will be made available.



Article 508

Title@2025-07-18 (5): DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Title: DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Verstärkungslernen für subgoale Zersetzung 深SeepSeek-Prover-V2:通过强化学习推进正规数学理由,以降低次级目标的分目标分解 2504.21801v2

Authors (18): Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, Chong Ruan

We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3’s step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (2024 and 2025). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.
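The majority voting mentioned for the informal DeepSeek-V3 baseline can be sketched as follows. This is a generic illustration of the technique, not the authors' evaluation code; the sampled answers are made up, and the tie-breaking rule (earliest answer wins) is an assumption.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled completions."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-seen order for equal counts,
    # so ties go to the answer that was sampled earliest (assumed rule).
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled solutions.
samples = ["204", "197", "204", "204", "113"]
final = majority_vote(samples)
print(final)
```

A problem counts as solved under this scheme only if the voted answer matches the reference, which is a weaker guarantee than a formal proof checked by Lean 4.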



Article 509

Title@2025-07-18 (5): Automatically assessing oral narratives of Afrikaans and isiXhosa children

Title: Automatically assessing oral narratives of Afrikaans and isiXhosa children Automatische Beurteilung mündlicher Erzählungen von Afrikaans und isiXhosa Kindern 自动评估南非荷兰语和土著Xhoosa儿童口述叙述 2507.13205v2

Authors (6): Retief Louw, Emma Sharratt, Febe de Wet, Christiaan Jacobs, Annelien Smith, Herman Kamper

Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children’s learning.



Article 510

Title@2025-07-18 (5): To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization

Title: To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization Um zu kodieren oder nicht zu kodieren? Adaptive Toolintegration für Math Language Models über Erwartungs-Maximierung 代码或非代码?通过期望-最大化将数学语言模型整合的适应性工具集成 2502.00691v4

Authors (7): Haozhe Wang, Long Li, Chao Qu, Fengming Zhu, Weidi Xu, Wei Chu, Fangzhen Lin

Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness – the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.



Article 511

Title@2025-07-18 (5): LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning

Title: LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning LLM-getriebene medizinische Report Generierung über kommunikationseffizientes Heterogenes Federated Learning LLM 驱动的通过通信效率高的异质联邦学习编写医学报告 2506.17562v2

Authors (6): Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen

LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
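To see why low-rank factorization of parameter updates cuts transmission cost, consider a rough count of the values a client would send for one weight matrix. The sketch below assumes the update is expressed as a product of two factors of rank `rank`; the matrix dimensions and the rank are illustrative values, not figures from the paper.

```python
def full_update_size(rows, cols):
    """Number of values in a dense update for a rows x cols weight matrix."""
    return rows * cols


def low_rank_update_size(rows, cols, rank):
    """Number of values when the update is sent as two factors:
    A (rows x rank) and B (rank x cols), with update = A @ B."""
    return rank * (rows + cols)


# Illustrative sizes (assumed, not from the paper).
rows, cols, rank = 4096, 4096, 8
full = full_update_size(rows, cols)
low = low_rank_update_size(rows, cols, rank)
ratio = low / full
print(full, low, ratio)
```

With these assumed sizes the factored update carries 1/256 of the values of the dense one, which is the kind of saving that matters in bandwidth-constrained federated settings.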



Article 512

Title@2025-07-18 (5): ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Title: ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems ASTRID – Eine automatisierte und skalierbare TRIaD für die Bewertung von RAG-basierten klinischen Frageantwortsystemen ASTRID – – 用于评价以RAG为基础的临床问题解答系统的自动和可升级的TRIAD 2501.08208v2

Authors (5): Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim

Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model’s response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.



Article 513

Title@2025-07-18 (5): Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations

Title: Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations Konsequente Erklärer oder unzuverlässige Erzähler? LLM-generierte Gruppenempfehlungen verstehen 理解LLM提出的集团建议 2507.13705v1

Authors (3): Cedric Waterschoot, Nava Tintarev, Francesco Barile

Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
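The distinction between Additive Utilitarian aggregation and averaging can be made concrete with a small sketch. When every member rates every item, the two produce the same ranking, because the average is the sum divided by a constant. They can diverge only when items have different numbers of ratings. The toy ratings below are invented for illustration, and whether the paper's scenarios included missing ratings is not stated in the abstract.

```python
def additive_utilitarian(ratings):
    """ADD aggregation: sum the ratings each item received from the group."""
    return {item: sum(r) for item, r in ratings.items()}


def average(ratings):
    """Averaging: mean rating per item, skipping items nobody rated."""
    return {item: sum(r) / len(r) for item, r in ratings.items() if r}


def best(scores):
    """Item with the highest aggregated score."""
    return max(scores, key=scores.get)


# Hypothetical group of three; one member did not rate item B.
ratings = {
    "A": [4, 4, 4],  # sum 12, mean 4.0
    "B": [5, 5],     # sum 10, mean 5.0
}
add_choice = best(additive_utilitarian(ratings))
avg_choice = best(average(ratings))
print(add_choice, avg_choice)
```

Here ADD recommends item A while averaging recommends item B, so an explanation that cites averaging does not necessarily describe an ADD-like recommendation.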



Article 514

Title@2025-07-18 (5): Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

Title: Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models Modellierung der Open-World-Kognition als On-Demand-Synthese probabilistischer Modelle 将开放世界的认知建模作为概率模型的 “ 现场合成 “ 模型 2507.12547v2

Authors (11): Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy O’Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson

When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea – a “Model Synthesis Architecture” (MSA) – using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset – built around a Model Olympics domain of sports vignettes – tests models’ capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people’s ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.



Article 515

Title@2025-07-18 (5): LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Title: LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues LoopServe: Ein adaptives Dual-Phase-LLM-Inferenz-Beschleunigungssystem für Multi-Turn-Dialoge 环环服务:多轨对话的适应性双阶段双阶段LLM推推加速系统 2507.13681v1

Authors (12): Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan

Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key-value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key-value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark (https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs) with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.



Article 516

Title@2025-07-18 (5): KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs

Title: KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs KiC: Schlüsselwort-inspirierte Cascade für kosteneffiziente Textgenerierung mit LLMs KIC: 与LLMs一起制作成本效率高的文本的关键字启发级联 2507.13666v1

Authors (3): Woo-Chan Kim, Ji-Hoon Park, Seong-Whan Lee

Large language models (LLMs) have demonstrated state-of-the-art performance across a wide range of natural language processing tasks. However, high-performing models are typically accessible only via APIs, incurring substantial inference costs. Cascade methods address this by initially employing a cheaper model and escalating to a stronger one only when necessary. Nevertheless, existing cascade approaches struggle to select a reliable representative response and assess the overall reliability of free-form outputs, as they rely on exact text matching. To overcome these limitations, we propose Keyword-inspired Cascade (KiC), a novel framework for cost-efficient free-form text generation. KiC identifies the most representative answer among multiple outputs from a weaker model and evaluates the semantic alignment of other responses with it. Based on the degree of alignment, KiC determines whether to accept the weaker model’s output or escalate to a stronger model. Experiments on three free-form text generation benchmarks show that KiC achieves 97.53 percent of GPT-4’s accuracy while reducing API costs by 28.81 percent on average, and even outperforms GPT-4 in a specific benchmark.
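The cascade decision described in this abstract can be illustrated with a minimal sketch. The abstract does not specify how keywords are extracted or how alignment is scored, so everything below is an assumption made for illustration: the stopword list, the Jaccard overlap of word sets as the alignment measure, the `accept_threshold` parameter, and the function names `keywords`, `jaccard`, and `cascade` are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, List, Set, Tuple

# Tiny illustrative stopword list (an assumption, not the paper's keyword extractor).
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "and", "it"}


def keywords(text: str) -> Set[str]:
    """Lowercased word set minus stopwords; a stand-in for real keyword extraction."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two keyword sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cascade(
    weak_samples: List[str],
    strong_model: Callable[[], str],
    accept_threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (answer, escalated).

    The representative is the weak-model sample whose keywords overlap most,
    on average, with the other samples. If that average overlap falls below
    accept_threshold, the strong model is called instead.
    """
    kw = [keywords(s) for s in weak_samples]
    n = len(weak_samples)

    def mean_overlap(i: int) -> float:
        others = [jaccard(kw[i], kw[j]) for j in range(n) if j != i]
        return sum(others) / len(others) if others else 1.0

    best = max(range(n), key=mean_overlap)
    if mean_overlap(best) >= accept_threshold:
        return weak_samples[best], False
    return strong_model(), True


if __name__ == "__main__":
    agreeing = ["Paris is the capital of France", "The capital of France is Paris", "Paris"]
    print(cascade(agreeing, lambda: "strong-model answer"))
    disagreeing = ["Paris", "Lyon", "Marseille"]
    print(cascade(disagreeing, lambda: "strong-model answer"))
```

The point of the sketch is only the control flow: agreement among cheap samples is measured semantically (here, crudely, by keyword overlap) rather than by exact string match, and the expensive model is called only when agreement is low.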



Article 517

Title@2025-07-18 (5): CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer

Title: CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer CU-ICU: Anpassen unüberwachter Instruktions-Finetuned Language Models für ICU-Datensätze über Text-zu-Text Transfer Transformer CU-ICU: 通过文本到文字传输变换器定制ICU数据集的不受监督的指令-不全调语言模型 2507.13655v1

Authors (1): Teerapong Panboonyuen

Integrating large language models into specialized domains like healthcare presents unique challenges, including domain adaptation and limited labeled data. We introduce CU-ICU, a method for customizing unsupervised instruction-finetuned language models for ICU datasets by leveraging the Text-to-Text Transfer Transformer (T5) architecture. CU-ICU employs a sparse fine-tuning approach that combines few-shot prompting with selective parameter updates, enabling efficient adaptation with minimal supervision. Our evaluation across critical ICU tasks (early sepsis detection, mortality prediction, and clinical note generation) demonstrates that CU-ICU consistently improves predictive accuracy and interpretability over standard fine-tuning methods. Notably, CU-ICU achieves up to a 15% increase in sepsis detection accuracy and a 20% enhancement in generating clinically relevant explanations while updating fewer than 1% of model parameters in its most efficient configuration. These results establish CU-ICU as a scalable, low-overhead solution for delivering accurate and interpretable clinical decision support in real-world ICU environments.



Article 518

Title@2025-07-18 (5): The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Title: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Die Illusion des Denkens: Die Stärken und Grenzen von Vernunftmodellen über das Lens of Problem Complexity verstehen 思考的幻觉:通过问题复杂焦点了解理性模型的长处和局限性 2506.06941v2

Authors (6): Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite a remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.



Article 519

Title@2025-07-18 (5): EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

Title: EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation EvolveNav: Selbstverbessernde körpereigene Begründung für LLM-basierte Vision-Language-Navigation EvolveNav:基于LLM的愿景-语言导航自我改善自足理由 2506.01551v2

Authors (11): Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

Building Vision-Language Navigation (VLN) agents that can navigate by following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs’ reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs’ training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, which makes the mapping hard to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model’s navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.



Article 520

Title@2025-07-18 (5): Temporal reasoning for timeline summarisation in social media

Title: Temporal reasoning for timeline summarisation in social media Temporale Argumentation für Zeitlinienzusammenfassung in sozialen Medien 社交媒体时间时间总结推理 2501.00152v3

Authors (4): Jiayu Song, Mahmud Elahi Akhter, Dana Atzil Slonim, Maria Liakata

This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarisation, the task of summarising long texts containing sequences of events, such as social media threads. We first introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarisation through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarisation. Experimental results demonstrate that our model achieves superior performance on out-of-domain mental health-related timeline summarisation tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance and generalisability of leveraging temporal reasoning to improve timeline summarisation.



Article 521

Title@2025-07-18 (5): ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text

Title: ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text ViMMRC 2.0 – Verbesserung des Leseverständnisses in vietnamesischer Literatur Text VIMRC 2.0 – – 加强对越南文学文本的机器阅读理解 2303.18162v3

Authors (5): Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To enable computers to understand a reading text and answer questions about it, we introduce ViMMRC 2.0, an extension of the previous ViMMRC for the task of multiple-choice reading comprehension on Vietnamese textbooks, which contain reading articles for students from Grade 1 to Grade 12. This dataset has 699 reading passages, comprising prose and poems, and 5,273 questions. The questions in the new dataset are not fixed at four options as in the previous version. Moreover, the difficulty of the questions is increased, which challenges models to find the correct choice. The computer must understand the whole context of the reading passage, the question, and the content of each choice to extract the right answers. Hence, we propose a multi-stage approach that combines the multi-step attention network (MAN) with the natural language inference (NLI) task to enhance the performance of the reading comprehension model. Then, we compare the proposed methodology with the baseline BERTology models on the new dataset and on ViMMRC 1.0. From the results of the error analysis, we found that the main challenge for reading comprehension models is understanding the implicit context in texts and linking it together in order to find the correct answers. Finally, we hope our new dataset will motivate further research to enhance the ability of computers to understand the Vietnamese language.



Article 522

Title@2025-07-18 (5): Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Title: Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden 人类和大语言模式产生的文本的语言和嵌入式图解 2507.13614v1

Authors (2): Sergio E. Zanotto, Segun Aroyehun

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.



Article 523

Title@2025-07-18 (5): Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?

Title: Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v3

Authors (6): Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar

Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85%, particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
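To make "well-calibrated" concrete, the sketch below computes expected calibration error (ECE), a standard calibration metric: predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The abstract does not state which calibration metrics the authors report, so the choice of ECE, the equal-width binning, and the function name `expected_calibration_error` are assumptions made for illustration.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1].

    confidences: self-reported confidence per answer, each in [0, 1].
    correct:     whether each answer was actually right.
    """
    n = len(confidences)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # An overconfident model: it claims 90% confidence but is right 1 time in 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

An overconfident model of the kind the abstract describes (high verbalized confidence on wrong answers) produces a large gap in the high-confidence bins, which is what drives ECE up.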



Article 524

Title@2025-07-18 (5): CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks

Title: CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks CoTasks: Chain-of-Thought-basierte Video-Anleitung Tuning-Aufgaben 考量表: 以研究链为基础的视频教学图示任务 2507.13609v1

Authors (5): Yanan Wang, Julio Vizcarra, Zhi Li, Hao Niu, Mori Kurokawa

Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs, often lacking structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions of existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial relation extraction, and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly enhances inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4, with large boosts in causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured CoT-style supervision framework for improving compositional video reasoning.



Article 525

Title@2025-07-18 (5): STACK: Adversarial Attacks on LLM Safeguard Pipelines

Title: STACK: Adversarial Attacks on LLM Safeguard Pipelines Gegenseitige Angriffe auf LLM Safeguard Pipelines 对LLM保障管道的反向攻击 2506.24068v2

Authors (8): Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave

Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.



Article 526

Title@2025-07-18 (5): Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Ist das Ihre letzte Antwort? Test-Time Scaling verbessert selektive Fragen beantworten 这就是你最后的答案吗? 测试时间缩放能改善选择性回答问题 2502.13962v2

Authors (3): William Jurayj, Jeffrey Cheng, Benjamin Van Durme

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
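The idea of thresholding responses by confidence and scoring under non-zero response risk can be sketched as follows. The abstract does not give the exact scoring rule, so the one used here is an assumption for illustration: a correct answer earns 1, an abstention earns 0, and a wrong answer costs `wrong_penalty`. The function name `selective_score` and its parameters are hypothetical.

```python
from typing import List, Tuple


def selective_score(
    predictions: List[Tuple[float, bool]],
    threshold: float,
    wrong_penalty: float,
) -> float:
    """Average utility of a system that answers only when confident enough.

    predictions:   (confidence, is_correct) pairs, one per question.
    threshold:     answer only if confidence >= threshold, otherwise abstain.
    wrong_penalty: cost of a wrong answer; 0 recovers the usual setting in
                   which wrong answers are free and abstaining never helps.
    """
    total = 0.0
    for confidence, is_correct in predictions:
        if confidence < threshold:
            continue  # abstain: contributes 0
        total += 1.0 if is_correct else -wrong_penalty
    return total / len(predictions)


if __name__ == "__main__":
    preds = [(0.9, True), (0.8, False), (0.3, False), (0.6, True)]
    for t in (0.0, 0.5, 0.85):
        print(t, selective_score(preds, threshold=t, wrong_penalty=1.0))
```

With `wrong_penalty=0` the best policy is always to answer, which is the assumption the abstract criticizes; once wrong answers carry a cost, raising the threshold can improve the score even though fewer questions are answered.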



Article 527

Title@2025-07-18 (5): When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models

Title: When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models Wenn Menschen überflutet sind: Analyse der Entmenschlichung von Metaphoren in Einwanderungsdiskursen mit großen Sprachmodellen 当人们遭受洪水时:用大语言模型分析移民问题中非人化的比喻 2502.13246v2

Authors (2): Julia Mendelsohn, Ceren Budak

Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. “water” or “vermin”). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.



Article 528

Title@2025-07-18 (5): TexGS-VolVis: Expressive Scene Editing for Volume Visualization via Textured Gaussian Splatting

Title: TexGS-VolVis: Expressive Scene Editing for Volume Visualization via Textured Gaussian Splatting TexGS-VolVis: Expressive Szenebearbeitung für die Volumenvisualisierung über texturierte Gaussian Splatting TexGS-VolVis: 通过Textured Gaussian Splatting 进行卷量可视化的显性场景编辑 2507.13586v1

Authors (4): Kaiyuan Tang, Kuangshi Ai, Jun Han, Chaoli Wang

Advancements in volume visualization (VolVis) focus on extracting insights from 3D volumetric data by generating visually compelling renderings that reveal complex internal structures. Existing VolVis approaches have explored non-photorealistic rendering techniques to enhance the clarity, expressiveness, and informativeness of visual communication. While effective, these methods often rely on complex predefined rules and are limited to transferring a single style, restricting their flexibility. To overcome these limitations, we advocate the representation of VolVis scenes using differentiable Gaussian primitives combined with pretrained large models to enable arbitrary style transfer and real-time rendering. However, conventional 3D Gaussian primitives tightly couple geometry and appearance, leading to suboptimal stylization results. To address this, we introduce TexGS-VolVis, a textured Gaussian splatting framework for VolVis. TexGS-VolVis employs 2D Gaussian primitives, extending each Gaussian with additional texture and shading attributes, resulting in higher-quality, geometry-consistent stylization and enhanced lighting control during inference. Despite these improvements, achieving flexible and controllable scene editing remains challenging. To further enhance stylization, we develop image- and text-driven non-photorealistic scene editing tailored for TexGS-VolVis and 2D-lift-3D segmentation to enable partial editing with fine-grained control. We evaluate TexGS-VolVis both qualitatively and quantitatively across various volume rendering scenes, demonstrating its superiority over existing methods in terms of efficiency, visual quality, and editing flexibility.



Article 529

Title@2025-07-17 (4): An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots

Title: An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots Ein Ansatz zur automatischen Generierung von Beschriftungsfunktionen für Software Engineering Chatbots 软件工程聊天器自动生成标签功能的方法 2410.07094v2

Authors (4): Ebube Alor, Ahmad Abdellatif, SayedHassan Khatoonabadi, Emad Shihab

Software engineering (SE) chatbots are increasingly gaining attention for their role in enhancing development processes. At the core of chatbots are Natural Language Understanding platforms (NLUs), which enable them to comprehend user queries but require labeled data for training. However, acquiring such labeled data for SE chatbots is challenging due to the scarcity of high-quality datasets, as training requires specialized vocabulary and phrases not found in typical language datasets. Consequently, developers often resort to manually annotating user queries – a time-consuming and resource-intensive process. Previous approaches require human intervention to generate rules, called labeling functions (LFs), that categorize queries based on specific patterns. To address this issue, we propose an approach to automatically generate LFs by extracting patterns from labeled user queries. We evaluate our approach on four SE datasets and measure performance improvement from training NLUs on queries labeled by the generated LFs. The generated LFs effectively label data with AUC scores up to 85.3% and NLU performance improvements up to 27.2%. Furthermore, our results show that the number of LFs affects labeling performance. We believe that our approach can save time and resources in labeling users’ queries, allowing practitioners to focus on core chatbot functionalities rather than manually labeling queries.
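As a rough illustration of what "automatically generating LFs by extracting patterns from labeled user queries" could look like, the sketch below derives single-token labeling functions from a handful of labeled queries. This is not the authors' method: the single-token patterns, the `min_support` and `min_precision` thresholds, the majority-vote aggregation, the example intents, and every function name are assumptions chosen to keep the example small.

```python
import re
from collections import Counter, defaultdict
from typing import Callable, List, Optional, Tuple

ABSTAIN = None  # returned by a labeling function that does not fire
LabelingFunction = Callable[[str], Optional[str]]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def make_lf(token: str, label: str) -> LabelingFunction:
    """Build an LF that votes for `label` whenever `token` appears in the query."""
    def lf(query: str) -> Optional[str]:
        return label if token in tokenize(query) else ABSTAIN
    lf.__name__ = f"lf_{token}_{label}"
    return lf


def generate_lfs(
    labeled: List[Tuple[str, str]],
    min_support: int = 2,
    min_precision: float = 1.0,
) -> List[LabelingFunction]:
    """Keep tokens seen in at least min_support queries of one label, with
    at least min_precision of their occurrences belonging to that label."""
    counts = defaultdict(Counter)  # token -> label -> number of queries
    for text, label in labeled:
        for tok in set(tokenize(text)):
            counts[tok][label] += 1
    lfs = []
    for tok, per_label in sorted(counts.items()):
        label, hits = per_label.most_common(1)[0]
        if hits >= min_support and hits / sum(per_label.values()) >= min_precision:
            lfs.append(make_lf(tok, label))
    return lfs


def apply_lfs(lfs: List[LabelingFunction], query: str) -> Optional[str]:
    """Majority vote over the LFs that fire; ABSTAIN if none do."""
    votes = Counter(v for v in (lf(query) for lf in lfs) if v is not ABSTAIN)
    return votes.most_common(1)[0][0] if votes else ABSTAIN


if __name__ == "__main__":
    seed = [
        ("who fixed the bug in parser", "bug_owner"),
        ("who fixed this bug yesterday", "bug_owner"),
        ("list commits by alice", "commits"),
        ("show commits from last week", "commits"),
    ]
    lfs = generate_lfs(seed)
    print([lf.__name__ for lf in lfs])
    print(apply_lfs(lfs, "which commits touched the parser"))
```

The labels produced this way are noisy by construction; in the setting the abstract describes they would serve as training data for an NLU, not as final predictions.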



Article 530

Title@2025-07-17 (4): A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Title: A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Ein Data-Centric Framework zur Bewältigung phonetischer und prosodischer Herausforderungen in russischen Sprachgenerativen Modellen 解决俄罗斯语音生成模型中电话和预发挑战的数据中心框架 2507.13563v1

Authors (7): Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian

Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.



Article 531

Title@2025-07-17 (4): Culture is Not Trivia: Sociocultural Theory for Cultural NLP

Title: Culture is Not Trivia: Sociocultural Theory for Cultural NLP Kultur ist nicht Trivia: Soziokulturelle Theorie für kulturelle NLP 文化不是特里维亚文化:社会文化文化理论 2502.12057v2

Authors (3): Naitian Zhou, David Bamman, Isaac L. Bleaman

The field of cultural NLP has recently experienced rapid growth, driven by a pressing need to ensure that language technologies are effective and safe across a pluralistic user base. This work has largely progressed without a shared conception of culture, instead choosing to rely on a wide array of cultural proxies. However, this leads to a number of recurring limitations: coarse national boundaries fail to capture nuanced differences that lie within them, limited coverage restricts datasets to only a subset of usually highly represented cultures, and a lack of dynamicity results in static cultural benchmarks that do not change as culture evolves. In this position paper, we argue that these methodological limitations are symptomatic of a theoretical gap. We draw on a well-developed theory of culture from sociocultural linguistics to fill this gap by 1) demonstrating in a case study how it can clarify methodological constraints and affordances, 2) offering theoretically motivated paths forward to achieving cultural competence, and 3) arguing that localization is a more useful framing for the goals of much current work in cultural NLP.



Article 532

Title@2025-07-17 (4): Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder

Title: Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder Lesen zwischen den Zeilen: Kombination von Pausendynamik und semantischer Kohärenz zur automatisierten Bewertung von Gedankenstörungen 在两行之间阅读:将暂停动态和语义一致性结合起来,以自动评估思想紊乱 2507.13551v1

Authors (12): Feng Chen, Weizhe Xu, Changye Li, Serguei Pakhomov, Alex Cohen, Simran Bhola, Sandy Yin, Sunny X Tang, Michael Mackinley, Lena Palaniyappan, Dror Ben-Zeev, Trevor Cohen

Formal thought disorder (FTD), a hallmark of schizophrenia spectrum disorders, manifests as incoherent speech and poses challenges for clinical assessment. Traditional clinical rating scales, though validated, are resource-intensive and lack scalability. Automated speech analysis with automatic speech recognition (ASR) allows for objective quantification of linguistic and temporal features of speech, offering scalable alternatives. The use of utterance timestamps in ASR captures pause dynamics, which are thought to reflect the cognitive processes underlying speech production. However, the utility of integrating these ASR-derived features for assessing FTD severity requires further evaluation. This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). We evaluated pause-related features alongside established coherence measures, using support vector regression (SVR) to predict clinical FTD scores. Key findings demonstrate that pause features alone robustly predict the severity of FTD. Integrating pause features with semantic coherence metrics enhanced predictive performance compared to semantic-only models, with integration of independent models achieving correlations up to ρ = 0.649 and AUC = 83.71% for detecting severe cases (TOPSY, with best ρ = 0.584 and AUC = 79.23% for semantic-only models). The performance gains from integrating semantic and pause features held consistently across all contexts, though the nature of pause patterns was dataset-dependent. These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech and advance automated speech analysis in psychosis.


Article 533

Title@2025-07-17 (4): GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models

Title: GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models GOFAI trifft Generative KI: Entwicklung von Expertensystemen mittels großer Sprachmodelle GOFAI会议:通过大语言模式发展专家系统 2507.13550v1

Authors (2): Eduardo C. Garrido-Merchán, Cristina Puente

The development of large language models (LLMs) has successfully transformed knowledge-based systems such as open-domain question answering, which can automatically produce vast amounts of seemingly coherent information. Yet those models have several disadvantages, such as hallucinations or the confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.


Article 534

Title@2025-07-17 (4): A Computational Approach to Modeling Conversational Systems: Analyzing Large-Scale Quasi-Patterned Dialogue Flows

Title: A Computational Approach to Modeling Conversational Systems: Analyzing Large-Scale Quasi-Patterned Dialogue Flows Ein Computational Approach zur Modellierung von Gesprächssystemen: Analysieren großräumiger Quasi-gemusterter Dialogströme 模拟交汇系统模型化的计算方法:分析大型准源对话流量 2507.13544v1

Authors (2): Mohamed Achref Ben Ammar, Mohamed Taha Bennani

The analysis of conversational dynamics has gained increasing importance with the rise of large language model-based systems, which interact with users across diverse contexts. In this work, we propose a novel computational framework for constructing conversational graphs that capture the flow and structure of loosely organized dialogues, referred to as quasi-patterned conversations. We introduce the Filter & Reconnect method, a novel graph simplification technique that minimizes noise while preserving semantic coherence and structural integrity of conversational graphs. Through comparative analysis, we demonstrate that the use of large language models combined with our graph simplification technique has resulted in semantic metric S increasing by a factor of 2.06 compared to previous approaches while simultaneously enforcing a tree-like structure with 0 δ-hyperbolicity, ensuring optimal clarity in conversation modeling. This work provides a computational method for analyzing large-scale dialogue datasets, with practical applications related to monitoring automated systems such as chatbots, dialogue management tools, and user behavior analytics.
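
The claim that a tree-like graph has zero δ-hyperbolicity can be checked with the standard four-point condition. The following is a minimal sketch, assuming a connected, unweighted, undirected graph given as an adjacency dictionary. The function names are illustrative and this is not the paper's implementation; the brute-force loop over all quadruples is only practical for small graphs.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path hop counts from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def delta_hyperbolicity(adj):
    """Gromov delta via the four-point condition (assumes a connected graph)."""
    d = {u: bfs_distances(adj, u) for u in adj}
    delta = 0.0
    for x, y, z, w in combinations(adj, 4):
        # The three ways of pairing up four points.
        sums = sorted([d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]])
        # Half the gap between the two largest pairwise sums.
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta
```

For example, a path or any other tree yields 0.0, while a 4-cycle yields 1.0 under this definition.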


Article 535

Title@2025-07-17 (4): From Code to Compliance: Assessing ChatGPT’s Utility in Designing an Accessible Webpage – A Case Study

Title: From Code to Compliance: Assessing ChatGPT’s Utility in Designing an Accessible Webpage – A Case Study Von Code zur Compliance: Bewertung des Nutzens von ChatGPT bei der Gestaltung einer barrierefreien Webseite – Eine Fallstudie 从代码到合规:评估查盖伯特在设计无障碍网页方面的效用 – – 案例研究 2501.03572v2

Authors (4): Ammar Ahmed, Margarida Fresco, Fredrik Forsberg, Hallvard Grotli

Web accessibility ensures that individuals with disabilities can access and interact with digital content without barriers, yet a significant majority of the most-used websites fail to meet accessibility standards. This study evaluates ChatGPT’s (GPT-4o) ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance, reflecting limitations in its training data and prevailing inaccessible web practices. Automated and manual testing revealed strengths in resolving simple issues but challenges with complex tasks, requiring human oversight and additional iterations. Unlike prior studies, we incorporate manual evaluation and dynamic elements, and use the visual reasoning capability of ChatGPT along with the prompts to fix accessibility issues. Providing screenshots alongside prompts enhances the LLM’s ability to address accessibility issues by allowing it to analyze surrounding components, such as determining appropriate contrast colors. We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT’s performance. These findings highlight the potential and limitations of large language models for accessible web development, offering practical guidance for developers to create more inclusive websites.


Article 536

Title@2025-07-17 (4): Encoding syntactic objects and Merge operations in function spaces

Title: Encoding syntactic objects and Merge operations in function spaces Kodierung syntaktischer Objekte und Zusammenführen von Operationen in Funktionsräumen 在功能空格中编码同族天体和合并操作 2507.13501v1

Authors (2): Matilde Marcolli, Robert C. Berwick

We provide a mathematical argument showing that, given a representation of lexical items as functions (wavelets, for instance) in some function space, it is possible to construct a faithful representation of arbitrary syntactic objects in the same function space. This space can be endowed with a commutative non-associative semiring structure built using the second Rényi entropy. The resulting representation of syntactic objects is compatible with the magma structure. The resulting set of functions is an algebra over an operad, where the operations in the operad model circuits that transform the input wave forms into a combined output that encodes the syntactic structure. The action of Merge on workspaces is faithfully implemented as action on these circuits, through a coproduct and a Hopf algebra Markov chain. The results obtained here provide a constructive argument showing the theoretical possibility of a neurocomputational realization of the core computational structure of syntax. We also present a particular case of this general construction where this type of realization of Merge is implemented as a cross-frequency phase synchronization on sinusoidal waves. This also shows that Merge can be expressed in terms of the successor function of a semiring, thus clarifying the well-known observation of its similarities with the successor function of arithmetic.


Article 537

Title@2025-07-17 (4): The role of large language models in UI/UX design: A systematic literature review

Title: The role of large language models in UI/UX design: A systematic literature review Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht 大语言模型在UI/UX设计中的作用:系统文献审查 2507.04469v2

Authors (2): Ammar Ahmed, Ali Shariq Imran

This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.


Article 538

Title@2025-07-17 (4): ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Title: ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data ParaPO: Sprachmodelle so ausrichten, dass verbatime Reproduktion von Vortrainingsdaten reduziert wird ParaPO:调整语文模式,减少培训前数据的逐字记录 2504.14452v2

Authors (8): Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi

Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).


Article 539

Title@2025-07-17 (4): Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?

Title: Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? Die LLM Value Probing Strategies: Sind sie robust und ausdrucksstark? 重新研究LLM 价值检验战略:它们是否有力和具有表现力? 2507.13490v1

Authors (6): Siqi Shen, Mehar Singh, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Rada Mihalcea

There has been extensive research on assessing the value orientation of Large Language Models (LLMs) as it can shape user experiences across demographic groups. However, several challenges remain. First, while the Multiple Choice Question (MCQ) setting has been shown to be vulnerable to perturbations, there is no systematic comparison of probing methods for value probing. Second, it is unclear to what extent the probed values capture in-context information and reflect models’ preferences for real-world actions. In this paper, we evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We use variations in prompts and options, showing that all methods exhibit large variances under input perturbations. We also introduce two tasks studying whether the values are responsive to demographic context, and how well they align with the models’ behaviors in value-related scenarios. We show that the demographic context has little effect on the free-text generation, and the models’ values only weakly correlate with their preference for value-based actions. Our work highlights the need for a more careful examination of LLM value probing and awareness of its limitations.


Article 540

Title@2025-07-17 (4): On Pre-training of Multimodal Language Models Customized for Chart Understanding

Title: On Pre-training of Multimodal Language Models Customized for Chart Understanding Zur Vorausbildung multimodaler Sprachmodelle, die für das Chart-Verständnis angepasst sind 为了解图表而定制的多模式语言模型的预培训 2407.14506v3

Authors (5): Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models’ capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs’ comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Randomly replacing images with their textual representation during end-to-end fine-tuning transfers the language reasoning capability to chart interpretation skills. (3) Requiring the model to first extract the underlying chart data and then answer the question during fine-tuning can further improve accuracy. Consequently, we introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension. CHOPINLLM effectively interprets various types of charts, including unannotated ones, while maintaining robust reasoning abilities. Furthermore, we establish a new benchmark to evaluate MLLMs’ understanding of different chart types across various comprehension levels. Experimental results show that CHOPINLLM exhibits strong performance in understanding both annotated and unannotated charts across a wide range of types.


Article 541

Title@2025-07-17 (4): RExBench: Can coding agents autonomously implement AI research extensions?

Title: RExBench: Can coding agents autonomously implement AI research extensions? RExBench: Können Codierer KI-Forschungserweiterungen autonom implementieren? RExBench:编码代理商能否自主实施AI研究扩展? 2506.22598v2

Authors (6): Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.


Article 542

Title@2025-07-17 (4): Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Title: Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers Papierzusammenfassung Angriff: Jailbreaking LLMs durch LLM Safety Papers 论文摘要攻击:通过LLM 安全文件建造监狱的LLMLM 2507.13474v1

Authors (8): Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu

The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work has further revealed diametrically opposed vulnerability biases across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at https://github.com/233liang/Paper-Summary-Attack


Article 543

Title@2025-07-17 (4): psifx – Psychological and Social Interactions Feature Extraction Package

Title: psifx – Psychological and Social Interactions Feature Extraction Package psifx – Psychologische und soziale Interaktionen Feature Extraction Package psifx – – 心理和社会互动 2407.10266v4

Authors (3): Guillaume Rochette, Mathieu Rochat, Matthew J. Vowels

psifx is a plug-and-play multi-modal feature extraction toolkit, aiming to facilitate and democratize the use of state-of-the-art machine learning techniques for human sciences research. It is motivated by a need (a) to automate and standardize data annotation processes that typically require expensive, lengthy, and inconsistent human labour; (b) to develop and distribute open-source community-driven psychology research software; and (c) to enable large-scale access and ease of use for non-expert users. The framework contains an array of tools for tasks such as speaker diarization, closed-caption transcription and translation from audio; body, hand, and facial pose estimation and gaze tracking with multi-person tracking from video; and interactive textual feature extraction supported by large language models. The package has been designed with a modular and task-oriented approach, enabling the community to add or update new tools easily. This combination creates new opportunities for in-depth study of real-time behavioral phenomena in psychological and social science research.


Article 544

Title@2025-07-17 (4): VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Title: VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning VisionThink: Intelligentes und effizientes Vision-Sprachmodell durch Verstärkungslernen 远景设想:通过强化学习建立聪明、高效的愿景语言模式 2507.13348v1

Authors (6): Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.


Article 545

Title@2025-07-17 (4): DeFine: Decision-Making with Analogical Reasoning over Factor Profiles

Title: DeFine: Decision-Making with Analogical Reasoning over Factor Profiles DeFine: Entscheidungsfindung mit analogischer Begründung über Faktorprofile DeFine: 与因子剖析档的模拟理由有关的决策 2410.01772v2

Authors (8): Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu

LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. E.g., during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.


Article 546

Title@2025-07-17 (4): Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Title: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes Vergleich von Äpfeln mit Orangen: Ein Datensatz & Analyse des LLM Humorverständnisses von traditionellen Puns zu thematischen Witzen 将苹果与橙类比较:从传统Puns到专题笑话的LLM Humour理解数据集和分析 2507.13335v1

Authors (3): Tyler Loakman, William Thorne, Chenghua Lin

Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.


Article 547

Title@2025-07-17 (4): The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Title: The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Die Imitation Spiel: Turing Machine Imitator ist Länge Generalizable Reasoner 模拟游戏:图画机器模拟器是长可概括的理由 2507.13332v1

Authors (7): Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen

Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can thus be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine. The data linearly expand the reasoning steps into atomic states to alleviate shortcut learning, and include an explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
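
The idea of expanding a computation into atomic Turing Machine steps can be illustrated with a small simulator that logs every read, write, and head move. This is a minimal sketch under assumed conventions: the trace format, the names `run_turing_machine` and `INCREMENT_RULES`, and the binary-increment task are all illustrative choices and do not reproduce TAIL's actual data format.

```python
def run_turing_machine(tape, rules, state="carry", blank="_", max_steps=1000):
    """Run a single-tape machine and return (final tape string, step trace).

    The head starts on the rightmost cell; rules map (state, symbol) to
    (symbol to write, move in {"L", "R", "N"}, next state).
    """
    cells = dict(enumerate(tape))
    head = len(tape) - 1
    trace = []
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        write, move, next_state = rules[(state, symbol)]
        # Each trace line is one atomic step: read, write, move, transition.
        trace.append(
            f"state={state} head={head} read={symbol} "
            f"write={write} move={move} next={next_state}"
        )
        cells[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
        state = next_state
    lo, hi = min(cells), max(cells)
    result = "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)
    return result, trace

# Hypothetical rule table: add 1 to a binary number, propagating the carry leftwards.
INCREMENT_RULES = {
    ("carry", "1"): ("0", "L", "carry"),
    ("carry", "0"): ("1", "N", "halt"),
    ("carry", "_"): ("1", "N", "halt"),
}
```

Calling `run_turing_machine("1011", INCREMENT_RULES)` yields the incremented tape together with one trace line per step, so longer inputs produce proportionally longer traces built from the same atomic operations.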


Article 548

Title@2025-07-17 (4): Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Title: Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It Vision-and-Language Training hilft, taxonomisches Wissen zu implementieren, ändert es aber nicht grundlegend 愿景和语言培训帮助利用分类学知识,但不能从根本上改变这种知识。 2507.13328v1

Authors (6): Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim

Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.


Article 549

Title@2025-07-17 (4): Social and Political Framing in Search Engine Results

Title: Social and Political Framing in Search Engine Results Soziale und politische Framing in Suchmaschinen-Ergebnissen 寻找引擎结果中的社会和政治形式 2507.13325v1

Authors (2): Amrit Poudel, Tim Weninger

Search engines play a crucial role in shaping public discourse by influencing how information is accessed and framed. While prior research has extensively examined various dimensions of search bias – such as content prioritization, indexical bias, political polarization, and sources of bias – an important question remains underexplored: how search engines and ideologically-motivated user queries contribute to bias in search results. This study analyzes the outputs of major search engines using a dataset of political and social topics. The findings reveal not only that search engines prioritize content in ways that reflect underlying biases but also that ideologically-driven user queries exacerbate these biases, resulting in the amplification of specific narratives. Moreover, significant differences were observed across search engines in terms of the sources they prioritize. These results suggest that search engines may play a pivotal role in shaping public perceptions by reinforcing ideological divides, thereby contributing to the broader issue of information polarization.


Article 550

Title@2025-07-17 (4): HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

Title: HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals HapticCap: Ein multimodaler Datensatz und die Aufgabe, die Benutzererfahrung von Schwingungshaptischen Signalen zu verstehen HapticCap:多模式数据集和了解用户振动信号信号体验的任务 2507.13318v1

Authors (3): Guimin Hu, Daniel Hershcovich, Hasti Seifi

Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and task of matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) a lack of large haptic vibration datasets annotated with textual descriptions, as collecting haptic descriptions is time-consuming, and (2) the limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together text representations within specific categories and vibrations. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category.


Article 551

Title@2025-07-17 (4): HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Title: HuggingGraph: Understanding the Supply Chain of LLM Ecosystem HuggingGraph: Die Lieferkette von LLM Ecosystem verstehen HugggGraph:了解LLM生态系统的供应链 2507.14240v1

Authors (3): Mohammad Shahedur Rahman, Peng Gao, Yuede Ji

Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem’s ongoing evolution.
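
A directed heterogeneous graph of models and datasets, and its degree distribution, can be sketched in a few lines. This is a minimal illustration under assumed conventions: the edge tuple layout, the relation labels, and the function names are hypothetical and are not taken from the HuggingGraph project.

```python
from collections import Counter, defaultdict

def build_graph(edges):
    """Build a typed directed graph.

    edges: iterable of (source, source_type, relation, target, target_type).
    """
    node_types = {}
    out_edges = defaultdict(list)
    in_degree = Counter()
    for src, src_type, relation, dst, dst_type in edges:
        node_types[src] = src_type
        node_types[dst] = dst_type
        out_edges[src].append((relation, dst))
        in_degree[dst] += 1
    return node_types, out_edges, in_degree

def degree_histogram(node_types, out_edges, in_degree):
    """Map each total degree (in + out) to the number of nodes having it."""
    total = {
        node: len(out_edges.get(node, [])) + in_degree.get(node, 0)
        for node in node_types
    }
    return Counter(total.values())
```

On a real supply-chain graph, plotting this histogram on log-log axes is one common way to inspect whether the degree distribution looks heavy-tailed.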



Article 552

Title@2025-07-17 (4): Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

Title: Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information Ermittlung von Aufgabengruppen für Multi-Task-Lernen mit pointwise V-Usable Information 利用有分点的V-可靠信息确定多任务学习组 2410.12774v2

Authors (4): Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova

The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, defining a metric that identifies the best task grouping out of a pool of many potential task combinations remains a challenging research topic. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks whose PVI estimates are not statistically different are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
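
As a rough illustration of the grouping idea, the sketch below computes PVI for single instances and compares two tasks' PVI samples. It assumes the usual definition of PVI as the gain in log-probability (in bits) of the gold label when the model sees the input versus a null input. The function names, the use of Welch's t statistic, and the threshold of 2.0 are illustrative assumptions, not the paper's procedure.

```python
import math
from statistics import mean, variance


def pvi(p_with_input: float, p_null_input: float) -> float:
    """PVI in bits: log2 p(y | x) - log2 p(y | null input).

    Both probabilities are assumed to come from models fine-tuned
    with and without access to the input, respectively.
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


def welch_t(a, b):
    """Welch's t statistic for two samples of per-instance PVI values."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))


def similar_difficulty(a, b, threshold=2.0):
    """Hypothetical rule: treat tasks as groupable when |t| is small."""
    return abs(welch_t(a, b)) < threshold
```

A proper analysis would use a full significance test with a chosen alpha level; the fixed threshold here only keeps the example dependency-free.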



Article 553

Title@2025-07-17 (4): The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Title: The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations Die Generative Energy Arena (GEA): Einbeziehung des Energiebewusstseins in das Large Language Model (LLM) Human Assessments 产生能源竞技场:将能源意识纳入大语言模型(LLM)人类评估 2507.13302v1

Authors (5): Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another alternative is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption; it is therefore of interest to evaluate how energy awareness influences human decisions when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
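
The abstract does not say how pairwise votes become a model ranking. One common choice for arena-style evaluation is an Elo-style rating update, sketched below. Whether GEA uses this scheme is an assumption, and the model names, starting rating, and `k` factor are purely illustrative.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner after one human vote."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


def rank(votes, models, start: float = 1000.0):
    """Replay (winner, loser) votes in order and return models best-first."""
    ratings = {m: start for m in models}
    for winner, loser in votes:
        update(ratings, winner, loser)
    return sorted(ratings, key=ratings.get, reverse=True)
```

For example, `rank([("small", "large"), ("small", "large"), ("large", "small")], ["small", "large"])` places `"small"` first, since it wins two of the three votes.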



Article 554

Title@2025-07-17 (4): AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Title: AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research AbGen: Bewertung großer Sprachmodelle in Ablationsstudiendesign und Evaluation für wissenschaftliche Forschung AbGen:评估用于科学研究的实验研究设计和评价中的大语言模型 2507.13300v1

Authors (8): Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.



Article 555

Title@2025-07-17 (4): Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis

Title: Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis Multi-Agent Synergy-getriebene iterative visuelle Narrative Synthese 多机构协同-驱动动态迭代视觉叙述合成 2507.13285v1

Authors (8): Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao

Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.



Article 556

Title@2025-07-17 (4): ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Title: ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v2

Authors (8): Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models have weak multi-turn interaction capabilities, especially for long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results show that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.



Article 557

Title@2025-07-17 (4): Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management

Title: Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management Überblick über das TalentCLEF 2025: Kompetenz- und Berufstitel-Intelligenz für Human Capital Management 《2025年人才人才-CLEF概览:人力资本管理技能和职称情报》 2507.13275v1

Authors (7): Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib

Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.



Article 558

Title@2025-07-17 (4): Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Title: Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering Sichere Multifaceted-RAG für Unternehmen: Hybrides Knowledge Retrieval mit Security-Filterung 企业安全多面安全RAG:带安全过滤器的混合知识检索 2504.13425v2

Authors (4): Grace Byun, Shinsun Lee, Nayoung Choi, Jinho D. Choi

Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.



Article 559

Title@2025-07-17 (4): QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Title: QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation QuestA: Erweitern der Begründungskapazität in LLMs durch Frageerweiterung 目标A:通过问题增加扩大LLMs的理据能力 2507.13266v1

Authors (8): Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang

Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
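
For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the widely used unbiased estimator: given `n` sampled solutions of which `c` are correct, it returns the probability that at least one of `k` randomly drawn samples is correct. Whether the paper computes pass@k in exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=10` samples and `c=3` correct, `pass_at_k(10, 3, 1)` equals the plain success rate of 0.3, while larger `k` gives a higher value because any one correct draw suffices.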



Article 560

Title@2025-07-17 (4): Automating Steering for Safe Multimodal Large Language Models

Title: Automating Steering for Safe Multimodal Large Language Models Automatisierungslenkung für sichere multimodale große Sprachmodelle 安全多式联运大语言模式自动化指导 2507.13255v1

Authors (7): Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.



Article 561

Title@2025-07-17 (4): ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Title: ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs ConTextual: Verbesserung der klinischen Textzusammenfassung in LLMs mit kontextschonender Token-Filterung und Wissensgraphen 共同方式:改进LLMLLM的临床文本摘要,同时保持上下文透视和知识图 2504.16394v3

Authors (2): Fahmida Liza Piya, Rahmatollah Beheshti

Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.



Article 562

Title@2025-07-17 (4): Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Title: Enhancing Cross-task Transfer of Large Language Models via Activation Steering Verbesserung der Cross-Task-Übertragung großer Sprachmodelle durch Aktivierungslenkung 通过启动指导加强大语言模式的跨任务转让 2507.13236v1

Authors (8): Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model’s internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
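
To make the idea of steering in latent space concrete, the sketch below shows a generic difference-of-means steering vector added to a hidden state. It is not the CAST method itself: the sample selection and contrastive representation-enhanced activations described in the abstract are not reproduced, and all function names and the `alpha` scale are illustrative assumptions.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(with_examples, without_examples):
    """Mean activation with in-context examples minus mean without them."""
    mu_pos = mean_vector(with_examples)
    mu_neg = mean_vector(without_examples)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

In a real model, the vectors would be hidden states taken from a chosen layer, and `steer` would be applied inside a forward hook at inference time.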



Article 563

Title@2025-07-17 (4): CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings

Title: CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings CoDet-M4: Erkennung maschinengenerierter Codes in Multi-Lingual-, Multi-Generator- und Multi-Domain-Einstellungen CoDet-M4:多语言、多驱动器和多域设置中的检测机生成代码 2503.13733v2

Authors (3): Daniil Orel, Dilshod Azizov, Preslav Nakov

Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, apply rigorous data quality checks and feature engineering, and conduct a comparative evaluation of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.



Article 564

Title@2025-07-17 (4): A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans

Title: A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans Ein Vergleichsansatz zur Beurteilung sprachlicher Kreativität von großen Sprachmodellen und Menschen 评估大语言模式和人类语言创造性的比较方法 2507.12039v2

Authors (3): Anca Dinu, Andra-Maria Florescu, Alina Resceanu

The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using the OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but also did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(xtending)-creativity, while LLMs favor F(ixed)-creativity.



Article 565

Title@2025-07-17 (4): GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems

Title: GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems GEMMAS: Graph-basierte Evaluations-Metriken für Multi-Agent-Systeme GEMMAS:基于图表的多剂系统评价计量表 2507.13190v1

Authors (5): Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma

Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.



Article 566

Title@2025-07-17 (4): Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

Title: Feature-based analysis of oral narratives from Afrikaans and isiXhosa children Feature-basierte Analyse oraler Erzählungen von Afrikaans und isiXhosa-Kindern 对南非荷兰语和土著Xhoosa儿童口述叙述的基于特征的分析 2507.13164v1

Authors (6): Emma Sharratt, Annelien Smith, Retief Louw, Daleen Klop, Febe de Wet, Herman Kamper

Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.



Article 567

Title@2025-07-17 (4): CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Title: CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation CCL-XCoT: Eine effiziente Cross-Lingual Knowledge Transfer Methode zur Minderung der Halluzination Generation CCL-XCot: 用于减少幻觉一代的有效交叉知识转让方法 2507.14239v1

Authors (6): Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou

Multilingual Large Language Models (MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT (Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.



Article 568

Title@2025-07-17 (4): Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Title: Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities Inverse Stärkung Lernen trifft auf großes Sprachmodell Post-Training: Grundlagen, Fortschritte und Chancen 培训后培训:基础、进步和机会 2507.13158v1

Authors (2): Hao Sun, Mihaela van der Schaar

In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.



Article 569

Title@2025-07-17 (4): SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

Title: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2

Authors (9): Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues: for example, in SWE-bench, 32.67% of successful patches reportedly involve direct solution leakage, and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.



Article 570

Title@2025-07-17 (4): Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v1

Authors (6): Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou

Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a potentially more reliable path towards fairness than persona simulation.


Article 571

Title@2025-07-17 (4): A Computational Framework to Identify Self-Aspects in Text

Title: A Computational Framework to Identify Self-Aspects in Text Ein Computational Framework zur Identifizierung von Selbstaspekten im Text 文本中识别自我特征的计算框架 2507.13115v1

Authors (3): Jaya Caporusso, Matthew Purver, Senja Pollak

This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.


Article 572

Title@2025-07-17 (4): Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Title: Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression Task-Circuit Quantization: Nutzung von Wissen Lokalisierung und Dolmetschbarkeit für Komprimierung 任务-环境环境定量:利用知识本地化和压缩解释 2504.07389v2

Authors (4): Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal

Post-training quantization (PTQ) reduces a model’s memory footprint by mapping full-precision weights to low-bit weights without costly retraining, but can degrade its downstream performance, especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits, which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while adding only a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning on both general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regimes. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct’s unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing that TaCQ’s ability to identify important weights is not limited to task-conditioned settings.
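
The idea of scoring weights by the expected quantization change times the gradient can be illustrated with a toy sketch. This is not the TaCQ implementation: the first-order score `|g * (q(w) - w)|`, the min-max uniform quantizer, and the names `uniform_quantize` and `mixed_precision` are assumptions made for illustration only.

```python
def uniform_quantize(w, bits, w_min, w_max):
    """Round w to the nearest of 2**bits evenly spaced levels in [w_min, w_max]."""
    levels = 2 ** bits - 1
    step = (w_max - w_min) / levels
    return w_min + round((w - w_min) / step) * step


def mixed_precision(weights, grads, bits, keep):
    """Quantize all weights except the `keep` highest-scoring ones (hypothetical helper).

    Score (an assumption, not the paper's exact saliency): the first-order
    estimate |gradient * (quantized - original)| of how much the task loss
    would change if that weight alone were quantized.
    """
    w_min, w_max = min(weights), max(weights)
    quantized = [uniform_quantize(w, bits, w_min, w_max) for w in weights]
    scores = [abs(g * (q - w)) for w, g, q in zip(weights, grads, quantized)]
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    protected = set(ranked[:keep])
    mixed = [w if i in protected else q
             for i, (w, q) in enumerate(zip(weights, quantized))]
    return mixed, protected


# Toy data: five weights with made-up task gradients.
weights = [-1.0, -0.4, 0.1, 0.45, 1.0]
grads = [0.0, 2.0, 0.1, 0.5, 0.0]
mixed, protected = mixed_precision(weights, grads, bits=2, keep=1)
```

In this toy run the weight at index 1 is protected: it moves only slightly under quantization, but its large gradient makes that small move costly, which a magnitude-only criterion would miss.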


Article 573

Title@2025-07-17 (4): Language Models Change Facts Based on the Way You Talk

Title: Language Models Change Facts Based on the Way You Talk Sprachmodelle ändern Fakten anhand der Art und Weise, wie Sie sprechen 以你说话的方式为基础的语言模式改变事实 2507.14238v1

Authors (3): Matthew Kearney, Reuben Binns, Yarin Gal

Large language models (LLMs) are increasingly being used in user-facing applications, from providing medical consultations to job interview advice. Recent research suggests that these models are becoming increasingly proficient at inferring identity information about the author of a piece of text from linguistic patterns as subtle as the choice of a few words. However, little is known about how LLMs use this information in their decision-making in real-world applications. We perform the first comprehensive analysis of how identity markers present in a user’s writing bias LLM responses across five different high-stakes LLM applications in the domains of medicine, law, politics, government benefits, and job salaries. We find that LLMs are extremely sensitive to markers of identity in user queries and that race, gender, and age consistently influence LLM responses in these applications. For instance, when providing medical advice, we find that models apply different standards of care to individuals of different ethnicities for the same symptoms; we find that LLMs are more likely to alter answers to align with a conservative (liberal) political worldview when asked factual questions by older (younger) individuals; and that LLMs recommend lower salaries for non-White job applicants and higher salaries for women compared to men. Taken together, these biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities. Beyond providing an analysis, we also provide new tools for evaluating how subtle encoding of identity in users’ language choices impacts model decisions. Given the serious implications of these findings, we recommend that similar thorough assessments of LLM use in user-facing applications are conducted before future deployment.


Article 574

Title@2025-07-17 (4): SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

Title: SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts SemCSE: Semantische kontrastive Satzeinbettungen mit LLM-generierten Zusammenfassungen für wissenschaftliche Abstracts SEMCSE: 使用LLM创制的科学摘要摘要 2507.13105v1

Authors (2): Marc Brinner, Sina Zarriess

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model’s ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.


Article 575

Title@2025-07-17 (4): Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Title: Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle 大型视觉语言模型统一三维级幻觉评价 2410.23114v4

Authors (4): Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung

Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure the object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.


Article 576

Title@2025-07-17 (4): SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

Title: SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v2

Authors (3): Xingyang He, Xiao Ling, Jie Liu

Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such a global length penalty often leads to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE), and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.


Article 577

Title@2025-07-17 (4): MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Title: MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2

Authors (23): Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.


Article 578

Title@2025-07-17 (4): Formalizing Attack Scenario Description: A Proposed Model

Title: Formalizing Attack Scenario Description: A Proposed Model Formalisierung des Angriffsszenarios Beschreibung: Ein vorgeschlagenes Modell 正式化攻击设想情况说明:拟议模式 2507.13076v1

Authors (2): Quentin Goux, Nadira Lammari

Organizations face an ever-changing threat landscape. They must continuously dedicate significant effort to protecting their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. In this paper, we address this need for processes that use attack scenarios as input. Among these processes, one can mention the generation of scripts for attack simulation and training purposes, as well as the analysis of attacks. The paper’s main research contribution is therefore a novel formal model that encompasses the attack’s context description and its scenario. It is abstracted using a UML class model. Once our model has been described, we show how it can serve an upstream attack analysis process. We also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this research work.


Article 579

Title@2025-07-17 (4): Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Title: Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities Rethinking the Embodyd Gap in Vision-and-Language Navigation: Eine ganzheitliche Studie physischer und visueller Disparitäten 重新思考视觉和语言导航中的 “ 内博差距 “ :关于物理和视觉差异的综合研究 2507.13019v1

Authors (9): Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang

Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment’s overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.


Article 580

Title@2025-07-17 (4): Teach Old SAEs New Domain Tricks with Boosting

Title: Teach Old SAEs New Domain Tricks with Boosting Lehren Sie alte SAEs neue Domain Tricks mit Förderung 教授旧的 SAEs 新域圈套 2507.12990v1

Authors (6): Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov

Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
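
The residual idea, a second model that captures what the first one fails to reconstruct, with both outputs summed at inference, can be shown on a deliberately tiny example. This is a sketch under assumptions, not the authors' code: the "SAEs" here are fixed one-direction dictionaries with a ReLU encoder and tied decoder, nothing is trained, and the secondary model is assumed to receive the same input as the primary one. All function names are hypothetical.

```python
def encode(x, dictionary):
    """ReLU of the dot product between x and each dictionary direction."""
    return [max(0.0, sum(xi * di for xi, di in zip(x, d))) for d in dictionary]


def decode(activations, dictionary, dim):
    """Weighted sum of dictionary directions (tied encoder/decoder weights)."""
    out = [0.0] * dim
    for a, d in zip(activations, dictionary):
        for i, di in enumerate(d):
            out[i] += a * di
    return out


def reconstruct(x, dictionary):
    return decode(encode(x, dictionary), dictionary, len(x))


def boosted_reconstruct(x, primary_dict, residual_dict):
    """Sum the primary reconstruction and the secondary model's output."""
    primary_out = reconstruct(x, primary_dict)
    residual_out = reconstruct(x, residual_dict)  # stands in for the residual SAE
    return [p + r for p, r in zip(primary_out, residual_out)]


def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Toy "domain-specific" activation: the primary dictionary only knows axis 0,
# so everything along axis 1 is reconstruction error.
x = [3.0, 4.0]
primary_dict = [[1.0, 0.0]]
residual_dict = [[0.0, 1.0]]  # direction the secondary model would have to learn

primary_error = squared_error(x, reconstruct(x, primary_dict))
boosted_error = squared_error(x, boosted_reconstruct(x, primary_dict, residual_dict))
```

Because the primary dictionary is left untouched, its behavior on inputs it already reconstructs well is unchanged; the secondary dictionary only adds what was missing.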


Article 581

Title@2025-07-17 (4): Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Title: Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v2

Authors (5): Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler

In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.


Article 582

Title@2025-07-17 (4): MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps

Title: MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps MRT bei IberLEF-2025 PRESTA Aufgabe: Maximierung der Erholung von Tischen mit mehreren Schritten IberLEF-2025 PRESTA任务:最大限度地从有多个步骤的表格中回收 2507.12981v1

Authors (5): Maximiliano Hormazábal Lagos, Álvaro Bueno Sáez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro

This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Español (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the related SemEval 2025 task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85% in the task.
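
The last steps of such a pipeline (run the generated code, catch failures, retry) can be sketched as a small loop. This is an illustrative assumption, not the MRT system: `generate_code` stands in for an LLM call, the convention that generated code defines `answer(table)` is invented here, and `stub_generator` fakes a model that fixes its mistake after seeing the error message.

```python
def run_generated_code(code, table):
    """Execute generated source that must define `answer(table)`; return its result.

    Note: `exec` on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    exec(code, namespace)
    return namespace["answer"](table)


def answer_with_retries(generate_code, table, question, max_attempts=3):
    """Ask for code, run it, and feed any exception back into the next attempt."""
    feedback = None
    for _ in range(max_attempts):
        code = generate_code(question, feedback)
        try:
            return run_generated_code(code, table)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            feedback = f"{type(exc).__name__}: {exc}"
    return None  # all attempts failed


def stub_generator(question, feedback):
    """Hypothetical stand-in for an LLM: wrong column first, corrected on retry."""
    if feedback is None:
        return "def answer(table):\n    return sum(row['price'] for row in table)"
    return "def answer(table):\n    return sum(row['precio'] for row in table)"


table = [{"producto": "a", "precio": 2}, {"producto": "b", "precio": 3}]
result = answer_with_retries(stub_generator, table, "¿Cuál es el precio total?")
```

The first attempt raises a `KeyError` because the column is named `precio`, not `price`; the error text is passed back as feedback and the second attempt succeeds.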


Article 583

Title@2025-07-17 (4): UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

Title: UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets UniSLU: Unified Spoken Language Understanding aus heterogenen Cross-Task-Datensätzen UUSLU:从不同式跨任务数据集获得统一口语语言理解 2507.12951v1

Authors (4): Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li

Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, which achieves superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.


Article 584

Title@2025-07-17 (4): Probabilistic Soundness Guarantees in LLM Reasoning Chains

Title: Probabilistic Soundness Guarantees in LLM Reasoning Chains Probabilistische Solidität garantiert in LLM-Aufklärungsketten LLM 理赔链条的概率稳妥性保障 2507.12948v1

Authors (7): Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong

In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).


Article 585

Title: OASIS: Order-Augmented Strategy for Improved Code Search OASIS: Order-Augmented Strategy for Improved Code Search OASIS:改进守则搜索的有秩序加强战略 2503.08161v4

Authors (9): Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Ziqi Zhan, Haotian Zhang, Bin Chen, Yuqun Zhang, Jing Li

Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
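
For reference, the standard InfoNCE loss that this abstract names as the prior training objective can be written in a few lines. This shows only that baseline, not the OASIS order-based objective; the function name `info_nce`, the temperature value, and the similarity numbers are illustrative assumptions.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE for one query: -log( exp(s+/t) / sum_j exp(s_j/t) ).

    sim_pos is the similarity of the positive NL-code pair, sim_negs the
    similarities to in-batch negatives. Uses a log-sum-exp shift for stability.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denominator - logits[0]


# A well-separated positive gives a small loss; a weak positive gives a larger one.
loss_good = info_nce(0.9, [0.1, 0.2])
loss_bad = info_nce(0.3, [0.1, 0.2])

# With all similarities equal, the loss is log(number of candidates).
loss_uniform = info_nce(0.5, [0.5, 0.5, 0.5])
```

The loss carries no target for how similar each negative should be to the query, only that the positive should outscore all of them, which is the gap the abstract says order-based similarity labels are meant to address.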


Article 586

Title@2025-07-17 (4): Making Language Model a Hierarchical Classifier and Generator

Title: Making Language Model a Hierarchical Classifier and Generator Sprachmodell zu einem hierarchischen Klassifikator und Generator machen 使语言模式成为等级分类和生成器 2507.12930v1

Authors (11): Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji

Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers can be adapted to produce meaningful and reasonable content, and that this hierarchical decoder paradigm can obtain state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner pretrained from scratch.


Article 587

Title@2025-07-17 (4): MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Title: MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents MEM1: Lernen, Speicher zu synergisieren und für effiziente Long-Horizon-Agenten zu verankern MEM1:学习如何使记忆和理由相互协调,以有效长森剂 2506.15841v2

Authors (9): Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.


Article 588

Title@2025-07-17 (4): Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v3

Authors (26): Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization; specifically, Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.


Article 589

Title@2025-07-17 (4): IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Title: IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization IOPO: Verstärkung von LLMs mit komplexer Anleitung über Input-Output Preference Optimization IOPO:通过投入-产出优化,以复杂教学赋予LLMs权力 2411.06208v3

Authors (5): Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15%, 2.18% improvements on in-domain data and 6.29%, 3.13% on out-of-domain data compared to SFT and DPO respectively.



Article 590

Title@2025-07-17 (4): Why Braking? Scenario Extraction and Reasoning Utilizing LLM

Title: Why Braking? Scenario Extraction and Reasoning Utilizing LLM Warum bremsen? Szenario Extraktion und Vernunft Verwendung LLM 为什么要踩脚? 设想提取和合理使用LLM 2507.15874v1

Authors (6): Yin Wu, Daniel Slieter, Vivek Subramanian, Ahmed Abouelazm, Robin Bohn, J. Marius Zöllner

The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
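The dual-path retrieval idea can be illustrated with a minimal sketch: one path filters by a known category label, the other ranks scenarios by cosine similarity to a query embedding. The scenario records, category names, and two-dimensional embeddings below are hypothetical placeholders, not data or interfaces from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_by_category(scenarios, category):
    """Path 1: exact match on a known scenario category."""
    return [s["id"] for s in scenarios if s["category"] == category]

def retrieve_by_embedding(scenarios, query_vec, top_k=1):
    """Path 2: nearest neighbours in embedding space, for scenarios with no known label."""
    ranked = sorted(scenarios, key=lambda s: cosine(s["embedding"], query_vec), reverse=True)
    return [s["id"] for s in ranked[:top_k]]

# Hypothetical scenario index.
SCENARIOS = [
    {"id": "a", "category": "pedestrian_crossing", "embedding": [1.0, 0.0]},
    {"id": "b", "category": "lead_vehicle_braking", "embedding": [0.0, 1.0]},
    {"id": "c", "category": "unknown", "embedding": [0.7, 0.7]},
]
```

In practice the embeddings would come from a text encoder applied to the generated scenario descriptions; that step is omitted here.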



Article 591

Title@2025-07-17 (4): On the Limitations of Large Language Models (LLMs): False Attribution

Title: On the Limitations of Large Language Models (LLMs): False Attribution Über die Grenzen großer Sprachmodelle (LLMs): Falsche Attribution 对大语言模式限制的限制: 2404.04631v2

Authors (4): Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney

In this work, we introduce a new hallucination metric - Simple Hallucination Index (SHI) and provide insight into one important limitation of the parametric knowledge of large language models (LLMs), i.e. false attribution. The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in a zero-shot setting (Gemma-7B, Mixtral 8x7B, and LLaMA-2-13B). We acquired the top 10 most popular books of a month, according to Project Gutenberg, divided each one into equal chunks of 400 words, and prompted each LLM to predict the author. We then randomly sampled 162 chunks per book for human evaluation, based on an error margin of 7% and a confidence level of 95%. The average results show that Mixtral 8x7B has the highest prediction accuracy, the lowest SHI, and a Pearson’s correlation (r) of 0.724, 0.263, and -0.9996, respectively, followed by LLaMA-2-13B and Gemma-7B. However, Mixtral 8x7B suffers from high hallucinations for 3 books, rising as high as a SHI of 0.87 (in the range 0-1, where 1 is the worst). The strong negative correlation of accuracy and SHI, given by r, demonstrates the fidelity of the new hallucination metric, which may generalize to other tasks. We also show that prediction accuracies correlate positively with the frequencies of Wikipedia instances of the book titles instead of the downloads, and we perform error analyses of predictions. We publicly release the annotated chunks of data and our code to aid the reproducibility and evaluation of other models.
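The correlation check reported above relies on Pearson's r between per-book accuracy and SHI. A minimal sketch of that computation follows; the abstract does not define how SHI itself is computed, so the numbers below are hypothetical values chosen only to exercise the function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-book scores, not results from the paper.
accuracy = [0.9, 0.7, 0.5, 0.2]
shi = [0.05, 0.35, 0.45, 0.85]
r = pearson_r(accuracy, shi)  # strongly negative for these made-up values
```

A value of r close to -1 means that books with higher attribution accuracy have lower hallucination scores, which is the relationship the abstract reports.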



Article 592

Title@2025-07-17 (4): Aligning Knowledge Graphs and Language Models for Factual Accuracy

Title: Aligning Knowledge Graphs and Language Models for Factual Accuracy Ausrichtung von Wissensgraphen und Sprachmodellen für die tatsächliche Genauigkeit 将知识图和语言模型与事实准确性对齐 2507.13411v1

Authors (6): Nur A Zarin Nishat, Andrea Coletta, Luigi Bellomarini, Kossi Amouzouvi, Jens Lehmann, Sahar Vahdati

Large language models like GPT-4, Gemini, and Claude have transformed natural language processing (NLP) tasks such as question answering, dialogue generation, summarization, and so forth; yet their susceptibility to hallucination stands as one of the major challenges. Among numerous approaches to overcome this challenge, integration of Knowledge Graphs (KGs) into language models has emerged as a promising solution as it provides structured, reliable, domain-specific, and up-to-date external information to the language models. In this paper, we introduce ALIGNed-LLM, a simple yet effective approach to improve language models’ factuality via a lean strategy to infuse KGs into the latent space of language models, inspired by LLaVA, where visual and textual information is infused. We use embeddings from a pre-trained Knowledge Graph Embedding (KGE) model, such as TransE, and a trainable projection layer to align entity and text embeddings. This alignment enables the language model to distinguish between similar entities, improving factual grounding and reducing hallucination. We tested our approach on three popular question-answering benchmark datasets alongside language models of varying sizes, showing significant improvement. Furthermore, we applied our approach to a real-world financial use case from a large central bank in Europe, which demands high accuracy and precision, demonstrating a substantial improvement of the LLM answers.
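Two building blocks mentioned above can be sketched in a few lines: the TransE translation distance, which treats a relation as a vector offset between head and tail entities, and a linear projection that maps an entity embedding into a text-embedding space of a different size. The dimensions and weights are toy assumptions; in the described approach the projection is trained, which is not shown.

```python
import math

def transe_distance(head, relation, tail):
    """TransE plausibility: ||head + relation - tail||, smaller means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def project(entity_vec, weight, bias):
    """Linear projection from KGE space into the language model's embedding space.

    `weight` has one row per output dimension; `bias` has one entry per row.
    """
    return [sum(w * e for w, e in zip(row, entity_vec)) + b for row, b in zip(weight, bias)]

# Toy example: a 2-d entity embedding projected into a 3-d text space.
entity = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical projection weights
b = [0.0, 0.0, 0.0]
projected = project(entity, W, b)
```

The projected vector can then be placed alongside token embeddings, analogous to how LLaVA feeds projected image features to the language model.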



Article 593

Title@2025-07-17 (4): A Logically Consistent Chain-of-Thought Approach for Stance Detection

Title: A Logically Consistent Chain-of-Thought Approach for Stance Detection Ein logisch konsistenter, schlüsselfertiger Ansatz zur Stance-Erkennung 一种逻辑上一致的研究链方法,以探测Stance 2312.16054v2

Authors (4): Bowen Zhang, Daijun Ding, Liwen Jing, Hu Huang

Zero-shot stance detection (ZSSD) aims to detect stances toward unseen targets. Incorporating background knowledge to enhance transferability between seen and unseen targets constitutes the primary approach of ZSSD. However, these methods often struggle with a knowledge-task disconnect and lack logical consistency in their predictions. To address these issues, we introduce a novel approach named Logically Consistent Chain-of-Thought (LC-CoT) for ZSSD, which improves stance detection by ensuring relevant and logically sound knowledge extraction. LC-CoT employs a three-step process. Initially, it assesses whether supplementary external knowledge is necessary. Subsequently, it uses API calls to retrieve this knowledge, which can be processed by a separate LLM. Finally, a manual exemplar guides the LLM to infer stance categories, using an if-then logical structure to maintain relevance and logical coherence. This structured approach to eliciting background knowledge enhances the model’s capability, outperforming traditional supervised methods without relying on labeled data.



Article 594

Title@2025-07-17 (4): MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

Title: MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness MAC-Tuning: Mehrkompositionelles LLM-Problem mit verbesserter Kenntnis der Grenzen des Wissens MAC-指导:LLM 以增进知识边界意识为由的多组问题 2504.21773v2

Authors (6): Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, May Fung

With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.



Article 595

Title@2025-07-17 (4): SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Title: SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems SEALGuard: Mehrsprachige Gespräche in südostasiatischen Sprachen für LLM-Softwaresysteme sichern SEALGuard:为LLM软件系统维护东南亚语言多语言对话 2507.08898v3

Authors (4): Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn

Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.
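The low-rank adaptation step can be illustrated with a small sketch of the LoRA weight update: a frozen weight matrix W is augmented by a scaled product of two low-rank matrices B and A. The matrices and hyperparameters below are toy values; the abstract does not state the ranks or target layers used for SEALGuard.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, b, a, alpha, rank):
    """Effective weight under LoRA: W + (alpha / rank) * B @ A.

    W stays frozen; only B (out x rank) and A (rank x in) are trained.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]
A = [[0.0, 2.0]]
adapted = lora_weight(W, B, A, alpha=1.0, rank=1)
```

Because B is conventionally initialised to zero, the adapted model starts out identical to the base model and only drifts as the low-rank factors are trained.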



Article 596

Title@2025-07-17 (4): Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?

Title: Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent? Sind Wissen und Referenz in mehrsprachigen Sprachmodellen bereichsübergreifend konsistent? 多语文模式中的知识和参考资料是否相互一致? 2507.12838v1

Authors (3): Xi Ai, Mahardika Krisna Ihsani, Min-Yen Kan

Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements that convey identical knowledge across languages to study cross-lingual knowledge consistency. We use some interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families, linguistic factors, and a bottleneck in cross-lingual consistency on a particular layer. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the noteworthiness of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency enhancement.



Article 597

Title@2025-07-17 (4): Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Title: Causal Language Control in Multilingual Transformers via Sparse Feature Steering Causal Language Control in Mehrsprachigen Transformatoren über Sparse Feature Steering 多语种变换器的因果语言控制 2507.13410v1

Authors (7): Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
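One common way to modify a single SAE feature is to read its current activation from the residual stream and then add the feature's decoder direction, scaled by the gap between the desired and current activation. The sketch below assumes that formulation; the paper's exact intervention may differ, and the vectors and target value are toy assumptions.

```python
def feature_activation(residual, encoder_row, bias):
    """ReLU activation of one SAE feature for a residual-stream vector."""
    pre = sum(w * x for w, x in zip(encoder_row, residual)) + bias
    return max(0.0, pre)

def steer(residual, encoder_row, bias, decoder_dir, target):
    """Shift the residual along one feature's decoder direction.

    The shift is (target - current activation) * decoder_dir, so only the
    chosen feature's contribution is changed.
    """
    current = feature_activation(residual, encoder_row, bias)
    delta = target - current
    return [x + delta * d for x, d in zip(residual, decoder_dir)]

# Toy 2-d residual stream and a single hypothetical "language" feature.
residual = [1.0, 0.0]
encoder_row = [0.5, 0.0]
decoder_dir = [0.0, 1.0]
steered = steer(residual, encoder_row, 0.0, decoder_dir, target=2.5)
```

In a real model this edit would be applied inside a forward hook at the chosen layer for every generated token.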



Article 598

Title@2025-07-17 (4): Emotional Support with LLM-based Empathetic Dialogue Generation

Title: Emotional Support with LLM-based Empathetic Dialogue Generation Emotionale Unterstützung mit LLM-basiertem Empathetic Dialogue Generation 利用基于LLM的 “ 同情对话 “ 生成的LLM “ 情感支持 2507.12820v1

Authors (5): Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, Yongxiang Li

Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model’s ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.



Article 599

Title@2025-07-17 (4): Large Language Models’ Internal Perception of Symbolic Music

Title: Large Language Models’ Internal Perception of Symbolic Music Die innere Wahrnehmung symbolischer Musik durch große Sprachmodelle 大语言模型内部对符号音乐的感知 2507.12808v1

Authors (2): Andrew Shin, Kunitake Kaneko

Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.



Article 600

Title@2025-07-17 (4): MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Title: MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models MCPEval: Automatische MCP-basierte Deep Evaluation für AI Agent Modelle MCPEval:AI 代理模型的自动MCP深度评估 2507.12806v1

Authors (12): Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong

The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.



Article 601

Title@2025-07-17 (4): PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Title: PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database PMKLC: Parallele Multi-Knowledge Learning-basierte Lossless-Kompression für großformatige Genomics-Datenbank PMKLC: 大型基因组数据库的平行多知识学习-无损失压缩 2507.12805v1

Authors (8): Hui Sun, Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai

Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve these challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressors’ backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated (s,k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036× and 10.710×, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating their greater stability against datasets with different probability distribution perturbations and their strong ability to run on memory-constrained devices.
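The abstract does not define the (s,k)-mer encoder. As a rough illustration only, the sketch below assumes that k is the k-mer length and s is the stride between consecutive k-mers, and maps each k-mer to an integer in base 4. Both the interpretation of s and the encoding scheme are assumptions, and the CPU code here does not reflect the GPU implementation.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmers(seq, k, s):
    """Encode a DNA string as integers, one per k-mer, stepping s bases at a time.

    Assumed reading of "(s,k)-mer": window length k, stride s.
    Each k-mer becomes a base-4 number, e.g. "AC" -> 0*4 + 1 = 1.
    """
    if k <= 0 or s <= 0:
        raise ValueError("k and s must be positive")
    codes = []
    for start in range(0, len(seq) - k + 1, s):
        value = 0
        for base in seq[start:start + k]:
            value = value * 4 + BASE_CODE[base]
        codes.append(value)
    return codes
```

With s equal to k the windows do not overlap and the sequence shrinks by a factor of k; with s smaller than k the windows overlap and carry more context per symbol.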



Article 602

Title@2025-07-17 (4): ReCode: Updating Code API Knowledge with Reinforcement Learning

Title: ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v2

Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
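A string-similarity reward can be sketched with Python's standard `difflib`. The abstract says the metric is modified but does not say how, so the plain `SequenceMatcher` ratio below is only a stand-in, and the function name is hypothetical.

```python
from difflib import SequenceMatcher

def similarity_reward(generated, reference):
    """Reward in [0, 1]: similarity between generated code and a reference migration.

    Uses difflib's ratio as a stand-in for the paper's modified metric.
    """
    return SequenceMatcher(None, generated, reference).ratio()

# Hypothetical migration target and two candidate completions.
reference = "np.asarray(x)"
exact = similarity_reward("np.asarray(x)", reference)
partial = similarity_reward("np.array(x)", reference)
```

A dense reward of this kind gives partial credit to near-misses, which is useful when exact-match rewards would be too sparse for reinforcement learning.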



Article 603

Title@2025-07-17 (4): Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media

Title: Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media Beyond Architectures: Bewertung der Rolle kontextueller Einbettungen bei der Erkennung bipolarer Störungen in sozialen Medien 超越建筑:评价背景嵌入在发现社会媒体两极分极分崩离析现象中的作用 2507.14231v1

Authors (2): Khalid Hasan, Jamil Saquer

Bipolar disorder is a chronic mental illness frequently underdiagnosed due to subtle early symptoms and social stigma. This paper explores advanced natural language processing (NLP) models for recognizing signs of bipolar disorder based on user-generated social media text. We conduct a comprehensive evaluation of transformer-based models (BERT, RoBERTa, ALBERT, ELECTRA, DistilBERT) and Long Short Term Memory (LSTM) models based on contextualized (BERT) and static (GloVe, Word2Vec) word embeddings. Experiments were performed on a large, annotated dataset of Reddit posts after confirming their validity through sentiment variance and judgmental analysis. Our results demonstrate that RoBERTa achieves the highest performance among transformer models with an F1 score of ~98%, while LSTM models using BERT embeddings yield nearly identical results. In contrast, LSTMs trained on static embeddings fail to capture meaningful patterns, scoring near-zero F1. These findings underscore the critical role of contextual language modeling in detecting bipolar disorder. In addition, we report model training times and highlight that DistilBERT offers an optimal balance between efficiency and accuracy. In general, our study offers actionable insights for model selection in mental health NLP applications and validates the potential of contextualized language models to support early bipolar disorder screening.



Article 604

Title@2025-07-17 (4): Learning Robust Negation Text Representations

Title: Learning Robust Negation Text Representations Robuste Negations-Textdarstellungen lernen 学习强力否定文本代表 2507.12782v1

Authors (4): Thinh Hung Truong, Karin Verspoor, Trevor Cohn, Timothy Baldwin

Despite rapid adoption of autoregressive large language models, smaller text encoders still play an important role in text understanding tasks that require rich contextualized representations. Negation is an important semantic function that is still not properly captured by such methods, affecting many downstream applications relying on text embeddings. We propose a strategy to improve negation robustness of text encoders, by distilling data from large language models using diverse patterns of negation and hedging. We adopt a standard contrastive learning strategy to finetune a strong BERT-based model, and observe large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. In addition, we also show that our method can be adapted to LLMs, leading to improved performance on negation benchmarks.



Article 605

Title@2025-07-17 (4): A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Title: A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models Eine umfassende Umfrage zur elektronischen Gesundheitsdatenmodellierung: Von Deep Learning Ansätzen bis hin zu großen Sprachmodellen 《电子健康记录模型综合调查:从深学习方法到大语言模式》 2507.12774v1

Authors (5): Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar

Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.



Article 606

Title@2025-07-17 (4): Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Title: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback Kritik-GRPO: LLM-Vernunft mit natürlicher Sprache und numerischem Feedback verbessern Critique-GROPO: 提高以自然语言和数字反馈为依据的LLM 2506.03106v4

Authors (7): Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.
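The numerical-feedback side builds on GRPO, whose core step is to normalise each sampled response's scalar reward against the other responses for the same prompt. The sketch below shows only that standard group-relative advantage; the critique-guided refinements and the shaping function of Critique-GRPO are not specified in the abstract and are not modelled here.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    `rewards` holds one scalar reward per sampled response to the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one problem, scored 1 for correct and 0 for incorrect.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one way persistent failures can stall learning from numerical feedback alone.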



Article 607

Title@2025-07-17 (4): Synergy: End-to-end Concept Model

Title: Synergy: End-to-end Concept Model Synergie: Ende-zu-Ende-Konzeptmodell 协同增效:端到端概念模型 2507.12769v1

Authors (2): Keli Zheng, Zerong Xie

In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.



Article 608

Title@2025-07-17 (4): VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents VIDEE: Visuelle und Interaktive Zersetzung, Ausführung und Auswertung von Text Analytics mit intelligenten Agenten VIDE: 视觉和交互分解、执行和评价与智能剂的文字分析分析 2506.21582v2

Authors (6): Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE’s effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience – from none to expert – demonstrates the system’s usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.



Article 609

Title@2025-07-17 (4): Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Title: Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Logit Arithmetische Elizite lange mit Gründen verbundene Fähigkeiten ohne Training 未经培训的逻辑 2507.12759v1

Authors (8): Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang

Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as a guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider models – a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in pass@1 of 26% and 29%, respectively, over four mathematical datasets using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B – a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, yielding a 13% relative improvement in pass@1 over the Qwen2.5-32B base model. Our work presents a computationally efficient method to elicit long reasoning in large models with minimal or no additional training.
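
The abstract does not spell out the logit-arithmetic formula, so the following is only a minimal sketch under an assumed form: the large target model's next-token logits are shifted by the difference between a small reasoning guider and that guider's non-reasoning base counterpart. The function names, the `alpha` weight, and the toy three-token vocabulary are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(target, guider_tuned, guider_base, alpha=1.0):
    # Assumed form: target + alpha * (guider_tuned - guider_base),
    # applied per vocabulary entry at each decoding step.
    return [t + alpha * (g - b)
            for t, g, b in zip(target, guider_tuned, guider_base)]

# Toy logits over a 3-token vocabulary (hypothetical values).
target = [2.0, 1.0, 0.0]
guider_tuned = [0.0, 2.0, 0.0]
guider_base = [0.0, 0.0, 0.0]

combined = guided_logits(target, guider_tuned, guider_base)
probs = softmax(combined)
```

In this toy case the guider's preference for the second token outweighs the target's original preference for the first, which is the kind of steering a decoding-time method relies on.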



Article 610

Title@2025-07-17 (4): Strategy Adaptation in Large Language Model Werewolf Agents

Title: Strategy Adaptation in Large Language Model Werewolf Agents Strategieanpassung im großen Sprachmodell Werwolf-Agenten 大语言示范狼人代理物的适应战略 2507.12732v1

Authors (3): Fuya Nakamori, Yin Jou Huang, Fei Cheng

This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior work on Werewolf agents using prompt engineering has employed methods in which effective strategies are implicitly defined, such methods cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy-adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.



Article 611

Title@2025-07-17 (4): TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

Title: TransEvalnia: Reasoning-based Evaluation and Ranking of Translations TransEvalnia: Reasoning-based Evaluation und Ranking von Übersetzungen 过年:基于理由的评价和笔译的排名 2507.12724v1

Authors (3): Richard Sproat, Tianyu Zhao, Llion Jones

We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic’s Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system – as well as MT-Ranker – to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system’s evaluation and reasoning and the human assessments, as well as the code, are released.
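
The abstract mentions position bias but does not describe the mitigation methods. One generic way to handle order sensitivity in pairwise judging, shown here purely as an illustrative assumption, is to query the judge in both presentation orders and accept a winner only when the two verdicts agree. The `judge` callable and its return convention (0 if the first-presented translation wins, 1 otherwise) are hypothetical.

```python
def debiased_compare(judge, a, b):
    # Ask the judge twice, swapping the presentation order.
    first = judge(a, b)
    second = judge(b, a)
    winner_first = a if first == 0 else b
    winner_second = b if second == 0 else a
    # Return a winner only if both orders agree; None signals a tie.
    return winner_first if winner_first == winner_second else None

# A judge that always prefers whatever is shown first (pure position bias).
always_first = lambda x, y: 0
# A judge whose verdict does not depend on order (toy length heuristic).
prefers_longer = lambda x, y: 0 if len(x) >= len(y) else 1

biased_result = debiased_compare(always_first, "short", "a longer translation")
stable_result = debiased_compare(prefers_longer, "short", "a longer translation")
```

A purely position-driven judge yields no winner under this scheme, while an order-independent judge returns the same translation in both passes.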



Article 612

Title@2025-07-17 (4): Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs Synthesizing Privacy-Preserving Text Data via Finetuning ohne Finetuning Billion-Scale LLMs 通过不作十亿规模的微调微调的微调合成保护隐私文本数据 2503.12347v2

Authors (5): Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu

Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
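
As a rough illustration of the "DP histogram, then sample" step, the sketch below adds Laplace noise to per-topic counts and draws topic indices in proportion to the noisy histogram. The abstract does not say which noise mechanism CTCL uses, so the Laplace mechanism, the sensitivity of 1 (one record affects one bin), the `epsilon` value, and all function names are assumptions.

```python
import random

def dp_histogram(counts, epsilon, rng):
    # Laplace(0, 1/epsilon) noise, drawn as the difference of two
    # exponential variates with rate epsilon.
    noisy = []
    for c in counts:
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy.append(max(0.0, c + noise))  # clip negative counts to zero
    total = sum(noisy)
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(counts)] * len(counts)
    return [v / total for v in noisy]

def sample_topics(hist, n, rng):
    # Draw n topic indices proportionally to the noisy histogram.
    return rng.choices(range(len(hist)), weights=hist, k=n)

rng = random.Random(0)
hist = dp_histogram([120, 30, 50], epsilon=1.0, rng=rng)
topics = sample_topics(hist, 10, rng)
```

Each sampled topic index would then condition the generator when producing one synthetic example; that conditioning step is outside this sketch.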



Article 613

Title@2025-07-17 (4): GUI Test Migration via Abstraction and Concretization

Title: GUI Test Migration via Abstraction and Concretization GUI-Test-Migration über Abstraktion und Konkretisierung GUI 通过抽象和简明化测试移民 2409.05028v2

Authors (7): Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, Lu Zhang

GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.



Article 614

Title@2025-07-17 (4): Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening

Title: Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening Fairness ist nicht genug: Auditing-Kompetenz und Intersektions-Bias in KI-powered Resume Screening 公平不够充分:审计能力和大赦国际授权的恢复筛选中的跨部门比阿斯 2507.11548v2

Authors (1): Kevin T Webster

The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the “Illusion of Neutrality” to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model’s inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.



Article 615

Title@2025-07-17 (4): ActionStudio: A Lightweight Framework for Data and Training of Large Action Models

Title: ActionStudio: A Lightweight Framework for Data and Training of Large Action Models ActionStudio: Ein leichter Rahmen für Daten und Training großer Aktionsmodelle 行动研究:关于大型行动模式的数据和培训的轻量框架 2503.22673v3

Authors (16): Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Large action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with an optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performance across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.



Article 616

Title@2025-07-17 (4): Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation

Title: Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung 引导大语言模型中传译锥体:经验评价 2506.17088v2

Authors (8): Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li

Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.



Article 617

Title@2025-07-17 (4): AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

Title: AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation AudioJudge: Verstehen, was in der großen Audiomodell basierten Sprachbewertung funktioniert 音频法官:了解大型音频示范演讲评价有什么用 2507.12705v1

Authors (8): Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang

Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
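
For readers unfamiliar with the metric reported above, Spearman correlation is the Pearson correlation computed on ranks. The sketch below is a plain-Python version with average ranks for ties; the example scores are made up for illustration and are not taken from the paper.

```python
def average_ranks(values):
    # Assign 1-based ranks, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation of the two rank vectors.
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical system-level scores: human preference vs. automatic judge.
human_scores = [1, 2, 3, 4]
judge_scores = [1, 3, 2, 4]
rho = spearman(human_scores, judge_scores)  # 0.8 for this toy ranking
```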



Article 618

Title@2025-07-17 (4): Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis

Title: Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis Ausnutzung adaptiver Kontextmasken für aspektbasierte Sentiment-Analysen 利用适应性环境掩码进行外观感应力分析 2402.13722v2

Authors (4): S M Rafiuddin, Mohammed Rakib, Sadia Kamal, Arunkumar Bagavathi

Aspect-Based Sentiment Analysis (ABSA) is a fine-grained linguistics problem that entails the extraction of multifaceted aspects, opinions, and sentiments from the given text. Both standalone and compound ABSA tasks have been extensively used in the literature to examine the nuanced information present in online reviews and social media posts. Current ABSA methods often rely on static hyperparameters for attention-masking mechanisms, which can struggle with context adaptation and may overlook the unique relevance of words in varied situations. This leads to challenges in accurately analyzing complex sentences containing multiple aspects with differing sentiments. In this work, we present adaptive masking methods that remove irrelevant tokens based on context to assist in Aspect Term Extraction and Aspect Sentiment Classification subtasks of ABSA. We show with our experiments that the proposed methods outperform the baseline methods in terms of accuracy and F1 scores on four benchmark online review datasets. Further, we show that the proposed methods can be extended with multiple adaptations and demonstrate a qualitative analysis of the proposed approach using sample text for aspect term extraction.
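
To make the contrast between a static masking hyperparameter and a context-dependent one concrete, the sketch below derives the masking threshold from the relevance scores of the sentence itself (here, their mean) instead of using a fixed constant. The abstract does not specify how the adaptive threshold is computed, so the mean-based rule, the relevance scores, and the function name are illustrative assumptions.

```python
def adaptive_mask(tokens, scores):
    # Threshold is computed per sentence from its own score distribution,
    # so it adapts to context rather than being a fixed hyperparameter.
    threshold = sum(scores) / len(scores)
    return [tok for tok, s in zip(tokens, scores) if s >= threshold]

# Hypothetical per-token relevance scores for one review sentence.
tokens = ["battery", "the", "excellent", "of"]
scores = [0.9, 0.1, 0.8, 0.2]
kept = adaptive_mask(tokens, scores)
```

With these toy scores the threshold is 0.5, so only the aspect term and the opinion word survive the mask.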



Article 619

Title@2025-07-17 (4): AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

Title: AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis AdaptiSent: Context-Aware Adaptive Aufmerksamkeit für multimodale Aspect-Based-Sentiment-Analysen 适应性:基于多种模式的光谱感应分析的上下文知识适应性关注 2507.12695v1

Authors (5): S M Rafiuddin, Sadia Kamal, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen

We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model’s ability to adjust its focus dynamically based on the context’s relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.
