cs.CL @ 2025-08-01: 666
-
00 07-31 (4) Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities Cascaded Information Disclosure for Generalized Evaluation of Problem Lösing Capabilities 用于对解决问题能力通用评价的连锁信息披露 2507.23776v1 -
01 07-31 SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model SimuRA: Auf dem Weg zu einem General Goal-Oriented Agent über Simulative Reasoning Architecture mit LLM-basiertem Weltmodell SimurRA:通过使用以LLM为基础的世界模型的模拟合理理由结构,努力实现以一般目标为导向的代理 2507.23773v1 -
02 07-31 Perception-Aware Policy Optimization for Multimodal Reasoning Perception-Aware Policy Optimization für multimodale Reasoning 对多式联运理由的观念-认知软件政策优化 2507.06448v3 -
03 07-31 CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks CoT-Self-Instruct: Aufbau hochwertiger synthetischer Aufforderungen zur Begründung und zu nicht-vernünftigen Aufgaben COT-自学教学:为推理和非理由性任务建立高质量的合成提示 2507.23751v1 -
04 07-31 Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs Regel2Text: Natürliche Sprache Erklärung der logischen Regeln in Wissensgraphen 规则2案文:知识图中逻辑规则的自然语言解释 2507.23740v1 -
05 07-31 How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment Wie KI-Ideen die Kreativität, Vielfalt und Evolution menschlicher Ideen beeinflussen: Beweise aus einem großen, dynamischen Experiment AI Ideas如何影响人类思想的创造性、多样性和演变:大规模动态实验的证据 2401.13481v3 -
06 07-31 Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving Seed-Prover: Tiefe und breite Begründung für automatisierte Theorem Proving 种子文献:用于自动理论论证的深度和广度理由 2507.23726v1 -
07 07-31 RecGPT Technical Report Technischer Bericht des RecGPT RecGPT 技术报告 2507.22879v2 -
08 07-31 Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length Nicht zu vergessen: Proaktive Interferenz offenbart Arbeitsspeichergrenzen in LLMs jenseits der Kontextlänge 无法忘却: 事外长长的LLMM 中主动干扰流出工作内存限制 2506.08184v3 -
09 07-31 TextQuests: How Good are LLMs at Text-Based Video Games? TextQuests: Wie gut sind LLMs bei textbasierten Videospielen? 文本Quests: 文本视频游戏的LLMs效果如何? 2507.23701v1 -
10 07-31 TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses TweakLLM: Eine Routing-Architektur für dynamisches Tailoring von Cached Responses TweakLLLM: 快速快速定制快速响应的运行结构 2507.23674v1 -
11 07-31 Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning Arabische Hass-Spracherkennung und Maskenbildung in sozialen Medien mit Deep-Learning-Modellen und vortrainierten Modellen Feinabstimmung 利用深学习模式和预培训模式进行微调,在社会媒体中识别和遮掩阿拉伯仇恨言论 2507.23661v1 -
12 07-31 DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码 2507.08606v3 -
13 07-31 Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation Wer ist wichtig? – SUnSET: Synergistisches Verständnis von Stakeholdern, Ereignissen und Zeit für die Timeline Generation 谁重要? - SUNSET:对利益攸关方、事件和时间的协同理解,以产生时间表。 2507.21903v2 -
14 07-31 How Can I Publish My LLM Benchmark Without Giving the True Answers Away? Wie kann ich meinen LLM-Benchmark veröffentlichen, ohne die wahren Antworten wegzugeben? 我怎样才能公布我的LLM基准而不给出正确的答案? 2505.18102v2 -
15 07-31 Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation Splits! Ein flexibler Datensatz und Evaluationsrahmen für die soziokulturelle Linguistische Untersuchung 社会文化语言调查灵活数据集和评价框架 2504.04640v2 -
16 07-31 ILID: Native Script Language Identification for Indian Languages ILID: Native Script Language Identification für indische Sprachen ILID:印第安人语言的土著脚本语言识别 2507.11832v2 -
17 07-31 Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Assessments 具有不确定性估计值的临床试验的深入学习预测 2507.23607v1 -
18 07-31 Inside-Out: Hidden Factual Knowledge in LLMs Inside-Out: Verstecktes Sachwissen in LLMs 内外:LLM中隐藏的事实知识 2503.15299v3 -
19 07-31 DiffLoRA: Differential Low-Rank Adapters for Large Language Models DiffLoRA: Differential-Low-Rank-Adapter für große Sprachmodelle DiffLORA:用于大语言模型的差别型低兰克适应器 2507.23588v1 -
20 07-31 T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text T-Detect: Tail-Aware Statistische Normalisierung zur robusten Erkennung von maschinengeneriertem Text T-检测:用于对反转机制文本进行强力探测的尾件软件统计标准化 2507.23577v1 -
21 07-31 Neutral Residues: Revisiting Adapters for Model Extension Neutrale Rückstände: Adapter zur Modellerweiterung 中立残留物:重新审视适应器,用于示范推广 2410.02744v3 -
22 07-31 Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation LLMs能否协助其模糊性? 量化评估关于 “ Word Sense Disanderation “ 的各种大语言模型。 2411.18337v4 -
23 07-31 Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning Med-R$^3$: Verbesserung der medizinischen Retrieval-Augmented Reasoning von LLMs durch Progressive Verstärkung Lernen 3美元Med-R$3美元:通过逐步加强学习加强医疗取回-增加LLMs的理据 2507.23541v1 -
24 07-31 PurpCode: Reasoning for Safer Code Generation PurpCode: Begründung für eine sicherere Code-Generierung PurpCode:更安全代码生成的理由 2507.19060v2 -
25 07-31 MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks MECAT: Ein Multi-Experten-Benchmark für feinkörnige Audio-Verstandsaufgaben MECAT: 完善的音频理解任务多专家基准 2507.23511v1 -
26 07-31 LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning LLaVA-MORE: Eine vergleichende Studie von LLMs und visuellen Backbones für verbesserte visuelle Instruktions-Tuning LLAVA-MORE:用于强化视觉教学的LLM和视觉背骨比较研究 2503.15621v2 -
27 07-31 A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains Ein neuartiger Bewertungs-Benchmark für medizinische LLMs: Beleuchtende Sicherheit und Wirksamkeit in klinischen Bereichen 医疗LLMs新颖的评价基准:临床域的引明安全和有效性 2507.23486v1 -
28 07-31 Role-Aware Language Models for Secure and Contextualized Access Control in Organizations Role-Aware Sprachmodelle für sichere und kontextualisierte Zugriffskontrolle in Organisationen 各组织内安全和环境化出入控制使用控制实用语言模式 2507.23465v1 -
29 07-31 Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems Counterfactual Evaluation für Blindangriffserkennung in LLM-basierten Evaluationssystemen 以LLM为基础的评价系统中盲人攻击探测的反事实评价 2507.23453v1 -
30 07-31 EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework BildungQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen 教育Q:通过多机构对话框架评价LLMS的教学能力 2504.14928v3 -
31 07-31 The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen 机器的实用思维:追踪大语言模式中实用能力的出现 2505.18497v2 -
32 07-31 Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration Über passives kritisches Denken hinaus: Förderung proaktiver Befragungen zur Verbesserung der Mensch-KI-Kollaboration 超越被动的批判性思考:促进积极主动的提问,以加强人类与大赦国际的协作 2507.23407v1 -
33 07-31 RAVine: Reality-Aligned Evaluation for Agentic Search RAVine: Realitätsorientierte Bewertung für die Agentische Suche RAVine: 化学搜索的现实统一评价 2507.16725v2 -
34 07-31 Enhanced Arabic Text Retrieval with Attentive Relevance Scoring Verbesserte arabische Text-Retrieval mit aufmerksamer Relevanz Scoring 阿拉伯强化文本检索, 带有启动相关性显示器 2507.23404v1 -
35 07-31 MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization MRGSEM-Sum: Ein unbeaufsichtigtes Multi-Dokument Zusammenfassungsrahmen basierend auf Multi-Relational Graphen und struktureller Entropie Minimierung MRGSEM-Sum:基于多关系图和结构元件最小化的无人监督的多文件概括框架 2507.23400v1 -
36 07-31 Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators Beyond the Cloud: Bewertung der Vorteile und Nachteile lokaler LLM-Einsatzmöglichkeiten für Übersetzer 云云之外:评估为笔译员部署当地LLM的利弊 2507.23399v1 -
37 07-31 Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models Causal2Vec: Verbessere Dekoder-nur LLMs als vielseitige Einbettungsmodelle Causal2Vec:改进只有解码器的LLMs作为Versatile嵌入模型 2507.23386v1 -
38 07-31 MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models MPCC: Ein neuartiger Benchmark für multimodale Planung mit komplexen Einschränkungen in multimodalen großen Sprachmodellen MPCC:具有多种多语言模式复杂限制的多式联运规划新基准 2507.23382v1 -
39 07-31 Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen 所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架 2506.07106v2 -
40 07-31 WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch WirdSpeech-Bench:为自然演讲对话中的音频LMs设定基准 2506.21875v2 -
41 07-31 Holistic Evaluations of Topic Models Ganzheitliche Bewertungen von Themenmodellen 专题模式整体评价 2507.23364v1 -
42 07-31 Robust and Fine-Grained Detection of AI Generated Texts Robuste und feinkörnige Erkennung von KI-generierten Texten 对 AI 生成文本的强力和精细探测 2504.11952v3 -
43 07-31 SWE-Exp: Experience-Driven Software Issue Resolution SWE-Exp: Erfahrungsgetriebene Software-Ausgabeauflösung SWE-Expl:经验丰富的软件问题决议 2507.23361v1 -
44 07-31 VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning VL-Cogito: Progressives Curriculum-Verstärkungslernen für fortgeschrittene multimodale Vernunft VL-Cocito:先进多式联运理由的渐进课程强化学习 2507.22607v2 -
45 07-31 Text-to-SQL Task-oriented Dialogue Ontology Construction Text-zu-SQL Aufgabenorientierter Dialog Ontologie Konstruktion 以任务为导向的对话肿瘤构建 2507.23358v1 -
46 07-31 KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v2 -
47 07-31 SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution SWE-Debatte: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen SWE-Debate:解决软件问题竞争性多机构辩论 2507.23348v1 -
48 07-31 Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance Mehrsprachige Fähigkeiten mit kulturellem und lokalem Wissen in großen Sprachmodellen verbessern und gleichzeitig die Leistungsfähigkeit der Ureinwohner verbessern 提高多语言多语言能力,在提高土著绩效的同时,利用大语言模式的文化和地方知识,同时提高土著绩效 2504.09753v3 -
49 07-31 DSBC : Data Science task Benchmarking with Context engineering DSBC : Data Science-Aufgabe Benchmarking mit Kontext-Engineering DSBC: 数据科学任务与背景工程基准 2507.23336v1 -
50 07-31 MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation MUST-RAG: MUSical Text Question Beantwortung mit retrieval Augmented Generation MOST-RAG: 以回取增加的一代人回答的中文本问题 2507.23334v1 -
51 07-31 Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette Kulturelle Palette: Pluralisierung der Kulturausrichtung über Multi-Agenten-Palette 文化调色板:通过多试剂调色板实现多元化文化协调 2412.11167v3 -
52 07-31 FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain FinGAIA: Ein chinesischer Benchmark für KI-Agenten in der Real-World Financial Domain 金融界:中国真实世界金融领域AI代理商基准 2507.17186v2 -
53 07-31 Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages Multi-Hypothese Destillation von mehrsprachigen Neuralübersetzungsmodellen für ressourcenarme Sprachen 多语言低资源语言多语言神经翻译模型的蒸馏 2507.21568v2 -
54 07-31 LLMs and the Human Condition LLMs und der menschliche Zustand LLM和人类条件 2402.08403v6 -
55 07-31 What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content Was ist Taboo für Sie? - Eine empirische Bewertung von LLMs Verhalten für Sensitive Inhalte - 对行为举止为敏感内容的LLMS的 经验评估 2507.23319v1 -
56 07-31 LiMe: a Latin Corpus of Late Medieval Criminal Sentences LiMe: ein lateinischer Corpus der spätmittelalterlichen Strafurteile Lime:拉丁美洲中世纪晚期刑事判决区 2404.12829v2 -
57 07-31 SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy SequenzLayer: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht 序列激光器:序列处理和串联神经网络变得容易 2507.23292v1 -
58 07-31 Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik 注重理由、适应性、效率和道德操守的LLMs项目的进展 2506.12365v2 -
59 07-31 Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability Iterative Reparatur mit schwachen Verifierern für wenige Aufnahmen in KBQA mit Unbeantwortbarkeit KBQA 中无法解答的微小投射点校验器的迭代性修补 2406.14313v3 -
60 07-31 AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图 2505.23628v2 -
61 07-31 Unveiling Super Experts in Mixture-of-Experts Large Language Models Enthüllen Super-Experten in Mixture-of-Experts große Sprachmodelle 混合专家大语言模型中不懈的超级专家 2507.23279v1 -
62 07-31 AI-Reporter: A Path to a New Genre of Scientific Communication AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation AI-记者:通向科学通信新一流的道路 2507.05903v2 -
63 07-31 Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis Bewertung der Mehrsprachigkeitsfähigkeiten von LLMs für Bengalen: Benchmark-Erstellung und Leistungsanalyse 评价孟加拉多种语文能力:基准设定和业绩分析 2507.23248v1 -
64 07-31 P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication P-ReMIS: Pragmatische Vernunft in der psychischen Gesundheit und einer sozialen Implikation P-REMIS: 心理健康和社会影响方面的实用原因 2507.23247v1 -
65 07-31 Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents Generalisiertes Verstärkungslernen für retriever-spezifische Abfrage-Rewriter mit unstrukturierten Real-World-Dokumenten 利用无结构的 “ 现实世界文件 “ 检索特定查询卷卷的通用强化学习 2507.23242v1 -
66 07-31 Cutting Through the Noise: Boosting LLM Performance on Math Word Problems Schneiden durch den Lärm: Steigerung der LLM-Performance bei Math Word-Problemen 通过噪音剪切:促进数学字问题LLM的LLM性能 2406.15444v4 -
67 07-31 Framing Political Bias in Multilingual LLMs Across Pakistani Languages Framing politische Bias in mehrsprachigen LLMs in pakistanischen Sprachen 以多语种LLMs多种巴基斯坦语言界定政治偏见 2506.00068v2 -
68 07-31 AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents AgentSpec: Anpassbare Runtime Enforcement für sichere und zuverlässige LLM-Agenten 安全可靠LLM代理商的可定制运行时间执法 2503.18666v3 -
69 07-31 Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs Ermöglichung der weniger scharfen Alzheimer-Krankheit Diagnose auf Tabular Biomarker Daten mit LLMs 使小热阿尔茨海默氏病的疾病诊断能够用LMS在表示生物标记数据上进行 2507.23227v1 -
70 07-31 Unveiling the Influence of Amplifying Language-Specific Neurons Enthüllen des Einflusses amplifizierender sprachspezifischer Neuronen 消除扩增语言特有新元的影响 2507.22581v2 -
71 07-31 LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models LLM-Crowdsourced: Ein Benchmark-freies Paradigma zur gegenseitigen Bewertung großer Sprachmodelle LLM-文献来源:用于对大语言模式进行相互评价的无基准建模 2507.22359v2 -
72 07-31 Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders Model Directions, keine Worte: Mechanistische Themenmodelle mit Sparse Autoencodern 模型方向,非单词:使用粗态自动编码器的机械专题模型 2507.23220v1 -
73 07-31 Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen 大语言模式中的文化偏见:通过道德问卷评估AI代理 2507.10073v2 -
74 07-31 Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples Fehler sind die Steinschritte zum Erfolg: Erweitern Sie das wenige-heiße In-Context-Lernen durch die Nutzung negativer Muster 失败是走向成功的一步步石:通过利用负面样本加强少许热的文体学习 2507.23211v1 -
75 07-31 InfAlign: Inference-aware language model alignment InfAlign: Inference-aware Sprachmodellausrichtung Infagign: 参考意识语言模型对齐 2412.19792v4 -
76 07-31 Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen 努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查 2505.14874v4 -
77 07-31 Explaining vague language Unbestimmte Sprache erklären 解释含糊措辞 2404.18154v2 -
78 07-31 Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks Geak: Einführung von Triton Kernel AI Agent & Evaluation Benchmarks Geak:介绍Triton Kernel AI 代理和评估基准 2507.23194v1 -
79 07-31 EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts EgoOops: Ein Datensatz zur Erkennung von Fehlern aus egozentrischen Videos, die sich auf Verfahrenstexte beziehen EgoOops: 用于从 Egocentic 视频中检测错误动作的数据集, 指程序文字 2410.05343v3 -
80 07-31 Leveraging LLMs to Create Content Corpora for Niche Domains LLMs nutzen, um Content Corpora für Niche Domains zu erstellen 利用LMLM 来为新域创建内容公司 2505.02851v2 -
81 07-31 LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration LENS: Lerne Ensemble Vertrauen aus neuralen Staaten für Multi-LLM-Antwortintegration LENS:从神经国家学习多LLM应答整合的集合信任 2507.23167v1 -
82 07-31 Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation Vision-Language-Modelle sind in Bezug auf Expression Generation nicht pragmatisch kompetent 视觉-语言模型在代言表达式生成中不具备实用能力 2504.16060v3 -
83 07-30 (3) User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal User Feedback in Human-LLM Dialogen: Ein Objektiv, um die Nutzer zu verstehen, aber laut als Lernsignal 人类- LLLM 对话框中的用户反馈: 了解用户的镜头, 但将吵闹当作学习信号 2507.23158v1 -
84 07-30 Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer Kann eine Größe für alle passen?: Messfehler in Multi-Document-Zusammenfassung Domain-Transfer 能够一刀切吗? :在多文件概括性文件转让中衡量失败 2503.15768v2 -
85 07-30 ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans ISO-Bench: Benchmarking multimodaler Kausalität in visuellen Sprachmodellen durch verfahrenstechnische Pläne ISO-Bench:通过程序计划确定视觉语言模型中多式因果关系基准 2507.23135v1 -
86 07-30 Meta CLIP 2: A Worldwide Scaling Recipe Meta CLIP 2: Ein weltweites Scaling-Rezept Meta CLIP 2: 全球规模扩大食谱 2507.22062v2 -
87 07-30 Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity Enthüllen der Fragilität von vertrauenswürdigen LLMs durch chinesische Text-Ambiguität 通过中文文字缩略图,揭开可信赖的LLM 易用性 2507.23121v1 -
88 07-30 RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL RASL: 大规模数据库文本到 SQL 的检索增强的相连接表表 2507.23104v1 -
89 07-30 SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity SMART-Editor: Multi-Agenten-Framework für menschenähnliche Designbearbeitung mit struktureller Integrität SMART-编辑:具有结构完整性的多机构设计设计框架 2507.23095v1 -
90 07-30 Context-aware Rotary Position Embedding Context-aware Rotary Position Einbetten 扶轮位置嵌入式 2507.23083v1 -
91 07-30 Exploring In-Context Learning for Frame-Semantic Parsing In-Context-Lernen für rahmensemantisches Parsing erforschen 探索用于框架语义分析的内文学习 2507.23082v1 -
92 07-30 Math Natural Language Inference: this should be easy! Math Natural Language Inferenz: das sollte einfach sein! Math自然语言推论:这应该很容易! 2507.23063v1 -
93 07-30 Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion Conan:一个零热适应性语音转换的中远在线网络 2507.14534v3 -
94 07-30 Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Moslems in Large Language Models: A Systematic Review 减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查 2506.18199v2 -
95 07-30 Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning Wo man Demos in Deinem Prompt zeigt: Ein positionelles Bias des In-Context-Lernens 在哪里显示您快速的演示 : 内容学习的定位偏见 2507.22887v1 -
96 07-30 C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations C3: Ein zweisprachiger Benchmark für gesprochene Dialogmodelle zur Erforschung von Herausforderungen in komplexen Gesprächen C3:探讨复杂对话挑战的口头对话模式的双语基准 2507.22968v1 -
97 07-30 GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis GeoOutageKG: Ein multimodaler Geospatiotemporaler Wissensgraph für die Multiauflösungsanalyse von Stromausfällen GeoouteageKG:多分辨率电源外向分析多式地球观测时知识图 2507.22878v1 -
98 07-30 FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models FRED: Finanzielle Retrieval-erweiterte Erkennung und Bearbeitung von Halluzinationen in Sprachmodellen FRED: 财务检索-加强发现和编辑语言模型中的幻觉 2507.20930v2 -
99 07-30 Past Meets Present: Creating Historical Analogy with Large Language Models Vergangenheit trifft Gegenwart: Historische Analogie mit großen Sprachmodellen erstellen 过去曾出席的会议:创建具有大语言模式的历史分析 2409.14820v2 -
100 07-30 The Incomplete Bridge: How AI Research (Mis)Engages with Psychology Die unvollendete Brücke: Wie KI-Forschung (Mis) mit Psychologie verstrickt 不完整的桥梁:人工智能如何研究(Miss)心理学的组合 2507.22847v1 -
101 07-30 ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer ReverBERT: Ein State Space Model für eine effiziente textgesteuerte Sprachübertragung ReverBERT: 高效发短信语音风格转让国家空间模型 2503.20992v2 -
102 07-30 Cross-Modal State-Space Graph Reasoning for Structured Summarization Grenzüberschreitende State-Space-Graph-Gründung für strukturierte Zusammenfassung 结构归纳的跨模式国家空间图 2503.20988v2 -
103 07-30 Scaling RL to Long Videos Skalierung von RL zu langen Videos 缩放 RL 到长视频 2507.07966v3 -
104 07-30 MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen MiniLongBunench:大语言模式低成本长方背景理解基准 2505.19959v2 -
105 07-30 Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization Jenseits natürlicher Sprachpläne: Struktur-Bewusst-Planung für Abfrage-fokussierte Tabellenzusammenfassung 超越自然语言计划: 查询用户使用表的结构-软件规划 2507.22829v1 -
106 07-30 SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs 空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v3 -
107 07-30 DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph DBLPLink 2.0 – Ein Entity Linker für den DBLP-Wissenschaftsgraphen DBLPLink 2.0 - DBLPLP 学术知识图的实体链接 2507.22811v1 -
108 07-30 IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation IterKey: Iterative Keyword Generation mit LLMs für verbesserte retrieval Augmented Generation IterKey: 循环关键字生成,并配有 “ 增强再获取能力增量一代 “ 的LMML 2505.08450v2 -
109 07-30 Towards the Law of Capacity Gap in Distilling Language Models Auf dem Weg zum Gesetz der Kapazitä tigkeitslücke bei der Destillierung von Sprachmodellen 迈向《语文模式再学习能力差距法》 2311.07052v4 -
110 07-30 MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations MFTCXplain: Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärungen MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集 2506.19073v2 -
111 07-30 DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router DeepSieve: Informationen über LLM-as-a-Knowledge-Router 深筛选:通过LLM-as-a- knowledge-Router获取信息 2507.22050v2 -
112 07-30 GATEAU: Selecting Influential Samples for Long Context Alignment GATEAU: Auswahl von einflussreichen Proben für lange Kontextausrichtung GATEAU:为长期对齐选择有影响的样本 2410.15633v6 -
113 07-30 MASCA: LLM based-Multi Agents System for Credit Assessment MASCA: LLM-basiertes Multi-Agenten-System zur Bonitätsbeurteilung MASCA: 以LLM为基础的信用评估多边代理系统 2507.22758v1 -
114 07-30 Opportunities and Challenges of LLMs in Education: An NLP Perspective Chancen und Herausforderungen von LLM im Bildungswesen: Eine NLP-Perspektive 教育中法学硕士的机遇和挑战:国家学习方案展望 2507.22753v1 -
115 07-30 CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset CUS-QA:以本地知识为主的不限成员名额问题解答数据集 2507.22752v1 -
116 07-30 Next Tokens Denoising for Speech Synthesis Nächste Tokens Denoising für Sprachsynthese 下一集 Tokens 代言人演讲综述 2507.22746v1 -
117 07-30 Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index Verringerung der Halluzinationen in der Zusammenfassung durch Verstärkungslernen mit Entity Halluzination Index 利用实体幻觉指数,通过强化学习减少在总结中的幻觉 2507.22744v1 -
118 07-30 Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung 标定验证符:评估编码和理由的合成核查 2502.13820v3 -
119 07-30 Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning Ressourceneffiziente Anpassung großer Sprachmodelle für Text-Embeddings über Prompt Engineering und Contrastive Fine-Tuning 通过即时工程和反竞争微调对文本嵌入大语言模型进行资源高效率的改编 2507.22729v1 -
120 07-30 Investigating Hallucination in Conversations for Low Resource Languages Untersuchung von Halluzinationen in Gesprächen über Sprachen mit geringem Ressourcenreichtum 低资源语言对话中的幻觉 2507.22720v1 -
121 07-30 Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining Erhöhung der Ultra-Low-Bit-Quantisierung großer Sprachmodelle durch Saliency-Aware Partial Retraining 通过提高质量-软件部分再培训,加强大语言模型的超低比小量量化 2504.13932v3 -
122 07-30 From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs Von der Fähigkeit zur Reflexion: Stärkungsorientiertes Denken Qualität in retrieval-augmented Begründung für LLMs 从充足到反思:LLMs在追偿和增加理由方面的强化引导思考质量 2507.22716v1 -
123 07-30 UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis UI-E2I-Synth: Weiterentwicklung der GUI-Grundierung mit großformatiger Instruktionssynthese UI-E2I-Synth:以大型教学合成为基础推进图形界面 2504.11257v4 -
124 07-30 Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations Raumsprache Likelihood Grounding Network für Bayesian Fusion von Mensch-Roboter-Beobachtungen Bayesian人类-机器人观测融合空间语言定位网络 2507.19947v2 -
125 07-30 Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment Hören auf das Unausgesprochene: Erforschen von 365 Aspekten der multimodalen Interview-Performance Assessment 聆听无语者:探索多模式访谈业绩评估的365方面 2507.22676v1 -
126 07-30 What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization Wovon reden sie? Ein Benchmark der wissensgeprägten Diskussionszusammenfassung 他们在谈论什么?知识类讨论总结的基准 2505.12474v2 -
127 07-30 Instruction-tuned Large Language Models for Machine Translation in the Medical Domain Instruktionsorientierte große Sprachmodelle für die maschinelle Übersetzung im medizinischen Bereich 医疗领域机器翻译大语言模型 2408.16440v2 -
128 07-30 QE4PE: Word-level Quality Estimation for Human Post-Editing QE4PE: Qualitätsschätzung auf Word-Ebene für die menschliche Nachbearbeitung QE4PE: 计算后人类的字级质量估算 2503.03044v2 -
129 07-30 Multilingual Political Views of Large Language Models: Identification and Steering Mehrsprachige politische Ansichten von großen Sprachmodellen: Identifikation und Steuerung 大语言模式多语言多语言政治观点:识别和指导 2507.22623v1 -
130 07-30 Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation Sprache Arithmetik: Auf dem Weg zur systemischen Sprache Neuronenidentifikation und Manipulation 语言解貌学:迈向系统语言中中子识别和操纵 2507.22608v1 -
131 07-30 UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding UI-AGILE: Verbesserung von GUI-Agenten mit effektivem Verstärkungslernen und präziser Schlussfolgerungs-Zeiterdung UI-AGILE: 提高具有有效强化学习和精确推断时间定位的图形代理器 2507.22025v2 -
132 07-30 BALSAM: A Platform for Benchmarking Arabic Large Language Models BALSAM: Eine Plattform für Benchmarking arabischer Großsprachenmodelle BALSAM:阿拉伯语大语言模式基准制定平台 2507.22603v1 -
133 07-30 Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation Lernen, rationale Beweise durch Verstärkungslernen für die retrieval-angereicherte Generation zu extrahieren 学习如何通过为回收-提款一代人加强学习来提取合理证据 2507.15586v4 -
134 07-30 Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions Die Frontier of Vision-Language Models erkunden: Eine Übersicht aktueller Methoden und Zukunftsrichtungen 探索远景-语言模型的前沿:对当前方法和未来方向的调查 2404.07214v3 -
135 07-30 Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck Effizientes kontinuierliches Lernen für kleine Sprachmodelle mit einem diskreten Schlüsselwert-Bottleneck 高效持续学习具有分立键- Value 瓶颈的小语言模式 2412.08528v2 -
136 07-30 Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning Effizientes Differentielles Privates Feintuning von LLMs durch Verstärkungslernen 通过强化学习对LLMs 进行有区别的私人高效率私人罚款 2507.22565v1 -
137 07-30 Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs Nutzung synergistischer Kognitiv-Biasen zur Umgehung der Sicherheit in LLMs 利用协同协同一致的双星体在LLM中用于绕过安全 2507.22564v1 -
138 07-30 Rationale-guided Prompting for Knowledge-based Visual Question Answering Rationale-geführte Aufforderung zur wissensbasierten visuellen Fragebeantwortung 以知识为基础的视觉问题解答 2412.16936v2 -
139 07-30 Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection 共同-DWG:多模式进攻性攻击物质探测联合加速维维维-韦兹交织和专家混合 2505.19010v2 -
140 07-30 ControlMed: Adding Reasoning Control to Medical Language Model ControlMed: Reasoning Control in das medizinische Sprachmodell aufnehmen 控制Med:在医疗语文模式中增加理由控制 2507.22545v1 -
141 07-30 Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law Vortrainierte Modelle führen das Beste aus, wenn Token-Distributionen Zipfs Gesetz folgen 事先培训的模型按照Zipf法在配制时最佳表现 2507.22543v1 -
142 07-30 A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support Benchmark Dataset und Evaluation Framework für vietnamesische Großsprachenmodelle im Kundensupport 越南客户支助大语言模式基准数据集和评价框架 2507.22542v1 -
143 07-30 Training language models to be warm and empathetic makes them less reliable and more sycophantic Training Sprachmodelle warm und einfühlsam zu sein macht sie weniger zuverlässig und sykophantischer 培训语言模式,使其温暖和同情,使其不那么可靠,更具有共生性 2507.21919v2 -
144 07-30 CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records CliCARE: Grounding Large Language Models in klinischen Richtlinien zur Entscheidungsunterstützung über Longitudinal Cancer Electronic Health Records CliCARE:在纵向癌症电子健康记录决策支持临床指南中以大语言模式为基础 2507.22533v1 -
145 07-30 Yankari: A Monolingual Yoruba Dataset Yankari: Einsprachiger Yoruba-Datensatz Yankari:单语Yoruba数据集 2412.03334v2 -
146 07-30 Probing Information Distribution in Transformer Architectures through Entropy Analysis Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse 通过 Entropy 分析在变形结构中进行测试信息发布 2507.15347v2 -
147 07-30 SLM-SQL: An Exploration of Small Language Models for Text-to-SQL SLM-SQL: Eine Erforschung kleiner Sprachmodelle für Text-zu-SQL SMS-SQL:探索文字到SQL的小型语言模型 2507.22478v1 -
148 07-30 Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR Dynamische Parameter für vietnamesische geschlechtsunabhängige ASR erkunden 探索越南性别独立ASR的动态参数 2507.22964v1 -
149 07-30 Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears Stimmen freiberuflicher Schriftsteller über KI: Einschränkungen, Erwartungen und Ängste 自由职业作家对大赦国际的呼声:限制、期望和恐惧 2504.05008v2 -
150 07-30 IFEvalCode: Controlled Code Generation IFEvalCode: Kontrollierte Code-Generierung IFEvalCode:受控制的代码生成 2507.22462v1 -
151 07-30 FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training FineMedLM-o1: Verbesserung des medizinischen Wissens, das die Fähigkeit von LLM vom überwachten Feintuning bis zum Test-Time Training begründet FineMedLM-o1:提高LLM从监督的精密教学到试验时间培训的医疗知识能力 2501.09213v3 -
152 07-30 What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models Was ist ein “Abstract Reasoner”? Experimenten und Argumenten über große Sprachmodelle nachzuvollziehen 什么是“抽象理由” ? 关于大语言模型的重新审视实验和争论 2507.22457v1 -
153 07-30 Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance Falcon-H1: Eine Familie hybrider Sprachmodelle zur Neudefinition von Effizienz und Leistung Falcon-H1:调整效率和绩效的混合语言模式家庭 2507.22448v1 -
154 07-30 AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini KI-generierte Geschichten begünstigen Stabilität gegenüber Veränderung: Homogenität und kulturelle Stereotypisierung in Erzählungen, die von gpt-4o-mini erzeugt werden AI产生的故事有利于稳定而不是变化:在gpt-4o-mini产生的叙事中,同质性和文化陈规定型 2507.22445v1 -
155 07-30 BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition BERSting at the Screams: Ein Maßstab für distanzierte, emotionale und erschrockene Spracherkennung 尖叫时发出尖叫声:远程、情感和呼喊语音识别基准 2505.00059v2 -
156 07-30 Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation Mmm whatcha sagen? Enthüllen distale und proximale Kontexteffekte in der ersten und zweiten Sprache Wort Wahrnehmung mit psychophysischen umgekehrten Korrelation 使用心理物理反向关系,在第一和第二语言的词感中产生未发现和预期的环境效应 2406.05515v2 -
157 07-30 NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models NeedleChain: Messung der Intact-Langkontext-Begründungsfähigkeit großer Sprachmodelle Nenelechain:计量大语言模型的精密长文理由 2507.22411v1 -
158 07-30 Question Generation for Assessing Early Literacy Reading Comprehension Fragegenerierung für die Bewertung des frühen Leseverständnisses 评估早期阅读读写能力读写能力的提问一代 2507.22410v1 -
159 07-30 R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs R2-KG:知识图表可靠理由通用双重目的机构框架 2502.12767v6 -
160 07-30 Reservoir Computing as a Language Model Reservoir Computing als Sprachmodell 作为语言模式的 “ 储量计算 “ 模式 2507.15779v2 -
161 07-30 PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs PATENTWRITER: Eine Benchmarking-Studie für die Patenterstellung mit LLMs PATENTWRITER: 专利起草基准研究与LLMs 2507.22387v1 -
162 07-30 OWLViz: An Open-World Benchmark for Visual Question Answering OWLViz: Ein Open-World-Benchmark für visuelle Fragen OWLViz:视觉问答的开放世界基准 2503.07631v3 -
163 07-30 Multimodal LLMs as Customized Reward Models for Text-to-Image Generation Multimodale LLMs als maßgeschneiderte Reward-Modelle für die Text-zu-Image-Generierung 以多式多式LLMs作为生成文字到图像的自定制奖励模型 2507.21391v2 -
164 07-30 BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity 块块FFN: 向具有整块级激活分级的 终端- 双极加速- 友好混合混合专家方向 2507.08771v2 -
165 07-30 Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors Traits Run Deep: Verbesserung der Persönlichkeitsbeurteilung durch psychologisch geführte LLM-Darstellungen und multimodale Scheinverhalten 深层轨迹:通过心理学辅导LLM代表和多模式亲善行为,加强个性评估 2507.22367v1 -
166 07-30 Masked Language Models are Good Heterogeneous Graph Generalizers Masked Language Models sind gute Heterogene Graph Generalizers 遮罩语言模型是好异基因图形缩略图 2506.06157v2 -
167 07-30 Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung 利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题 2505.21354v2 -
168 07-30 MuSciClaims: Multimodal Scientific Claim Verification MuSciClaims: Multimodale wissenschaftliche Antragsprüfung 穆西索赔: 多式联运科学索赔核实 2506.04585v2 -
169 07-30 A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers Eine umfassende Taxonomie der Negation für NLP und Neuralretriever NLP和神经再研究综合清点分类 2507.22337v1 -
170 07-30 Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing Prompt-Reverse Inkonsistenz: LLM Selbstinkonsistenz jenseits generativer Zufälligkeit und prompt Paraphrasierung 迅速反向不一致:LLM 自我不连贯,超越发生性随机和迅速划线 2504.01282v2 -
171 07-30 Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges Natürliche Sprachverarbeitung für den Rechtsbereich: Eine Übersicht über Aufgaben, Datensätze, Modelle und Herausforderungen 法律领域自然语言处理:任务、数据集、模型和挑战调查 2410.21306v3 -
172 07-29 (2) Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations Intent Recognition und Out-of-Scope-Erkennung mit LLMs in Multi-Party-Konversationen 在多方对话中使用LLMs 2507.22289v1 -
173 07-29 Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs Bedeutungsverstärkte Grammatik: Gradient Akzeptabilität formt die geometrischen Darstellungen von Konstruktionen in LLMs 含义内含语法:逐渐可接受性形状 LLM 中工程的几何表示法 2507.22286v1 -
174 07-29 CoEx – Co-evolving World-model and Exploration CoEx – Co-evolving World-Modell und Exploration CoEx – – 共同发展的世界模式和探索 2507.22281v1 -
175 07-29 Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v3 -
176 07-29 Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering Denoising Concept Vectors mit Sparse Autoencodern für verbesserte Sprachmodellsteuerung 用于改进语言模式指导的与斯普鲁斯自动编码器一起的失言概念矢量 2505.15038v2 -
177 07-29 Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs Modeling Story Erwartungen, Engagement zu verstehen: Ein generatives Framework mit LLMs 模拟对理解参与的理论期望:利用LLMM的生成框架 2412.15239v3 -
178 07-29 ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling EKG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung ECG-Byte: 终端到 En-En Energy 电动心电图语言建模调控器 2412.14373v3 -
179 07-29 GneissWeb: Preparing High Quality Data for LLMs at Scale GneissWeb: Hochqualitative Daten für LLMs im Maßstab vorbereiten GneissWeb: 为缩放 LLMs 准备高品质数据 2502.14907v2 -
180 07-29 LLM-as-a-qualitative-judge: automating error analysis in natural language generation LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen LLM-as-as-法官法官:在自然语言生成中进行自动误差分析 2506.09147v2 -
181 07-29 RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation RL von Lehrer-Modell-Verfeinerung: Graduale Imitation Lernen für maschinelle Übersetzung 教师-模式改进:机器翻译逐步模拟学习 2507.22219v1 -
182 07-29 Can adversarial attacks by large language models be attributed? Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden? 大型语言模式的对抗性攻击能否归结为对抗性攻击? 2411.08003v3 -
183 07-29 How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor? Wie gut ist die Erst-Token-Entropie ungefähre Wort-Entropie als psycholinguistischer Vorhersager? 作为心理语言学预测者,第一到真真真真真真真假 近似单字真真真假如何? 2507.22209v1 -
184 07-29 The role of media memorability in facilitating startups’ access to venture capital funding Die Rolle der Medienerinnerung bei der Erleichterung des Zugangs von Start-ups zu Risikokapitalfinanzierungen B. 媒体在便利初创企业获得风险资本资金方面的作用 2507.22201v1 -
185 07-29 Basic Reading Distillation Grundlesedestillation 基础阅读蒸馏 2507.19741v2 -
186 07-29 Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence Erklärbarkeit durch Systematik: Die harte Systematik-Herausforderung für künstliche Intelligenz 系统化解释:人工智能的硬系统化挑战 2507.22197v1 -
187 07-29 Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation Déjà Vu: Mehrsprachige LLM-Evaluierung durch die Lens of Machine Translation Evaluation Déjà Vu:通过机器翻译评价的镜头进行多种语文LLM评价 2504.11829v3 -
188 07-29 A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models Eine skalierbare Pipeline zur Schätzung von Verb Frame Frequenzen mit großen Sprachmodellen 使用大语言模型估算 Verb 框架频谱的可缩放管道 2507.22187v1 -
189 07-29 Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training Positiv-Augmented Contrastive Learning für Vision-and-Language Evaluation und Training 愿景和语言评价和培训的积极强化反竞争学习 2410.07336v2 -
190 07-29 Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles Persona-Augmented Benchmarking: Bewertung von LLMs über unterschiedliche Schreibstile hinweg 人 人 推 基准 定 : 评价各种写 作 风格 2507.22168v1 -
191 07-29 Strategic Deflection: Defending LLMs from Logit Manipulation Strategische Durchbiegung: LLMs durch Logit-Manipulation verteidigen 战略抵消:保护LLMs免受逻辑操纵 2507.22160v1 -
192 07-29 IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian IndoPref: Ein multi-Domain-Pairwise-Preference-Datensatz für Indonesisch IndoPref:印度尼西亚多域对等优惠数据集 2507.22159v1 -
193 07-29 The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face? Die Bedeutung von Gesichtsfunktionen bei der visionsbasierten Erkennung von Zeichensprachen: Augen, Mund oder Gesicht? 面貌在基于愿景的手语识别中的重要性:眼、嘴还是脸? 2507.20884v2 -
194 07-29 Prompt Optimization and Evaluation for LLM Automated Red Teaming Prompt Optimierung und Auswertung für LLM Automatisiertes Red Teaming LLM自动红色小组迅速优化和评价 2507.22133v1 -
195 07-29 SAKE: Steering Activations for Knowledge Editing SAKE: Steuerung von Aktivierungen für die Wissensbearbeitung 战略:知识编辑指导活动 2503.01751v2 -
196 07-29 UserBench: An Interactive Gym Environment for User-Centric Agents UserBench: Eine interaktive Gym-Umgebung für User-Centric-Agenten 用户 Bench: 用户中心代理器的交互式 Gym 环境 2507.22034v1 -
197 07-29 FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换 2505.23966v3 -
198 07-29 SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers SAND-Math: LLMs nutzen, um neuartige, schwierige und nützliche Mathematikfragen und -antworten zu generieren SAND-Math:利用LLMs生成新创、困难和有用的数学问答 2507.20527v2 -
199 07-29 Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models Vorhersage mikrobieller Ontologie und Pathogenrisiken durch Umweltmetadaten mit großen Sprachmodellen 预测具有大语言模型的环境元数据产生的微生物本体学和病原体风险和病原体风险 2507.21980v1 -
200 07-29 LIMO: Less is More for Reasoning LIMO: Weniger ist mehr für Vernunft LIMO: 较少的理由更多 2502.03387v3 -
201 07-29 Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation Kulinarische Kreuzungen: Ein RAG-Rahmen zur Verbesserung der Vielfalt in der kulturübergreifenden Rezeptanpassung 烹饪十字路口:加强跨文化适应性适应多样性的RAG框架 2507.21934v1 -
202 07-29 Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory LLM-Autoscoring-Verlässlichkeit in großformatigen Schriftbeurteilungen unter Verwendung von Generalisierbarkeitstheorien erkunden 利用通用理论探索利用通用理论进行大型书写评估时的可靠性 2507.19980v2 -
203 07-29 “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection „Auf wessen Seite bist du?” Schätzung der Ideologie von Politik und Nachrichteninhalten mit großen Sprachmodellen und der Auswahl von Demonstrationsobjekten “你站在谁一边?” 估计政治和新闻内容使用大语言模型和少见的示范选择的意识形态和新闻内容。 2503.20797v2 -
204 07-29 Post-Training Large Language Models via Reinforcement Learning from Self-Feedback Post-Training Große Sprachmodelle durch Stärkung Lernen aus Selbst-Feedback 培训后通过 “ 自我学习 “ 强化学习大语言模式 2507.21931v1 -
205 07-29 CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation CHIMERA: Eine Wissensbasis für wissenschaftliche Ideen-Rekombinationen für Forschungsanalyse und -Ideation CHIMERA: 研究分析和衰变科学理念重组知识库 2505.20779v4 -
206 07-29 Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs Rotes Lernen als nützlich betrachtet: Verallgemeinern über gemerkte Daten in LLMs 认为轮试学习有用:在LLMs中普遍使用记忆数据 2507.21914v1 -
207 07-29 SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs SmoothRot: Kombination von Kanal-Weiss-Skalierung und Rotation für Quantisierungsfreundliche LLMs 平滑旋转: 将频道- Wise 缩放和旋转组合起来, 用于量化- 友好型LLMS 2506.05413v2 -
208 07-29 SLR: Automated Synthesis for Scalable Logical Reasoning SLR: Automatisierte Synthese für skalierbare logische Vernunft SLR: 用于可缩放逻辑理由的自动合成 2506.15787v3 -
209 07-29 Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning Graph-R1: Auf dem Weg zu einem agentischen GraphRAG-Framework durch durchgängiges Ausbau-Lernen 图R1:通过端至端强化学习,迈向 “ 干点至端强化学习 “ 框架 2507.21892v1 -
210 07-29 FrugalRAG: Learning to retrieve and reason for multi-hop QA FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA FrugalRAG:学会检索和多呼QA的理由 2507.07634v2 -
211 07-29 WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking WakenLLM: Bewertung des Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking WakenLLLM:通过细微基准评估LLM公司的合理潜力和稳定性 2507.16199v3 -
212 07-29 FB-RAG: Improving RAG with Forward and Backward Lookup FB-RAG: Verbesserung der RAG durch Vorwärts- und Rückwärtsblick FB-RAG:以前向和后向看改进RAG 2505.17206v2 -
213 07-29 AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning AutoTIR: Autonome Tools Integriertes Reasoning durch Verstärkungslernen AutoTIR:通过强化学习综合解释理由的自主工具 2507.21836v1 -
214 07-29 Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences Einführung von HALC: Eine allgemeine Pipeline für die Suche nach optimalen Promptenstrategien für die automatisierte Codierung mit LLMs in den Computational Social Sciences 介绍HALC:寻找计算社会科学中与LLMs自动编码的最佳加速战略的一般管道 2507.21831v1 -
215 07-29 EEG-CLIP : Learning EEG representations from natural language descriptions EEG-CLIP : Lernen von EEG-Darstellungen aus natürlichen Sprachbeschreibungen EEG-CLIP:从自然语言说明中学习EEG代表 2503.16531v2 -
216 07-29 Modelling Adjectival Modification Effects on Semantic Plausibility Modellierung adjektiver Modifizierungseffekte auf die semantische Plausibilität 模拟弹道改变对语义等高可变性的影响 2507.21828v1 -
217 07-29 HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs HRIPBench: Benchmarking von LLMs bei der Bereitstellung von Informationen zur Schadensreduzierung zur Unterstützung von Drogenkonsumenten HRIPBENCH:在向吸毒者提供支助的减少危害信息提供中确定LLMs基准 2507.21815v1 -
218 07-29 Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish Übersicht über ADoBo bei IberLEF 2025: Automatische Erkennung von Anglizismen auf Spanisch IberLEF 2025年IberLEF ADoBo ADoBo 概览:西班牙文自动检测 2507.21813v1 -
219 07-29 ChartMark: A Structured Grammar for Chart Annotation ChartMark: Eine strukturierte Grammatik für Chart-Annotation 图表 Mark: 用于图表注释的结构性语法 2507.21810v1 -
220 07-29 Task Arithmetic for Language Expansion in Speech Translation Aufgabe Arithmetik für Spracherweiterung in der Sprachübersetzung 语音翻译中语言扩展任务 2409.11274v3 -
221 07-29 The Problem with Safety Classification is not just the Models Das Problem der Sicherheitsklassifizierung sind nicht nur die Modelle 安全分类问题不仅仅是模型 2507.21782v1 -
222 07-29 Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen 能够捕捉不同语言语言的特定语言概念的简单自定义者 2507.11230v2 -
223 07-29 AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models AgriEval: Ein umfassender chinesischer Landwirtschafts-Benchmark für große Sprachmodelle 农业:中国大语言模式农业综合基准 2507.21773v1 -
224 07-29 Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal Adversariale Verteidigung ohne Adversariale Verteidigung: Verbesserung der Sprachmodell Robustheit über Instanz-Ebene Hauptkomponentenentfernung 无反向辩护的反向辩护,无反向辩护:通过一审一级主要组成部分删除,加强语言模式的强力 2507.21750v1 -
225 07-29 Image Captioning via Compact Bidirectional Architecture Bildunterschrift über kompakte bidirektionale Architektur 通过契约双向双向建筑进行图像描述 2201.01984v2 -
226 07-29 My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt Mein Leben in Künstlicher Intelligenz: Menschen, Anekdoten und einige Lektionen gelernt 我在人工智能中的生活:人、流浪者、以及一些经验教训 2504.04142v2 -
227 07-29 Technical Report of TeleChat2, TeleChat2.5 and T1 Technischer Bericht von TeleChat2, TeleChat2.5 und T1 TeleChat2、TeleChat2.5和T1技术报告 2507.18013v3 -
228 07-29 UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases UnsafeChain: Verbesserung der Modellsicherheit über Hard Cases 不安全Chain:通过困难案件加强说明理由的示范安全 2507.21652v1 -
229 07-29 Libra: Assessing and Improving Reward Model by Learning to Think Waage: Bewertung und Verbesserung des Prämienmodells durch Lernen zu denken 利布拉:通过学习思考来评估和改进奖励模式 2507.21645v1 -
230 07-29 Probing then Editing Response Personality of Large Language Models Probing dann Editing Response Persönlichkeit von großen Sprachmodellen 检验后编辑大语言模型的个性反应 2504.10227v2 -
231 07-29 Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search Stratege: Selbstverbesserung der LLM-Entscheidungsfindung über die Bi-Level-Baumsuche 战略:通过双层树木搜索自我改善LLM决策 2408.10635v3 -
232 07-29 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Latent Adversarial Training verbessert Robustheit für persistente schädliche Verhalten in LLMs 长效对长效有害行为培训能提高长效LMM中持久性有害行为的积极性 2407.15549v3 -
233 07-29 Multilingual JobBERT for Cross-Lingual Job Title Matching Mehrsprachiger JobBERT für Cross-Lingual Job Title Matching 跨语言工作职称匹配多语言工作BERT 2507.21609v1 -
234 07-29 Pralekha: Cross-Lingual Document Alignment for Indic Languages Pralekha: Cross-Lingual Document Alignment für indische Sprachen Pralekha:印度语交叉语言文档协调 2411.19096v2 -
235 07-29 A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models Eine detaillierte Faktorenanalyse für den politischen Kompasstest: Navigieren von Ideologien großer Sprachmodelle 《政治指南测试的详细要素分析:掌握大语言模式的特征》 2506.22493v2 -
236 07-29 AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning AIM: Adaptive Schlussfolgerung von Multi-Modal LLMs über Token Merging und Pruning AIM:通过 Token 兼并和预留的多模式LMs的适应性推理 2412.03248v2 -
237 07-29 Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers Bewertung der kognitiven Realität der spanischen unregelmäßigen morphomischen Muster: Menschen vs. Transformers 评估西班牙非正常染色体模式的认知现实:人类与变异体 2507.21556v1 -
238 07-29 C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung C2-Evo:共同演进的多模式数据和自我改进理由模型 2507.16518v2 -
239 07-29 Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden 人类和大语言模式产生的文本的语言和嵌入式图解 2507.13614v2 -
240 07-29 Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri Achten Sie auf die Sprachlücke in digitalen Geisteswissenschaften: LLM-Aided Translation of SKOS Thesauri 注意数字人文中的语言差距:SKOS Thesauri的LLM辅助翻译 2507.19537v2 -
241 07-29 Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator Zeichen als Zeichen: Ein retrieval-erweiterter Mehrsprachiger Zeichen-Generator 标为 Tokens 的符号: 一个检索增强的多语种手语手语生成器 2411.17799v3 -
242 07-29 MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation MAGIC: Multi-Hop- und Graphenbasierter Benchmark für Inter-Kontext-Konflikte in der retrieval-generierten Generation MAGIC: 回收后一代人中多重和基于图表的多重和基于图表的相互冲突基准 2507.21544v1 -
243 07-29 Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language Modern Uighur Dependency Treebank (MUDT): Ein integriertes morphosyntaktisches Framework für eine ressourcenarme Sprache 现代维吾尔依赖性树库(MUDT): 一种低资源语言综合磷合成法框架 2507.21536v1 -
244 07-29 Automatic Classification of User Requirements from Online Feedback – A Replication Study Automatische Klassifizierung der Benutzeranforderungen aus Online-Feedback – Eine Replikationsstudie 在线反馈用户要求自动分类 – – 复制研究 2507.21532v1 -
245 07-29 HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v2 -
246 07-29 TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling TriangleMix: Ein verlustfreies und effizientes Aufmerksamkeitsmuster für den langen Kontext 三角组合:长期预填无损高效关注模式 2507.21526v1 -
247 07-29 Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting Modellfreies Spekulatives Dekodieren für Transformer-basierte ASR mit Token-Map-Entwurf 采用 Token 地图起草的基于变换器的ASR无示范投机代号 2507.21522v1 -
248 07-29 Simulated patient systems are intelligent when powered by large language model-based AI agents Simulierte Patientensysteme sind intelligent, wenn sie von großen modellbasierten AI-Agenten angetrieben werden 由大型语言模型型人工智能代理器供电时,模拟的病人系统是智能系统 2409.18924v3 -
249 07-29 What Does it Mean for a Neural Network to Learn a “World Model”? Was bedeutet es für ein neurales Netzwerk, ein “Weltmodell” zu lernen? 神经网络学习“世界模型”意味着什么? 2507.21513v1 -
250 07-29 Persona Vectors: Monitoring and Controlling Character Traits in Language Models Persona-Vektoren: Überwachung und Kontrolle von Charaktereigenschaften in Sprachmodellen 人向量:监测和控制语言模式中的字符轨迹 2507.21509v1 -
251 07-29 The Carbon Cost of Conversation, Sustainability in the Age of Language Models Die CO2-Kosten des Gesprächs, Nachhaltigkeit im Zeitalter der Sprachmodelle 对话的碳成本、语言模式时代的可持续性 2507.20018v2 -
252 07-29 Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz 努力利用LLM女士实现可靠的证据生产:神经-双曲方法 2505.14479v4 -
253 07-29 VN-MTEB: Vietnamese Massive Text Embedding Benchmark VN-MTEB: Vietnamesisch Massiver Text Einbettung Benchmark VN-MTEB:越南大规模文本嵌入基准 2507.21500v1 -
254 07-29 Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen 为采用大语言模式的高级指示提供激励理由 2506.01413v5 -
255 07-29 Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning Low-Confidence Gold: Verfeinerung von Low-Confidence-Proben für effizientes Instruktionstuning 低信任金:改进低信任金样本,以进行高效教学计费 2502.18978v4 -
256 07-29 Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering Sem-DPO: Semantische Inkonsistenz bei der Preference-Optimierung für Prompt Engineering mindern Sem-DPO: 减轻在优先优化即时工程方面的语义不一致现象 2507.20133v2 -
257 07-29 The pitfalls of next-token prediction Die Fallstricke der Next-Token-Vorhersage 下吨预测的陷阱 2403.06963v3 -
258 07-29 Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs Verbesserung der Aufgabenvielfalt bei der Label-Effizient überwachten Feinsteuerung von LLMs 改进LLMML在标签高效监督监督下改进任务多样性 2507.21482v1 -
259 07-29 Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench Welche LLMs bekommen den Spaß? Mit HumorBench nicht-STEM-vernünftige Fähigkeiten beweisen 哪个LLMs得到的笑话? 2507.21476v1 -
260 07-29 BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data BIG5-CHAT: LLM-Persönlichkeiten durch Schulung auf menschenverändernden Daten gestalten BIG5-CHAT:通过提供人际数据培训塑造专业人才 2410.16491v3 -
261 07-29 Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning Weiche Einspritzung von Task-Embeddings Outperforms Prompt-Based In-Context Learning 任务嵌入器的软输入超出迅速基于信息学习的绩效 2507.20906v2 -
262 07-29 Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour Auf dem Weg zu lokal einsetzbaren großformatigen großformatigen Sprachmodellen für Modewahlverhalten 以当地可部署的优质因果大语言模式进行模式选择行为 2507.21432v1 -
263 07-29 LLAMAPIE: Proactive In-Ear Conversation Assistants LLAMAPIE: Proaktive In-Ear-Gesprächsassistenten LLAMAPIE: 主动的在轨在轨对话助理 2505.04066v2 -
264 07-29 Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling Bergbau-Intrinsische Belohnungen aus LLM-Hidden States für effiziente Best-of-N-Probenahme LLM隐藏国为高效率最佳采样而从LLM公司获得的采矿内部奖赏 2505.12225v2 -
265 07-29 MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations MemTool: Optimierung der Kurzzeit-Speicherverwaltung für dynamisches Werkzeug beim Aufrufen von LLM Agent Multi-Turn-Konversationen MemTool:优化短期内存管理,以便利用动态工具在LLM代理多转对话中打电话 2507.21428v1 -
266 07-29 ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs ReGATE: Schneller und besser lernen mit weniger Token in MLLMs ReGATE:与较少的男、女、女、女、男、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女 2507.21420v1 -
267 07-28 (1) Teaching Language Models To Gather Information Proactively Sprachmodelle lehren, um Informationen proaktiv zu sammeln 积极主动地收集资料的教学语言模式 2507.21389v1 -
268 07-28 Ai2 Scholar QA: Organized Literature Synthesis with Attribution Ai2 Scholar QA: Organisierte Literatursynthese mit Attribution Ai2学者QA:有组织文学综述与归属 2504.10861v2 -
269 07-28 Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge Beyond the Reported Cutoff: Wo große Sprachmodelle auf finanzielles Wissen zurückfallen 超越报告的截止点:大语言模式对财务知识的缺陷 2504.00042v2 -
270 07-28 Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle 3:以完全开放的大型音频语言模式推进音频情报 2507.08128v2 -
271 07-28 Turbocharging Web Automation: The Impact of Compressed History States Turbocharging Web Automation: Die Auswirkungen von Komprimierten Geschichte Staaten 涡轮连载网络自动化:压缩历史国家的影响 2507.21369v1 -
272 07-28 StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation StructText: Ein synthetischer Table-to-Text-Ansatz für Benchmark-Erzeugung mit multidimensionaler Bewertung 条形图文本:以多层次评价方式编制基准的基准的合成表到文本方法 2507.21340v1 -
273 07-28 A Deep Learning Automatic Speech Recognition Model for Shona Language Ein Deep Learning automatische Spracherkennung Modell für Shona Sprache Shona语言深学习自动语音识别模式 2507.21331v1 -
274 07-28 SQuat: Subspace-orthogonal KV Cache Quantization SQuat: Subraum-orthogonale KV-Cache-Quantisierung Suat: 子空间- orthogonal KV 缓存缓存量化 2503.24358v2 -
275 07-28 Do Large Language Models Understand Morality Across Cultures? Verstehen große Sprachmodelle Moral über Kulturen hinweg? 大语言模式是否理解各种文化的道德? 2507.21319v1 -
276 07-28 Can human clinical rationales improve the performance and explainability of clinical text classification models? Können menschliche klinische Grundlagen die Leistungsfähigkeit und Erklärbarkeit klinischer Textklassifikationsmodelle verbessern? 人类临床原理能否改善临床文本分类模型的性能和解释性? 2507.21302v1 -
277 07-28 FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung FlaignEvalMMM:综合多式联运模式评价灵活框架 2506.09081v3 -
278 07-28 Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI Narrative Context Protocol: Ein Open Source Storytelling Framework für generative KI 叙述性背景议定书:开源的开源描述框架 2503.04844v5 -
279 07-28 Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues Schulung von LLM-basierten Tutoren zur Verbesserung der Lernergebnisse von Studierenden in Dialogen 培训基于LLLM LLM的辅导员,以改善学生在对话中的学习成果 2503.06424v2 -
280 07-28 LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen LeMix:关于多功能保U系统的LLM培训和推理的LLM培训统一日程安排 2507.21276v1 -
281 07-28 Levels of Analysis for Large Language Models Analyseebenen für große Sprachmodelle 大语言模式分析水平 2503.13401v2 -
282 07-28 CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting CompoST: Ein Benchmark für die Analyse der Fähigkeit von LLMs, Fragen in einer QALD-Einstellung kompositorisch zu interpretieren CompoST:在质量和限期设计中分析高管公司在组成上解释问题的能力的基准 2507.21257v1 -
283 07-28 Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach Bangla BERT für hyperparteiische Nachrichtenerkennung: Ein halbüberwachter und erklärbarer KI-Ansatz 超党派新闻探测孟加拉BERT:半监督和可解释的AI方法 2507.21242v1 -
284 07-28 Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability Öffentliche Wahrnehmung der Kriminalität in Bangladesch verstehen: Ein transformerbasierter Ansatz mit Erklärbarkeit 了解孟加拉国公众对犯罪的认识:基于变革和可解释的方法 2507.21234v1 -
285 07-28 Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation Multi-Agent-as-Judge: LLM-Agent-basierte automatisierte Evaluierung mit multidimensionaler menschlicher Bewertung ausrichten 多边代理法官:将LLM-基于代理的自动评价与多层次的人力评价统一起来 2507.21028v1 -
286 07-28 Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation Verbesserung der LLM-Vernunft mit iterativem DPO: Eine umfassende empirische Untersuchung 与具有迭接作用的DPO:全面经验调查加强LLM 2503.12854v3 -
287 07-28 Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions Bewertung des Versprechens und der Fälle von LLMs bei Hiring-Entscheidungen 评估LLM女士在雇用决定中的许诺和机会 2507.02087v2 -
288 07-28 Memorization in Fine-Tuned Large Language Models Auswendiglernen in fein getönten großen Sprachmodellen 微微调大语言模型的记忆 2507.21009v1 -
289 07-28 LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning LoRA-PAR: Ein flexibler Dual-System-LoRA-Partitionsansatz für effizientes LLM-Feintuning LOLAR-PAR:高效 LLM 微调的灵活双系统滚动分割法 2507.20999v1 -
290 07-28 GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding GUI-G$$2美元:GUI地基的高斯奖赏模型 2507.15846v3 -
291 07-28 Scaling Physical Reasoning with the PHYSICS Dataset Skalierung der physikalischen Vernunft mit dem PHYSICS-Datensatz 利用PHYSICS数据集调整物理理由 2506.00022v3 -
292 07-28 Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands Cog-TiPRO: Iterative Prompt-Verfeinerung mit LLMs zur Erkennung kognitiver Deklination über Longitudinal Voice Assistant-Befehle COg-TiPRO:与LLMs一起与LLMs进行自动迅速改进,以便通过纵向语音助理指挥部检测认知衰减 2505.17137v2 -
293 07-28 A Survey of Deep Learning for Geometry Problem Solving Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen 解决几何问题深层学习调查 2507.11936v4 -
294 07-28 Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen 通过差异学习发现多语种轻视认知缺陷的单形多语种描述 2505.17067v3 -
295 07-28 Your AI, Not Your View: The Bias of LLMs in Investment Analysis Ihre KI, nicht Ihre Ansicht: Die Bias von LLMs in der Investitionsanalyse 您的AI, 而不是您的观点: 投资分析中LLM 的偏见 2507.20957v1 -
296 07-28 Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models Mind the Gap: Konformative Dekodierung zur Verbesserung der Output-Vielfalt von instruction-tuned großen Sprachmodellen 注意差距:改进教学型大语言模式产出多样性的合规化配方 2507.20956v1 -
297 07-28 Dissecting Persona-Driven Reasoning in Language Models via Activation Patching Persona-Driven Reasoning in Sprachmodellen per Aktivierungs-Patching auflösen 通过激活补丁在语言模型中通过激活补丁解剖人-人-驱动原因 2507.20936v1 -
298 07-28 LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking LLM2TEA: Agentischer AI-Designer für Entdeckung mit generativem evolutionären Multitasking LLM2TEA: 利用产生进化多任务探索的代理AI 设计器 2406.14917v3 -
299 07-28 FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models FHSTP@EXIST 2025 Benchmark: Sexismuserkennung mit transparenten Sprachkonzepten Engpassmodelle FHSTP@EXIST 2025 基准:用透明言论概念瓶颈模型探测性别主义 2507.20924v1 -
300 07-28 MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation MediQAl: Eine französische medizinische Frage zur Beantwortung von Datensätzen für Wissens- und Begründungsbewertung MediQAl:用于知识和合理评估的法国医学问题解答数据集 2507.20917v1 -
301 07-28 Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models Benchmarking Open-Ended Audio Dialogue Understanding für große Audio-Language-Modelle 确定大型音频语言模型不限成员名额音频对话理解基准 2412.05167v2 -
302 07-28 Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery? Sollte Top-Down-Clustering Grenzen in unüberwachten Word Discovery beeinflussen? 在无人监督的“发现字”中, 上下层群集是否应该影响边界? 2507.19204v2 -
303 07-28 $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement $A^2R^2$: Verbesserung der Img2LaTeX-Umwandlung durch visuelles Reasoning mit aufmerksamkeitsgeführter Verfeinerung $A2R2美元:通过关注引导的精炼,通过视觉理性推进Img2LaTeX转换 2507.20890v1 -
304 07-28 Enhancing Project-Specific Code Completion by Inferring Internal API Information Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen 通过推断内部API信息加强具体项目法规的完成 2507.20888v1 -
305 07-28 Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings Nutzung von Open-Source-Großsprachenmodellen für die Extraktion klinischer Informationen in ressourcenbeschränkten Einstellungen 利用开放源码大语言模型,在受资源限制的环境下进行临床信息采掘 2507.20859v1 -
306 07-28 A survey of diversity quantification in natural language processing: The why, what, where and how Eine Übersicht der Diversitätsquantifizierung in der natürlichen Sprachverarbeitung: Das Warum, Was, Wo und Wie 自然语言处理中多样性量化调查:原因、内容、地点和方式 2507.20858v1 -
307 07-28 Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities Sprachenmodellierung für die Zukunft der Finanzen: Eine Umfrage zu Metrics, Aufgaben und Datenmöglichkeiten 未来融资语言建模:计量、任务和数据机会调查 2504.07274v2 -
308 07-28 Latent Inter-User Difference Modeling for LLM Personalization Latent Inter-User Difference Modeling für LLM Personalisierung LLM个性化不同模型 2507.20849v1 -
309 07-28 Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models Kritik des unreinen Grundes: Enthüllen des Argumentationsverhaltens medizinischer Großsprachenmodelle 简便理由的批评:统一医学大语言模式的推理行为 2412.15748v2 -
310 07-28 FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings FocalPO: Verbesserung der Preference-Optimierung durch Fokussierung auf korrekte Preference-Rankings 重点:通过注重正确的优先排序,加强优惠优化 2501.06645v3 -
311 07-28 Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models Automatisieren der thematischen Überprüfung der Prävention von zukünftigen Todesfällen Berichte: Nachahmung der ONS-Kinder-Selbstmord-Studie mit großen Sprachmodellen 对预防今后死亡报告进行自动化专题审查:利用大语言模式复制ONS儿童自杀研究 2507.20786v1 -
312 07-28 On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey Über die Rolle von vorgebildeten Sprachmodellen in allgemeinen Text-Embeddings: Eine Umfrage 关于 “ 预先培训的语言模式在一般用途文本嵌入中所起的作用:调查 “ 2507.20783v1 -
313 07-28 TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架 2507.18190v2 -
314 07-28 The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints Die Auswirkungen von LoRA-Adaptern auf LLMs für die klinische Textklassifikation unter Computational und Data Constraints LoRA适应器对在计算和数据限制下临床文本分类的LLMs的影响 2407.19299v3 -
315 07-28 Multilingual Self-Taught Faithfulness Evaluators Mehrsprachige Selbstlernende Bewertung von Treue 多语言自学自学信仰评价员 2507.20752v1 -
316 07-28 Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study Untersuchung struktureller Pruning- und Recovery-Techniken zur Komprimierung multimodaler Großsprachenmodelle: Eine empirische Studie 压缩多式大语言模式结构保护和恢复调查技术:经验研究 2507.20749v1 -
317 07-28 Everything is a Video: Unifying Modalities through Next-Frame Prediction Alles ist ein Video: Vereinheitlichen von Modalitäten durch Next-Frame-Vorhersage 一切都是一部视频:通过下框架预测实现统一的方式 2411.10503v2 -
318 07-28 Group Sequence Policy Optimization Optimierung der Gruppensequenzpolitik 组序列政策优化 2507.18071v2 -
319 07-28 Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models Text2VLM: Anpassung von Text-Only-Datensätzen an die Auswertung von Alignment-Trainings in visuellen Sprachmodellen Text2VLM: 调整纯文本数据集以评价视觉语言模型的对齐培训 2507.20704v1 -
320 07-28 Computational Analysis of Character Development in Holocaust Testimonies Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen 大屠杀证词特征发展计算分析 2412.17063v4 -
321 07-28 When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification Wenn Scale auf Vielfalt trifft: Bewertung von Sprachmodellen auf feinkörnige Mehrsprachigkeitsprüfung 规模达到多样性时:评价精细多语言索赔核实的语言模式 2507.20700v1 -
322 07-28 Geometric-Mean Policy Optimization Geometrisch-Mean-Policy-Optimierung 几何海洋政策优化 2507.20673v1 -
323 07-28 Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs Benchmarking Graph Neural Networks für die Dokumentenlayout-Analyse in öffentlichen Angelegenheiten 用于公共事务文件布局分析的图表神经网络 2505.14699v2 -
324 07-28 Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study Nachweis von unerwünschten Arzneimittelereignissen in niederländischen klinischen Textdokumenten mit Transformer-Modellen: Benchmark-Studie 利用变换模型发现荷兰临床免费文本文件中的不良毒品事件:基准研究 2507.19396v2 -
325 07-28 Ontology-Enhanced Knowledge Graph Completion using Large Language Models Ontologie-erweiterte Wissensgraphenvervollständigung mit großen Sprachmodellen 利用大语言模式完成本部强化知识图 2507.20643v1 -
326 07-28 Explainable Synthetic Image Detection through Diffusion Timestep Ensembling Erklärbare Synthetische Bilderkennung durch Diffusionszeitpunkt Zusammenbauen 通过传播时间步骤组合进行可解释的合成图像探测 2503.06201v2 -
327 07-28 Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior Vor der Empörung: Herausforderungen und Fortschritte bei der Vorhersage von Online-Antisozialverhalten 暴政前:预测在线反社会行为的挑战和进展 2507.20614v1 -
328 07-28 AutoLibra: Agent Metric Induction from Open-Ended Feedback AutoLibra: Agent Metric Induktion aus offenem Feedback AutoLibra: 不限名额反馈的计量介绍代理 2505.02820v2 -
329 07-28 ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning ZSE-Cap: Ein Zero-Shot-Ensemble für Bildwiederherstellung und Prompt-Führung ZSE-Cap: 用于图像检索和即时指导说明的零热组合 2507.20564v1 -
330 07-28 Enhancing Hallucination Detection via Future Context Halluzinationserkennung durch zukünftigen Kontext verbessern 通过未来环境加强幻觉探测 2507.20546v1 -
331 07-28 From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought Von Antworten zu Rationalen: Selbstjustierung multimodaler Vernunft mit answer-oriented Chain-of-Thought 从答案到理由:自调整的多式联运理由与以回答为主的探索链 2507.02984v2 -
332 07-28 Kimi K2: Open Agentic Intelligence Kimi K2: Offene Agentische Intelligenz Kimi K2:开放特工情报 2507.20534v1 -
333 07-28 SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz 安全工作-R1:根据AI-45$ circ}$ 法发展安全和情报 2507.18576v2 -
334 07-28 Otter: A Multi-Modal Model with In-Context Instruction Tuning Otter: Ein Multi-Modal-Modell mit In-Context-Anleitung Tuning Ottter:具有内文指导图纸的多模式模型 2305.03726v2 -
335 07-28 Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations Dialoge von Dissent: Thematische und rhetorische Dimensionen von Hass und Gegenhass in Social Media-Gesprächen 不同意见对话:社会媒体对话中的仇恨和反仇恨言论的主题和风湿方面 2507.20528v1 -
336 07-28 Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards Versehentliche Sicherheitslücke: Faktoren bei Feinsteuerung, die das Modell schützen 意外脆弱性:改变模式保障保障措施的微调因素 2505.16789v2 -
337 07-28 Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition Sicherheitsherausforderungen bei der Bereitstellung von KI-Agenten: Einblicke aus einem groß angelegten öffentlichen Wettbewerb AI 代理部署在安全方面面临的挑战:大规模公共竞争的展望 2507.20526v1 -
338 07-28 AQUA: A Large Language Model for Aquaculture & Fisheries AQUA: Ein großes Sprachmodell für Aquakultur und Fischerei AQUA:水产养殖和渔业大语言模式 2507.20520v1 -
339 07-28 Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Pre-Training 推进藏藏人大语言模式,提供 “ 扩展数据 “ 和 “ 持续培训前 “ 。 2507.09205v4 -
340 07-28 REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v7 -
341 07-28 Customize Multi-modal RAI Guardrails with Precedent-based predictions Multimodale RAI-Guardrails mit vorausschauenden Vorhersagen anpassen 定制具有先例预测的多式RAI护卫车 2507.20503v1 -
342 07-28 Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT Pruning for Performance: Effiziente Idiom- und Metaphor-Klassifikation in Low-Resource Konkani mit mBERT 利用mBERT, 低资源 Konkani 中高效的低资源 Konkani 和同义词分类 2506.02005v2 -
343 07-28 Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems Sprechen in Worten, Denken in Logik: Ein Dual-Process-Framework in QA-Systemen 用文字说,用逻辑思考:质量保证系统中的双重处理框架 2507.20491v1 -
344 07-28 Juru: Legal Brazilian Large Language Model from Reputable Sources Juru: Rechtliches brasilianisches Large Language Model aus seriösen Quellen Juru:来自有名来源的巴西大语言法律模型 2403.18140v2 -
345 07-28 Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents Benutzer vor ihnen selbst schützen: Schutz der kontextuellen Privatsphäre in Interaktionen mit Gesprächspartnern 保护用户免受自我伤害:在与交流代理人的互动中保护环境隐私 2502.18509v2 -
346 07-28 Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM Ähnliches Beispiel verbessern Retrieval-Ranking-Performance durch Revisiting RankSVM 通过重审RanksSVM改进类似案例检索排名 2502.11131v2 -
347 07-28 In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents In Prospect und Retrospect: Reflektierendes Speichermanagement für langfristige Personalisierte Dialogagenten 展望和回顾:长期个人化对话代理人的反思记忆管理 2503.08026v2 -
348 07-27 (7) Critiques of World Models Kritik an Weltmodellen 世界模式的证明 2507.05169v3 -
349 07-27 CodeNER: Code Prompting for Named Entity Recognition CodeNER: Codeaufforderung für die benannte Entitätserkennung 识别名称实体的代码提示代码 2507.20423v1 -
350 07-27 Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks? Umfrage zu NLU-Benchmarks Diagnose Linguistische Phänomene: Warum nicht Diagnose-Benchmarks standardisieren? NLU基准诊断语言神话调查:为什么不使诊断基准标准化? 2507.20419v1 -
351 07-27 CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning CONCAP: Über das Englische hinaussehen mit Konzepten Retrieval-Augmented Captioning CONCACM: 以概念检索增强说明方式在英语以外看问题 2507.20411v1 -
352 07-27 Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training Clarify lernen: Multiturn-Gespräche mit aktionsbasiertem Kontrast-Selbst-Training 学习澄清:与基于行动的差异性自我培训进行多方向对话 2406.00222v2 -
353 07-27 Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung 与基于可控 LLM 的可控 LLM 分类的 Sparse 自动编码器的自调节 2502.14133v3 -
354 07-27 Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations Kognitive Denkkette: Strukturierte multimodale Begründung über soziale Situationen 认知思考链:社会状况的结构性多模式原因 2507.20409v1 -
355 07-27 Length Representations in Large Language Models Längendarstellungen in großen Sprachmodellen 大语言模式中的长长代表 2507.20398v1 -
356 07-27 Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation 以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v2 -
357 07-27 Memorization: A Close Look at Books Auswendiglernen: Ein genauer Blick auf Bücher 记忆化:对书籍的近视 2504.12549v2 -
358 07-27 Scaling Analysis of Interleaved Speech-Text Language Models Skalierungsanalyse interleaved Speech-Text Language Models 剖分间语音-文字语言模式扩大分析 2504.02398v2 -
359 07-27 RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing RMTBench: Benchmarking von LLMs durch Multi-Turn-Benutzer-Centric-Rollenspiel RMTBench:通过多发用户中心发挥作用,确定LLMs基准 2507.20352v1 -
360 07-27 DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns DYNARTmo: Ein dynamisches Artikulationsmodell zur Visualisierung von Sprachbewegungsmustern DYNARTmo:语音移动模式视觉化动态脉动模型 2507.20343v1 -
361 07-27 FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation FMSD-TTS: Wenige Aufnahmen Multi-Speeaker Multi-Dialekt Text-zu-Speech-Synthese für Ü-Tsang, Amdo und Kham Speech Dataset Generation FMSD-TTS:为于赞、阿姆多和康言语数据集制作而制作的微小多声多声多功能多语音文本到语音合成合成 2505.14351v2 -
362 07-27 ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios ELMES: Ein automatisierter Rahmen für die Bewertung großer Sprachmodelle in Bildungsszenarien ELMES:评估教育情景中大语言模式自动框架 2507.22947v1 -
363 07-27 What is Wrong with Perplexity for Long-context Language Modeling? Was ist falsch an Verwirrung für Langkontext-Sprachenmodellierung? 长文本语言建模的复杂性有什么问题? 2410.23771v5 -
364 07-27 Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation Förderung dialektischer arabischer zu moderner arabischer Standard-Maschinenübersetzung 向现代标准阿拉伯文机器翻译推广阿拉伯语 2507.20301v1 -
365 07-27 Real-time Factuality Assessment from Adversarial Feedback Echtzeit-Faktualitätsbeurteilung aus dem Adversarial Feedback 从反反向反馈反馈中实时进行实况评估 2410.14651v3 -
366 07-27 SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration SciToolAgent: Ein wissensbasierter wissenschaftlicher Agent für Multi-Tool-Integration SciToolAgent: 多工具整合知识图表驱动科学代理 2507.20280v1 -
367 07-27 What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations Welche Sprache(n) denkt Aya-23? Wie Mehrsprachigkeit die Repräsentationen der internen Sprache beeinflusst Aya-23 思考什么语言?多语言如何影响内部语言代表性? 2507.20279v1 -
368 07-27 Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning Agent-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning Agentar Fin-Fin-R1:通过域域专门知识、培训效率和高级理由加强金融情报 2507.16802v4 -
369 07-27 MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning MoL-RL: Destillieren von mehrstufigem Umweltfeedback in LLMs zur feedbackunabhängigen Begründung MoL-RL:将多层环境反馈保留到LLMs,用于提供反馈-独立理由 2507.20278v1 -
370 07-27 ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech ChildGuard: Ein spezieller Datensatz zur Bekämpfung von kindgewordener Hassrede 儿童指南:打击针对儿童的仇恨言论专门数据集 2506.21613v2 -
371 07-27 EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms EMBRACE: Inclusive Opinion Representation gestalten, indem implizite Gespräche mit sozialen Normen ausgerichtet werden EMBRACE:通过与社会规范的关联性交流,形成包容性的见解代表制 2507.20264v1 -
372 07-27 Post-Completion Learning for Language Models Post-Completion-Lernen für Sprachmodelle 语文模式完成后学习 2507.20252v1 -
373 07-27 Modeling Professionalism in Expert Questioning through Linguistic Differentiation Modellierung von Professionalität in der Expertenbefragung durch sprachliche Differenzierung 通过语言差异问题专家提问的示范专业精神 2507.20249v1 -
374 07-27 Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers Contrast-CAT: Kontrastierende Aktivierungen für verbesserte Interpretierbarkeit in Transformer-basierten Textklassifikatoren 反对-CAT:在基于变换器的文本分类中增强解释力的对比活动 2507.21186v1 -
375 07-27 Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models Reframe Your Life Story: Interaktiver Erzähltherapeut und innovative Moment-Assessment mit großen Sprachmodellen 重构你的生活故事:与大语言模式互动叙述治疗师和创新时间评估 2507.20241v1 -
376 07-27 DoubleDipper: Improving Long-Context LLMs via Context Recycling DoubleDipper: Verbesserung der Langkontext-LLMs über Kontext-Recycling 双重顶点:通过上下文再循环改进长文本LLMs 2406.13632v4 -
377 07-27 Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines Lernende-LLM-Chatbot-Interaktionen verstehen und die Auswirkungen von Sofortrichtlinien verstehen 了解学习者-LLLM 聊天室互动和推动准则的影响 2504.07840v3 -
378 07-27 Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation Co-NAML-LSTUR: Ein kombiniertes Modell mit attentivem Multi-View-Lernen und Langzeit- und Kurzzeit-Benutzervertretungen für News-Empfehlungen NAML-LTUR:与多视学习和新闻建议长期及短期用户代表相结合的综合模式 2507.20210v1 -
379 07-27 IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs IQ-Test für LLMs: Ein Bewertungsrahmen für die Entdeckung von Kernkompetenzen in LLMs LLMLM的IQ测试:LLM中核心技能覆盖的评估框架 2507.20208v1 -
380 07-27 Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data Günstiges Lernen: Maximierung der Leistungsfähigkeit von Sprachmodellen für die Sozialdatenwissenschaft mit minimalen Daten 廉价学习:利用最低数据使社会数据科学语言模型的绩效最大化 2401.12295v2 -
381 07-27 Diversity-Enhanced Reasoning for Subjective Questions Diversity-Enhanced Reasoning für subjektive Fragen 主观问题的多样性强化理由 2507.20187v1 -
382 07-27 SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding SessionIntentBench: Ein Multi-Task Inter-Session Intention-Shift Modelling Benchmark für E-Commerce Kundenverhalten Verständnis A. 会议内容:电子商务客户行为理解的多任务、多任务、跨会期、出于利益转移的 电子商业客户行为理解示范基准 2507.20185v1 -
383 07-27 SGPO: Self-Generated Preference Optimization based on Self-Improver SGPO: Selbsterzeugte Preference-Optimierung auf Basis von Self-Improver SGPO:基于自我改造的自发优惠优化 2507.20181v1 -
384 07-27 Checklist Engineering Empowers Multilingual LLM Judges Checkliste Engineering Empowers Mehrsprachige LLM-Richter 多语种LLM法官 2507.06774v2 -
385 07-27 Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive 日本大语言模型中从背景角度分析的交叉比阿语 2506.12327v2 -
386 07-27 Goal Alignment in LLM-Based User Simulators for Conversational AI Zielausrichtung in LLM-basierten Benutzersimulatoren für KI 在基于LLM的LLM用户模拟器中实现目标对齐 2507.20152v1 -
387 07-27 The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models The Policy Cliff: Eine theoretische Analyse von Belohnungs-Policy-Karten in großen Sprachmodellen 政策悬崖:大语言模式奖励政策图的理论分析 2507.20150v1 -
388 07-27 Multi-Agent Interactive Question Generation Framework for Long Document Understanding Multi-Agent Interactive Question Generierung Framework for Long Document Understanding 长期文件理解问题多机构互动问题生成框架 2507.20145v1 -
389 07-27 Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG Multi-Stage Verifikations-Centric Framework zur Eindämmung der Halluzination in Multi-Modal RAG 多模式RAG中减轻幻觉多阶段核查-中心框架 2507.20136v1 -
390 07-27 EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models EvoSLD: Automatisierte Neural Scaling Law Discovery mit großen Sprachmodellen EvoSLD: 用大语言模型发现自动神经放大法 2507.21184v1 -
391 07-27 When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars Wann funktioniert Metadata Conditioning (NOT) für Sprachmodell-Vorschulungen? Eine Studie mit kontextfreien Grammatiken 元数据条件(NOT)何时能为语言示范培训前培训提供语言示范?无背景语法研究 2504.17562v2 -
392 07-27 MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge MaPPO: Maximale Posteriori-Preference-Optimierung mit vorherigem Wissen MaPPPO: 与先前知识最优化的后世偏好 2507.21183v1 -
393 07-27 TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling TIB-STC: Ein großflächiger strukturierter tibetischer Benchmark für ressourcenarme Sprachmodellierung TIB-STC: 低资源语言建模的西藏大型结构化基准 2503.18288v4 -
394 07-27 Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice Seed LiveInterpret 2.0: End-to-End Simultanübersetzung mit Ihrer Stimme 种子实况解释2.0:用声音翻译终端到终端同声语音语音 2507.17527v3 -
395 07-27 Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio Messung von Informationsverzerrung bei hierarchischer Ultralangem Novel Reconstruction:The Optimal Expansion Ratio 测量高层次超长新世纪重建中的信息扭曲:最佳扩展比率 2505.12572v2 -
396 07-27 Language Models Resist Alignment: Evidence From Data Compression Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression 语言模型阻力对齐:数据压缩中的证据 2406.06144v5 -
397 07-27 AI-Driven Generation of Old English: A Framework for Low-Resource Languages AI-Driven Generation of Old English: Ein Rahmen für Low-Resource-Sprachen AI-Driven 一代老英语:低资源语言框架 2507.20111v1 -
398 07-27 Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen 超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v2 -
399 07-27 EcoTransformer: Attention without Multiplication EcoTransformer: Achtung ohne Multiplikation 生态转换:注意不乘数 2507.20096v1 -
400 07-27 ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models ProsodyLM: Enthüllen der neu entstehenden Prosody-Verarbeitungsfähigkeiten in Sprachmodellen ProsodyLM: 解决语言模式中新出现的处理能力问题 2507.20091v1 -
401 07-27 Reinforcement learning fine-tuning of language model for instruction following and math reasoning Verstärktes Lernen der Feinabstimmung des Sprachmodells für Unterricht und Mathe-Reinigung 强化学习,微调用于教学的语文模式和数学推理 2506.21560v2 -
402 07-26 (6) The Devil is in the EOS: Sequence Training for Detailed Image Captioning Der Teufel ist im EOS: Sequenztraining für detaillierte Bildunterschriften 魔鬼在EOS:详细图像说明的序列训练 2507.20077v1 -
403 07-26 PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training PITA: Präferenz-geführte Inferenz-Zeit-Ausrichtung für LLM nach dem Training PITA:LLM培训后培训的优先指导推论-时间协调 2507.20067v1 -
404 07-26 RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation RAG in the Wild: Über die (In)Wirksamkeit von LLMs mit Mixture-of-Knowledge Retrieval Augmentation 野生ROG:关于利用混合知识回收增加的LLMs(内)效力 2507.20059v1 -
405 07-26 A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications Ein Tensor-basierter Compiler und eine Laufzeit für die Spezifikationen des Neuron-Level DNN Certifier 一个基于 Tensor 的编纂器和中子级别 DNN 验证符规格的运行时间 2507.20055v1 -
406 07-26 $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning $K^4$: Online Log Anomalienerkennung durch unüberwachtes Lernen 4K元:在线记录异常探测不受监督的典型学习 2507.20051v1 -
407 07-26 AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants KI als beratender Partner fördert interkulturelles Empathie für Amerikaner, scheitert aber für lateinamerikanische Teilnehmer 作为审议伙伴的大赦国际促进美国人的文化间同情,但拉丁美洲参与者却失败 2504.13887v2 -
408 07-26 Infogen: Generating Complex Statistical Infographics from Documents Infogen: Erzeugen komplexer statistischer Infografiken aus Dokumenten 信息源:从文件生成复杂的统计图表 2507.20046v1 -
409 07-26 Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs Kolumbianische Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Empfehlungen von LLMs Colombia Worress y juéces canadienses:LLM公司在占领建议中的性别和乡村差别 2505.02456v2 -
410 07-26 FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression FAEDKV: Unendliche Window Fourier-Transformation für unvoreingenommene KV-Cache-Kompression FAEDKV: 用于无偏见的 KV 缓存压缩的无限窗口 Fleier 变换 2507.20030v1 -
411 07-26 Selective Prompt Anchoring for Code Generation Selektive Prompt-Ankerung für die Code-Generierung 代代代代代代代代代代代代代代代代代 代代代代代代代代代代代代代 代代代代代代代代代代代代 2408.09121v6 -
412 07-26 Preference learning made easy: Everything should be understood through win rate Vorliebe Lernen leicht gemacht: Alles sollte durch Win-Rate verstanden werden 首选学习容易:人人都应通过双赢率来理解一切 2502.10505v2 -
413 07-26 Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach Anomalieerkennung in der menschlichen Sprache durch Meta-Learning: Ein wenig heißer Ansatz 通过元学习在人文语言中异常探测: “ 几热 “ 方法 2507.20019v1 -
414 07-26 A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio Eine Praxis des Post-Trainings auf Llama-3 70B mit optimaler Auswahl des zusätzlichen Sprachmischverhältnisses Llama-3-70B培训后做法,最佳选择其他语言混合比率 2409.06624v2 -
415 07-26 MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning MeTHanol: Modularisiertes Denken von Sprachmodellen mit Intermediate Layer Thinking, Decodierung und Bootstrapping Reasoning METHanol:含有中间层思考、解毒和诱导理由的模块化思维语言模型 2409.12059v5 -
416 07-26 VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering VLQA: Der erste umfassende, große und hochqualitative vietnamesische Datensatz für die Beantwortung rechtlicher Fragen VLQA:用于法律问题解答的第一综合、大、高质量越南数据集 2507.19995v1 -
417 07-26 Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model Verbesserung der Leistungsfähigkeit sequentieller Empfehlungssysteme mit einem erweiterten Großsprachenmodell 利用扩展大语言模式改进序列建议系统的绩效 2507.19990v1 -
418 07-26 Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge Robustes Daten-Wasserzeichen in Sprachmodellen durch Einspritzen fiktiver Kenntnisse 在语言模型中,通过输入有说服力的知识在语言模型中进行强力数据水上标记 2503.04036v3 -
419 07-26 Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization Leveraging Fine-Tuned Large Language Models for Interpretable Pankreatic Cystic Lesion Feature Extraction and Risk Categorization 利用微量使用大语言模型来利用可解释性恐慌性锥性电磁性悬浮物地物采掘和风险分类 2507.19973v1 -
420 07-26 Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text Text2Vis: Ein anspruchsvolles und vielfältiges Benchmark zur Generierung multimodaler Visualisierungen aus Text Text2Vis: 从文本中生成多式视觉化的质疑性和多样化基准 2507.19969v1 -
421 07-26 KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models KLAAD: Verfeinerung von Aufmerksamkeitsmechanismen zur Reduzierung gesellschaftlicher Bias in generativen Sprachmodellen CPRAD: 完善关注机制,在产生语言模式中减少社会偏见 2507.19962v1 -
422 07-26 Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v4 -
423 07-26 Large Language Models Are Human-Like Internally Große Sprachmodelle sind menschlich-innerlich 大语言模型是人与人之间的内部大语言模型 2502.01615v2 -
424 07-26 Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA Aufmerksamkeitsköpfe vor dem Zusammenführen ausrichten: Ein effektiver Weg, MHA in GQA umzuwandeln 合并主题前对齐关注头部对齐:将MAHA转换为GQA的有效途径 2412.20677v2 -
425 07-26 Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v2 -
426 07-26 Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse 《国际边界风险管理框架实际操作:风险分析技术报告》 2507.16534v2 -
427 07-26 The Impact of Fine-tuning Large Language Models on Automated Program Repair Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur 微调大语言模型对自动方案维修的影响 2507.19909v1 -
428 07-26 CaliDrop: KV Cache Compression with Calibration CaliDrop: KV Cache-Kompression mit Kalibrierung CaliDrop: KV 缓存压缩加校准 2507.19906v1 -
429 07-26 A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs Ein Gold Standard Datensatz und Evaluation Framework für Depression Erkennung und Erklärung in Social Media mit LLMs 利用LLMM公司在社会媒体中发现和解释抑郁症的黄金标准数据集和评价框架 2507.19899v1 -
430 07-26 Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs Automatisieren der mathematischen Proof-Generierung mit Large Language Model Agents und Wissensgraphen 使用大语言模型代理和知识图 2503.11657v2 -
431 07-26 Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam Zero-shot Leistung von Generative KI in brasilianischer portugiesischer medizinischer Prüfung 巴西葡萄牙医学考试中创用AI的零弹性能 2507.19885v1 -
432 07-26 Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning Causal Sufficiency und Necessity verbessert Kette-of-Thought-Reasoning C. 因果关系和必要性 改进审议链 理由 2506.09853v2 -
433 07-26 FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models FactReasoner: Ein probabilistischer Ansatz zur Langform-Faktivitätsbewertung für große Sprachmodelle 事实研究者:对大语言模式长期实际评估的概率办法 2502.18573v2 -
434 07-26 The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment Der polnische Vokabular-Größentest: Ein neuartiger adaptiver Test für die rezeptive Vokabular-Bewertung 波兰词汇大小测试:接受词汇评估的新适应性测试 2507.19869v1 -
435 07-26 DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments DRIVE: Disfluency-Rich Synthetic Dialog Data Generierung Framework für intelligente Fahrzeugumgebungen DIVE: 智能车辆环境数据生成框架 2507.19867v1 -
436 07-26 Agentic Reinforced Policy Optimization Agentische verstärkte politische Optimierung 强化政策优化 2507.19849v1 -
437 07-26 Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs Gemeinsames Verständnis von Fehlausrichtung im zielorientierten Dialog: Eine Fallstudie mit Ubuntu Chat Logs 理解目标导向对话框中的共同点不匹配:与Ubuntu聊天日志的案例研究 2503.12370v2 -
438 07-26 AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition AutoSign: Direkte Pose-zu-Text-Übersetzung für die kontinuierliche Erkennung von Zeichensprachen 自动签名: 用于持续手语识别的直导 Pose-to- Text 翻译 2507.19840v1 -
439 07-26 HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs HCAtention: Extreme KV Cache Compression via Heterogenes Aufmerksamkeitsrechnen für LLMs HCAttention:通过不同式注意计算法对LLMs进行极端KV缓存压缩 2507.19823v1 -
440 07-26 A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy Ein strukturierter Bangla-Datensatz von Krankheits-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit 改善诊断准确性疾病 – – 症状协会结构化孟加拉数据集 2506.13610v3 -
441 07-26 LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models LLM-Barber: Block-Aware Rebuilder für Sparsity Maske in One-Shot für große Sprachmodelle LLM-Barber:大语言模型单点单层面罩块件重建器 2408.10631v2 -
442 07-26 Flora: Effortless Context Construction to Arbitrary Length and Scale Flora: Müheloser Kontext Aufbau zu willkürlicher Länge und Skala Flora: 以任意长度和规模建造环境以达到任意长度和规模 2507.19786v1 -
443 07-26 UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities UloRL:Ein Ultra-Long-Output-Verstärkungs-Lernansatz zur Förderung großer Sprachmodelle UloRL: 推进大语言模式解释能力超长输出强化学习方法 2507.19766v1 -
444 07-26 Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs Sind Sie dort Gott? Leichte narrative Anmerkung der christlichen Fiction mit LMs 轻量量级的基督教小说和LMs 2507.19756v1 -
445 07-26 JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models JT-Math: Ein Multi-Stage-Framework für fortgeschrittene mathematische Vernunft in großen Sprachmodellen JT- Math:大语言模型高级数学理由多阶段框架 2507.19748v1 -
446 07-26 Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation Assembly Your Crew: Automatisches Multi-Agenten-Kommunikationstopologie-Design über autoregressive Graphen-Generierung 通过自动递减图形生成将您的组群组合成:自动多剂多剂通信地形设计 2507.18224v2 -
447 07-25 (5) Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs Ta-G-T: Subjektivitätserfassung in Tabelle zur Textgenerierung über RDF Graphen TaG-T:通过 RDF 图表生成文本的表格中主观性捕获 2507.19710v1 -
448 07-25 Scalable MatMul-free Language Modeling Skalierbare MatMul-freie Sprachmodellierung 可缩放 MatMul 无语言建模 2406.02528v7 -
449 07-25 Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks Towards Inclusive NLP: Bewertung komprimierter Mehrsprachiger Transformer über unterschiedliche Sprach-Benchmarks 实现包容性的《国家语言规划:评估跨越不同语文基准的压压压多语种变换器》 2507.19699v1 -
450 07-25 Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks Salsa als nonverbale Sprache – Der CoMPAS3D Datensatz und Benchmarks Salsa 作为一种非语言的成形语言 – – CoMPAS3D数据集和基准 2507.19684v1 -
451 07-25 Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research Navigation auf die Risiken der Verwendung großer Sprachmodelle für die Textannotation in der sozialwissenschaftlichen Forschung 利用大语言模式在社会科学研究中使用文字说明的风险 2503.22040v2 -
452 07-25 Benchmarking Linguistic Diversity of Large Language Models Benchmarking Linguistische Vielfalt großer Sprachmodelle 衡量大语言模式语言多样性的基准 2412.10271v2 -
453 07-25 Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs Haben große Sprachmodelle einen englischen Akzent? Bewertung und Verbesserung der Natürlichkeit von mehrsprachigen LLMs 大语言模式是否有英语中心? 2410.15956v3 -
454 07-25 RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams RoD-TAL: Ein Benchmark für die Beantwortung von Fragen in rumänischen Führerscheinprüfungen RoD-TAL:在罗马尼亚驾驶执照考试中回答问题的基准 2507.19666v1 -
455 07-25 Code-Switching and Syntax: A Large-Scale Experiment Code-Schalten und Syntax: Ein groß angelegtes Experiment 代码开动和语法:大规模实验 2506.01846v2 -
456 07-25 Minimal Pair-Based Evaluation of Code-Switching Minimale Pair-basierte Auswertung von Code-Switching 对代码转换的最小对等评价 2506.01840v2 -
457 07-25 Summarization of Opinionated Political Documents with Varied Perspectives Zusammenfassung opinionierter politischer Dokumente mit unterschiedlichen Perspektiven 具有不同观点的有见解的政治文件概述 2411.04093v2 -
458 07-25 OneShield – the Next Generation of LLM Guardrails OneShield – die nächste Generation der LLM-Guardrails OneShild – – 下一代LLM护卫车 2507.21170v1 -
459 07-25 Data Caricatures: On the Representation of African American Language in Pretraining Corpora Daten Karikaturen: Zur Darstellung der afroamerikanischen Sprache im Vortraining Corpora 数据制图:关于非洲裔美国人语言在预科公司中的代表性 2503.10789v2 -
460 07-25 Opacity as Authority: Arbitrariness and the Preclusion of Contestation Opacity as Authority: Willkür und die Präklusion der Anfechtung 作为权力的不透明度:仲裁和排除争议 2507.22944v1 -
461 07-25 MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks MCIF: Multimodale Crosslingual Instruction-Following Benchmark aus wissenschaftlichen Vorträgen MCIF: 科学会谈的多模式跨语言教学基准 2507.19634v1 -
462 07-25 LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v3 -
463 07-25 HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track HITSZs End-to-End-Sprachübersetzungssysteme zur Kombination von Sequenz-zu-Sequenz-Auto-Spracherkennungsmodell und indic Large Language Model für IWSLT 2025 in Indic Track HITSZ的端到端语音翻译系统,将序列到序列自动语音识别模型和2025 IWSLT Indic Track IWSLT 2025 的指数式大语言模型结合起来 2507.19616v1 -
464 07-25 MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? MOCHA: Sind Code-Sprachenmodelle gegen multi-Turn bösartige Coding-Prompts robust? MOCHA:守则语言模型是否强力打击多发恶意编码的提示? 2507.19598v1 -
465 07-25 Efficient Attention Mechanisms for Large Language Models: A Survey Effiziente Aufmerksamkeitsmechanismen für große Sprachmodelle: Eine Umfrage 高效率关注大语言模式机制:调查 2507.19595v1 -
466 07-25 Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning Geospatielles Wissen abmildern Halluzination in großen Sprachmodellen: Benchmarking und Dynamische Faktizität Ausrichtung 减轻大语言模式中的地理空间知识幻觉:基准和动态事实对齐 2507.19586v1 -
467 07-25 MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents MMBench-GUI: Hierarchischer Mehrplattform-Evaluierungsrahmen für GUI-Agenten MMMBench-GUI:图形用户界面代理器的等级多平台评价框架 2507.19478v1 -
468 07-25 Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts Weiterentwicklung der Event-Prognose durch massives Training von großen Sprachmodellen: Herausforderungen, Lösungen und breitere Auswirkungen 通过大规模培训大语言模式:挑战、解决办法和更广泛影响 2507.19477v1 -
469 07-25 Long-Form Answers to Visual Questions from Blind and Low Vision People Langform-Antworten auf visuelle Fragen von Blinden und Sehbehinderten 对盲人和低视力者视觉问题的长期答复 2408.06303v2 -
470 07-25 Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models Gespräche sind schief gegangen, aber dann? Evaluieren von Gesprächsvorhersagemodellen 对话消失,但后来呢?评价对话预测模型 2507.19470v1 -
471 07-25 RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale RADLADS: Schnelle Aufmerksamkeitsdestillation zu linearen Aufmerksamkeitsdecodern auf Scale RADLADS: 缩放线性引引代码的快速注意蒸馏 2505.03005v3 -
472 07-25 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen GEPA: 反思即时进化能够超过成绩的强化学习 2507.19457v1 -
473 07-25 A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies Ein Diagramm-Review-Prozess unterstützt durch natürliche Sprachverarbeitung und Multi-Wave adaptive Sampling zur Beschleunigung der Validierung von Code-basierten Algorithmen für große Datenbankstudien 借助自然语言处理和多波适应性取样的图表审查过程,以加快大型数据库研究代码算法的验证工作 2507.22943v1 -
474 07-25 Distillation Scaling Laws Destillationsskalierungsgesetze 强化法律 2502.08606v2 -
475 07-25 TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability TokenSmith: Verstärkte Datenbearbeitung, Suche und Inspektion für großformatige Sprachmodellschulungen und -dolmetschbarkeit TokenSmitth:简化数据编辑、搜索和检查,以进行大型语文模式培训和解释 2507.19419v1 -
476 07-25 Towards Domain Specification of Embedding Models in Medicine Auf dem Weg zur Domain-Spezifikation von Einbettungsmodellen in die Medizin 走向医学嵌入模型的域域指定 2507.19407v1 -
477 07-25 CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback 代码化:通过混合和循环反馈对以代码为中心的数据进行互动驱动合成 2507.22080v1 -
478 07-25 Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question Vielfältige LLMs oder unterschiedliche Frageinterpretationen? Das ist die Assembling-Frage 不同的LLMs或不同的问题解释? 2507.21168v1 -
479 07-25 Data Augmentation for Spoken Grammatical Error Correction Datenvergrößerung für gesprochene Grammatical Error Correction 语音语法错误校正的数据增强 2507.19374v1 -
480 07-25 LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences LOTUS: Ein Leaderboard für detaillierte Bildunterschriften von Qualität zu gesellschaftlichen Bias und Benutzereinstellungen LOTUS: 从质量到社会偏见和用户首选的详细图像描述领导板 2507.19362v1 -
481 07-25 SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models SpeechIQ: Sprachintelligenz Quotient über kognitive Ebenen im Sprachverständnis von großen Sprachmodellen 语音理解大语言模式中不同认知层次的语音情报引号 2507.19361v1 -
482 07-25 SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model SALM-Duplex: Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell SALM-Duplex:语音对语音语言模式的高效和直接双重模式 2505.15670v4 -
483 07-25 Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization Verbesserung der Sprach-Emotions-Erkennung Auslevering Aligning Timestamps von ASR-Transkriptionen und Sprecher-Diarisierung 利用ASR记录稿和议长对称的调和时标 2507.19356v1 -
484 07-25 DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog DocrAgentor-RL:多轮临床对话多机构合作强化学习系统 2505.19630v2 -
485 07-25 Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks Smooth Reading: Die Lücke von LLM zur Selbstaufmerksamkeit von LLM bei langen Kontextaufgaben überbrücken 平滑阅读:弥合经常LLM与长期任务自用LLM之间的差距 2507.19353v1 -
486 07-25 Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation Externes Wissen in den vernünftigen Prozess zu spritzen verbessert die retrieval-angereicherte Generation 将外部知识注入说明过程,加强检索-提款一代 2507.19333v1 -
487 07-25 References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation Referenzen Materie: Untersuchung der Auswirkungen von Referenzsatzvariationen auf die Bewertung der Zusammenfassung 参考参考物质:调查参照标准差异对总结评价的影响 2506.14335v2 -
488 07-25 AutoPCR: Automated Phenotype Concept Recognition by Prompting AutoPCR: Automatisierte Erkennung von Phänomenen durch Prompting 自动PCR:通过提示自动地识别基因型概念 2507.19315v1 -
489 07-25 The Eloquence team submission for task 1 of MLC-SLM challenge Die Eloquence-Team-Einreichung für die Aufgabe 1 der MLC-SLM-Herausforderung 刚果解运-解运挑战任务1的评分小组提交 2507.19308v1 -
490 07-25 Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns Identifizierung feinkörniger Formen des Populismus im politischen Diskurs: Eine Fallstudie zu Donald Trumps Präsidentschaftswahlen 确定政治讨论中精美的民粹主义形式:关于唐纳德·特朗普总统运动的个案研究 2507.19303v1 -
491 07-25 A Markov Categorical Framework for Language Modeling Ein kategorisches Markov-Rahmenwerk für Sprachmodellierung 用于语言建模的 Markov 语言建模分类框架 2507.19247v1 -
492 07-25 Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation Jailbreaking Large Language Diffusion Models: Enthüllen versteckter Sicherheitsfehler bei der Diffusion-basierten Textgenerierung 大语言传播模式:在以传播为基础的文本生成中披露隐藏的安全条 2507.19227v1 -
493 07-25 How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework Wie viel Cheat bei der Evaluation eines großen Sprachmodells? Benchmarking-Überschätzung im Rahmen des One-Time-Pad-basierten Frameworks 大语言模式在评价方面有多大的热量? 以单一时间为基础的框架为高估基准 2507.19219v1 -
494 07-25 3LM: Bridging Arabic, STEM, and Code through Benchmarking 3LM: Arabisch, MINT und Code durch Benchmarking überbrücken 3LM:通过基准确定连接阿拉伯语、STEM和代码 2507.15850v3 -
495 07-25 SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology SigBERT: Kombination narrativer medizinischer Berichte und rough Path Signature Theory zur Einschätzung des Überlebensrisikos in der Onkologie SigBERT: 将叙述性医疗报告与肿瘤学生存风险估算的粗路签名理论相结合 2507.22941v1 -
496 07-25 Towards Multimodal Social Conversations with Robots: Using Vision-Language Models Auf dem Weg zu multimodalen sozialen Gesprächen mit Robotern: Mit Vision-Sprachen-Modellen 走向与机器人的多模式社会对话:使用视觉语言模型 2507.19196v1 -
497 07-25 Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models? Kann Small-Scale-Datenvergiftung Dialect-Linked Biases in großen Sprachmodellen exazerbieren? 在大语言模型中,小范围数据中毒加剧分解链接的分界线能否成为大语言模型? 2507.19195v1 -
498 07-25 Natural Language Processing for Tigrinya: Current State and Future Directions Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen 提格里尼亚的自然语言处理:现状和未来方向 2507.17974v2 -
499 07-25 Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie 缩略图与锤子:GROPO 放大现有能力,SFT 替换 2507.10616v2 -
500 07-25 An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case Eine empirische Untersuchung der Geschlechterstereotypdarstellung in großen Sprachmodellen: Der italienische Fall 对大语言模式中性别陈规定型观念代表性的经验调查:意大利案例 2507.19156v1 -
501 07-25 Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings Beschleunigung multimodaler Großsprachenmodelle über Dynamic Visual-Token Exit und die Empirical Findings 通过动态直视退出和实证结论加速多模式大语言模型 2411.19628v2 -
502 07-25 Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes Vertrauenswürdige Begründung: Bewertung und Verbesserung der tatsächlichen Genauigkeit in LLM-Intermediate-Thought-Prozessen 值得信赖的理由:评估和加强LLM中级思考程序中的事实准确性 2507.22940v1 -
503 07-25 OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth? OS-MAP: Wie weit können Computer-verwendende Agenten in Breadth und Tiefe gehen? OS-MAP:计算机用户在面包和深度上能走多远? 2507.19132v1 -
504 07-25 Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen 通过强化学习,将LLMs的隐含多层结构提炼在“通过强化学习推理”中 2505.16142v3 -
505 07-25 Objectifying the Subjective: Cognitive Biases in Topic Interpretations Objektivierung des Subjektiven: Kognitive Biasen in thematischen Interpretationen 表示主观性: 专题解释中的认知性分界线 2507.19117v1 -
506 07-25 Relation Extraction with Instance-Adapted Predicate Descriptions Verhältnis-Extraktion mit instance-adapted Prädikat Beschreibungen 采掘与原创性预言性说明的关系 2503.17799v2 -
507 07-25 Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy Ensemble Debiasing Across Class und Sample Levels für eine gerechtere Genauigkeit 公平促进准确性 2503.05157v4 -
508 07-25 Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case Vergleich von Pipeline-, Sequenz-zu-Sequenz- und GPT-Modellen für die End-to-End-Relation-Extraktion: Experimente mit dem Einsatzfall der seltenen Krankheiten 管道、序列到序列和终端到终端关系提取GPT模型的比较:与罕见疾病使用案例的实验 2311.13729v3 -
509 07-25 Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation Destillieren eines kleinen Utility-Based Passage Selectors zur Verbesserung der Retrieval-Augmented Generation 蒸馏一个小型以公用事业为基础的通道选择器,以加强回收-提款一代 2507.19102v1 -
510 07-25 How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction? Wie wichtig ist Domain Specificity in Sprachmodellen und Instruction Finetuning für die biomedizinische Beziehungsextraktion? 在生物医学关系采掘的语言模式和教学教学调整中,域的具体特点有多重要? 2402.13470v2 -
511 07-25 JCAPT: A Joint Modeling Approach for CAPT JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT JCAPT: CAPT的联合示范方法 2506.19315v2 -
512 07-25 LLMs are Also Effective Embedding Models: An In-depth Overview LLMs sind auch effektive Einbettungsmodelle: Eine ausführliche Übersicht LLM项目也是有效的嵌入模型:深入概述 2412.12591v2 -
513 07-25 Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents Debating Truth: Debattieren-getriebene Behauptungsverifizierung mit mehreren Large Language Model Agents 讨论真相:由辩论驱动的与多语种示范语言代理核查索赔要求 2507.19090v1 -
514 07-25 Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement Arg-LlaDA: Argumentationszusammenfassung über Large Language Diffusion Models und Sufficiency-Aware Refinement ARG-LLADA:通过大语言传播模型和充足软件精炼进行参数汇总 2507.19081v1 -
515 07-25 Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny Re:Form – Reduzierung menschlicher Priore bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny Re:形式 – – 在可扩展的正式软件核查中减少人类前科,LLL女士:关于Dafny的初步研究 2507.16331v2 -
516 07-25 ToolACE: Winning the Points of LLM Function Calling ToolACE: Die Punkte des LLM-Funktionsaufrufs gewinnen 工具ACE:赢得LLLM函数调用点 2409.00920v2 -
517 07-25 GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness GOAT-SLM: Ein gesprochenes Sprachmodell mit paralinguistischem und Lautsprechercharakteristischem Bewusstsein GOAT-SLM:具有多语言语言和议长特点意识的口语模式 2507.18119v2 -
518 07-25 XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare XAI4LLM. Lassen Sie Modelle für maschinelles Lernen und LLMs für verbessertes In-Context-Lernen im Gesundheitswesen zusammenarbeiten XAI4LLLM. 让机器学习模式和LLM合作促进保健领域加强内文学习 2405.06270v4 -
519 07-25 T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation T2ISafety: Benchmark für die Bewertung von Fairness, Toxizität und Datenschutz in der Bildgenerierung T2ISafetty:评估图像生成中的公平、毒性和隐私的基准 2501.12612v3 -
520 07-25 Closing the Modality Gap for Mixed Modality Search Schließen der Modalitätslücke für gemischte Modalitätssuche 缩小混合方式搜索模式差距 2507.19054v1 -
521 07-25 PARROT: An Open Multilingual Radiology Reports Dataset PARROT: Ein offener Mehrsprachiger Röntgenbericht Datensatz 开放多语种放射学报告数据集 2507.22939v1 -
522 07-25 FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems FD-Bench: Eine Full-Duplex-Benchmarking-Pipeline für volle Duplex-Gesprochene Dialogsysteme FD-Bench:为全双口孔对话系统设计的全自动基准管道 2507.19040v1 -
523 07-25 MLLM-based Speech Recognition: When and How is Multimodality Beneficial? MLLM-basierte Spracherkennung: Wann und wie ist Multimodalität vorteilhaft? 基于MLLM的语音识别:多式联运何时和如何受益? 2507.19037v1 -
524 07-25 A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents Ein Graph-basierter Ansatz für Multi-Modal-Fragebeantwortungen aus Flussdiagrammen in Telecom-Dokumenten 以图表为基础的电信文件流动图表多模式问题解答方法 2507.22938v1 -
525 07-25 Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems Akustisch präzises Hesitations-Tagging ist für End-to-End-Transkriptionssysteme unerlässlich 终端至终端逐字记录翻译系统至关重要的隐含精确言辞 2506.04076v2 -
526 07-25 Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations Töten Sie zwei Vögel mit einer Klappe: generalisierte und robuste KI-generierte Texterkennung durch dynamische Störungen 以一石一石杀死两鸟:通过动态扰动,普遍和有力地检测AI产生的文本 2504.21019v2 -
527 07-25 Advancing biomolecular understanding and design following human instructions Verbesserung des biomolekularen Verständnisses und Designs nach menschlichen Anweisungen 按照人类的指示,推动生物分子理解和设计 2410.07919v2 -
528 07-25 HIVMedQA: Benchmarking large language models for HIV medical decision support HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准 2507.18143v2 -
529 07-25 Verbalized Representation Learning for Interpretable Few-Shot Generalization Verbalisiertes Repräsentationslernen für verdolmetschbare wenige-heiße Verallgemeinerung 以口头方式进行代表性学习,为可口译的少或偏的普及化提供口译 2411.18651v2 -
530 07-25 Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation Bewertung von LLM-Fehlern, für die Personalisierte Disinformationsgenerierung missbräuchlich verwendet zu werden 评价LLMM 利用LLM 个人化信息生成不当利用他人造成个人化信息的脆弱性 2412.13666v2 -
531 07-25 CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering CoE-Ops: Zusammenarbeit von LLM-basierten Experten für AIOps Frage-Antwort 欧委会行动:以LLM为基础的专家协作处理AIOps问题 2507.22937v1 -
532 07-25 MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts MultiSocial: Mehrsprachiger Benchmark der maschinengenerierten Texterkennung von Social-Media-Texten 多社会多语言:社会-媒体文本机制文本检测多语言基准 2406.12549v2 -
533 07-25 A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation Eine Toolbox, kein Hammer – Multi-TAG: Skalierung der Mathematik mit Multi-Tool-Aggregation 一个工具箱, 不是锤锤 – – 多TAG: 使用多工具聚合的量性数学解释 2507.18973v1 -
534 07-25 Spike No More: Stabilizing the Pre-training of Large Language Models Spike No More: Stabilisierung der Vorausbildung großer Sprachmodelle Spike No No More: 稳定大语言模式培训前 2312.16903v4 -
535 07-25 A Similarity Measure for Comparing Conversational Dynamics Eine Ähnlichkeitsmessung für den Vergleich von Konversationsdynamiken 比较相互动态的相似性措施 2507.18956v1 -
536 07-25 MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model MedicalBERT: Verbesserung der biomedizinischen natürlichen Sprachverarbeitung mit vorgebildetem BERT-basiertem Modell 医学BERT:利用预先培训的BERT模式,加强生物医学自然语言处理 2507.08013v2 -
537 07-25 Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection Zusammenfassung des Rechtsdokuments: Verbesserung der richterlichen Effizienz durch Automatisierungserkennung 法律文件摘要:通过自动检测提高司法效率 2507.18952v1 -
538 07-25 Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics Adaptive Lernsysteme: Personalisierte Lehrplangestaltung mit LLM-Powered Analytics 适应性学习系统:利用LLM能动分析器的个人化课程设计 2507.18949v1 -
539 07-25 TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models TreeReader: Ein Hierarchischer Akademischer Papierleser Powered by Language Models 树形阅读器:一个按语言模式授权的等级学术论文阅读器 2507.18945v1 -
540 07-25 LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation LLaVA-NeuMT: Selektive Schicht-Neuron-Modulation für effiziente multimodale Mehrsprachigkeit LLAVA-NeUMT: 选择性多语层-Neuron 高效多语种多语种多模式翻译的调整 2507.18940v1 -
541 07-25 Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks Benchmarking des multimodalen Verständnisses und der komplexen Begründung für ESG-Aufgaben 确定环境组合组合任务多式联运理解和复杂理由的基准 2507.18932v1 -
542 07-25 Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen 种子-X:利用7B参数建立强有力的多语种翻译LLM 2507.13618v3 -
543 07-25 Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders Entdeckt Cross-Linguistic Disparities in LLMs mit Sparse Autoencodern 使用 Sparse 自动编码器在 LLM 中解封跨语言差异 2507.18918v1 -
544 07-25 Mining Contextualized Visual Associations from Images for Creativity Understanding Bergbau Kontextualisierte visuelle Assoziationen aus Bildern für Kreativität Verständnis 利用图像促进创造性理解的采矿背景化视觉协会 2507.18915v1 -
545 07-25 A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions Eine systematische Überprüfung der Systeme der wichtigsten retrieval-Augmented Generation (RAG): Fortschritt, Lücken und Zukunftsrichtungen 系统审查关键回收-养代(RAG)系统:进展、差距和未来方向 2507.18910v1 -
546 07-25 Large language models provide unsafe answers to patient-posed medical questions Große Sprachmodelle bieten unsichere Antworten auf patientenbezogene medizinische Fragen 大型语言模式为病人提出的医疗问题提供不安全的答案 2507.18905v1 -
547 07-25 SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models SLoW: Wählen Sie niederfrequente Wörter aus! Automatische Wörterbuchauswahl für Übersetzungen auf großen Sprachmodellen SLOW: 选择低频单词! 用于大语言模型翻译的自动词典选择 2507.18902v1 -
548 07-25 REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? REPRO-Bench: Können Agentische KI-Systeme die Reproduzierbarkeit der sozialwissenschaftlichen Forschung bewerten? REPRO-BENCH: AI系统能否评估社会科学研究的可减少性? 2507.18901v1 -
549 07-25 Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs Kann LLMs Citation Intent voraussagen? Eine experimentelle Analyse des In-Context-Lernens und Feinabstimmungens auf offenen LLMs LLMs 预测引文意图:对开放式LMs的内文学习和微调的实验分析 2502.14561v3 -
550 07-25 A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v4 -
551 07-25 NUTMEG: Separating Signal From Noise in Annotator Disagreement NUTMEG: Trennen von Signalen von Geräuschen in Annotator-Uneinigkeit NUTMEG: 在通知器中从噪音中分离信号 2507.18890v1 -
552 07-25 MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service MindFlow+: Ein selbstständiger Agent für den E-Commerce-Kundendienst Mind Flow+:电子商务客户服务自我发展代理 2507.18884v1 -
553 07-25 An Investigation of Prompt Variations for Zero-shot LLM-based Rankers Eine Untersuchung von Prompt-Variationen für Null-Schuss LLM-basierte Ranker 调查零射中LLM中士的迅速变化情况 2406.14117v4 -
554 07-25 Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction Phoneme-Level Visuelle Spracherkennung über Point-Visual Fusion und Sprachmodellsanierung 通过点-视点融合和语言模式重建确认电话级视觉讲话 2507.18863v1 -
555 07-25 PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning PrismRAG: Steigerung der RAG-Faktizität mit Distraktorresilienz und geschichteter Vernunft PrismRAG:提高RAG事实质量,使其具有抗力和策略性合理性 2507.18857v1 -
556 07-24 (4) The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming Der Kuriose Fall der Klasse Genauigkeit Ungleichgewicht in LLMs: Post-hoc-Debiasing über nichtlineare Integer-Programmierung LLMLM中分类准确性不平衡的怪案:通过非线性整数编程进行热后脱偏性 2405.07623v7 -
557 07-24 R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning R-Stitch: Dynamische Trajektorien-Stitching für effiziente Vernunft R-Stitch: 高效理性的动态轨迹切换 2507.17307v2 -
558 07-24 Toward Super Agent System with Hybrid AI Routers Auf dem Weg zum Super Agent System mit Hybrid-KI Routern 向超级代理系统过渡 2504.10519v2 -
559 07-24 CueBuddy: helping non-native English speakers navigate English-centric STEM education CueBuddy: Hilfe für nicht-native englische Referenten navigieren Englisch-centric STEM Bildung CueBuddy:帮助非母语英语者掌握以英语为中心的STEM教育 2507.18827v1 -
560 07-24 Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models Promptomatix: Ein automatisches Optimierungs-Framework für große Sprachmodelle 即时表达式:大语言模型自动快速优化框架 2507.14241v3 -
561 07-24 Analyze Feature Flow to Enhance Interpretation and Steering in Language Models Feature Flow analysieren, um Interpretation und Steuerung in Sprachmodellen zu verbessern 分析地貌流动,以加强语言模型的口译和指导 2502.03032v3 -
562 07-24 Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs Palme: Ein kulturell inklusiver und sprachlich vielfältiger Datensatz für arabische LLMs 棕榈:阿拉伯文LLMLM具有文化包容性和语言多样性的数据集 2503.00151v2 -
563 07-24 Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle 速度计划: 遮蔽传播语言模型的饱和日程安排 2506.19037v3 -
564 07-24 Evaluating Code-Mixing in LLMs Across 18 Languages Bewertung von Code-Mixing in LLMs in 18 Sprachen 评估18种语言的LLMs混合编码 2507.18791v1 -
565 07-24 Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis Bewertung großer Sprachmodelle (LLMs) in Financial NLP: Eine vergleichende Studie zur Analyse von Finanzberichten 评价金融中大语言模型:财务报告分析比较研究 2507.22936v1 -
566 07-24 A Fisher’s exact test justification of the TF-IDF term-weighting scheme Genaue Begründung des TF-IDF-Term-Wichtungssystems durch einen Fisher A Fisher公司对TF-IDF术语加权办法的精确测试理由 2507.15742v2 -
567 07-24 ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting ylmmcl bei Mehrsprachiger Textentgiftung 2025: Lexikon-geführte Entgiftung und Klassifikator-gestrichenes Umschreiben 2025年多语言文本解毒:Lexicon-Guid解毒和分类法改写 2507.18769v1 -
568 07-24 Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience Auf dem Weg zu strukturiertem Wissen Reasoning: Kontrastive retrieval-erweiterte Generation auf Erfahrung 实现结构化知识理由:反向取回-积累经验的一代人 2506.00842v2 -
569 07-24 The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages Die Rolle der Orthografiekonsistenz in mehrsprachigen Einbettungsmodellen für die Textklassifizierung in Arabisch-Script-Sprachen 阿拉伯文和克里普特语文文本分类多语种嵌入模型中正统一致性的作用 2507.18762v1 -
570 07-24 Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition Lärm Kontrastive Schätzung-basiertes Matching Framework für die Erkennung von Low-Resource-Sicherheitsangriffen 低资源安保攻击模式识别比对框架 2401.10337v4 -
571 07-24 Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement Spezifikation Selbst-Korrektion: Eindämmung von In-Context-Belohnung Hacken durch Test-Zeit-Verfeinerung 规格自我校正:通过试验-时间精炼进行减速的背负冲洗 2507.18742v1 -
572 07-24 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v3 -
573 07-24 An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning Effizientes Sparse-Fine-Tuning mit geringem Quantisierungsfehler über Neural Network Pruning 通过神经网络节制低量错误的高效粗简精细调整 2502.11439v2 -
574 07-24 Checklists Are Better Than Reward Models For Aligning Language Models Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen 核对列表比奖励模型更好调整语言模型 2507.18624v1 -
575 07-24 TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen TRPropt: 从文本奖励中促进解答询问软件快速优化 2507.18618v1 -
576 07-24 SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift 合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明 2507.18616v1 -
577 07-24 BEARCUBS: A benchmark for computer-using web agents BEARCUBS: Benchmark für computergestützte Web-Agenten BEARCUBS:计算机使用网络代理器的基准 2503.07919v3 -
578 07-24 Trusted Knowledge Extraction for Operations and Maintenance Intelligence Vertrauenswürdige Wissensgewinnung für Operationen und Wartungsintelligenz 行动和维持情报可信赖的知识采掘 2507.22935v1 -
579 07-24 Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs 粗略的登录抽样:加速在LLMs中进行知识蒸馏 2503.16870v2 -
580 07-24 Deep Learning Approaches for Multimodal Intent Recognition: A Survey Deep Learning Ansätze zur multimodalen Intent-Erkennung: Eine Umfrage 多种形式本能识别的深学习方法:调查 2507.22934v1 -
581 07-24 What Makes You CLIC: Detection of Croatian Clickbait Headlines Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen 是什么让你成为CLIC:发现克罗地亚点击头条头条 2507.14314v2 -
582 07-24 AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs Anilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用 2507.18584v1 -
583 07-24 DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索 2507.18583v1 -
584 07-24 System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung 供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告 2507.18580v1 -
585 07-24 P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v3 -
586 07-24 Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs 宽放, 窄出: 为高效和有效DLLMs而可撤销的解码 2507.18578v1 -
587 07-24 LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架 2507.16809v2 -
588 07-24 PosterMate: Audience-driven Collaborative Persona Agents for Poster Design PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design PosterMate:由观众驱动的海报设计合作人员代理 2507.18572v1 -
589 07-24 Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden 使用字节对等编码和K-MER方法的DNA语言模型混合化战略 2507.18570v1 -
590 07-24 GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung GIIFT: 图表制导感性不含图像的无图像多式机器翻译 2507.18562v1 -
591 07-24 Identity-related Speech Suppression in Generative AI Content Moderation Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation 在产生AI 内容调节中禁止与身份有关的言语 2409.13725v3 -
592 07-24 Augmented Vision-Language Models: A Systematic Review Augmented Vision-Language Models: Eine systematische Bewertung 增强愿景-语言模型:系统审查 2507.22933v1 -
593 07-24 FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification FinMarBa: Ein marktinformierter Datensatz für die Einstufung von Finanzsentimenten FinMarba:用于金融敏感度分类的市场化数据集 2507.22932v1 -
594 07-24 LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind LagKV: KV 缓存告诉哪个 Tokens 重要, 而 KV 缓存的拉格- 相对信息Name 2504.04704v2 -
595 07-24 GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle GLINER2:具有Schema-Driven界面的高效多任务信息提取系统 2507.18546v1 -
596 07-24 Effective Multi-Task Learning for Biomedical Named Entity Recognition Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung 有效多任务学习促进生物医学命名实体的识别 2507.18542v1 -
597 07-24 The Moral Gap of Large Language Models Die moralische Kluft großer Sprachmodelle 大语言模式的道德差距 2507.18523v1 -
598 07-24 GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Charaktergleichheitsnetzwerke 海合会-Spam:通过全球大气监测网、反竞争学习和特征相似网络探测垃圾邮件 2507.14679v2 -
599 07-24 Exploiting individual differences to bootstrap communication Nutzung individueller Unterschiede zur Bootstrap-Kommunikation 利用个人差异进行靴套通信 2504.05211v2 -
600 07-24 Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen 并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习 2507.18504v1 -
601 07-24 LLM-based Embedders for Prior Case Retrieval LLM-basierte Embedders für frühere Fallwiederherstellung 用于先前个案检索的LLM 以LLM为基础的嵌入器 2507.18455v1 -
602 07-24 Generation of Synthetic Clinical Text: A Systematic Review Generieren von synthetischem klinischem Text: Ein systematischer Test 合成临床文本的生成:系统审查 2507.18451v1 -
603 07-24 Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language Wiederherstellung des Rhythmus: Pünktlichkeitsrestaurierung mit Transformer-Modellen für Bangla, eine Sprache mit geringer Ressource 恢复时速:使用孟加拉国低资源语言 “ 孟加拉 “ 变压器模型恢复脉冲 2507.18448v1 -
604 07-24 AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten 阿拉伯表格:按基准确定LLM女士对阿拉伯表格数据的理由和理解 2507.18442v1 -
605 07-24 IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung ICPCGRL: 程序生成阶段语言教学强化学习 2503.12358v4 -
606 07-24 DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten DFAME: 与多式联运专家进行动态证据法检查 2412.10510v4 -
607 07-24 How do language models learn facts? Dynamics, curricula and hallucinations Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen 语言模式如何了解事实?动态、课程和幻觉 2503.21676v2 -
608 07-24 FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度 2507.18417v1 -
609 07-24 ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models Explica: Explizite kausale Vernunft in großen Sprachmodellen bewerten ExpliCa:在大语言模型中评估明确的原因原因 2502.15487v3 -
610 07-24 Enhancing RAG Efficiency with Adaptive Context Compression Steigerung der RAG-Effizienz durch adaptive Kontextkompression 提高RAG效率,同时采取适应性环境压缩措施 2507.22931v1 -
611 07-24 Factual Inconsistencies in Multilingual Wikipedia Tables Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen 多语言维基百科表格中的事实不一致 2507.18406v1 -
612 07-24 CLEAR: Error Analysis via LLM-as-a-Judge Made Easy CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht CLLEAR:通过LLM-as-a法官进行错误分析 2507.18392v1 -
613 07-24 Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v2 -
614 07-24 Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs 超越简介:从地平面事实到深人模拟LLMM 2502.12988v3 -
615 07-24 Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection Schutz gefährdeter Stimmen: Synthetische Datensatzgenerierung zur Selbstdetektion 保护弱势声音:为自我披露检测合成数据集生成 2507.22930v1 -
616 07-24 Mechanistic Indicators of Understanding in Large Language Models Mechanistische Indikatoren des Verstehens in großen Sprachmodellen 大语言模型中理解力的机械指标 2507.08017v3 -
617 07-24 Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz 宣传探测混合说明:将LLM预告与人类情报相结合 2507.18343v1 -
618 07-24 TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习 2507.18340v1 -
619 07-24 Uncertainty Quantification for Evaluating Machine Translation Bias Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias 评价机器翻译偏见的不确定性定量 2507.18338v1 -
620 07-24 EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow EH-Benchmark Ophthalmische Halluzination Benchmark und Agent-getriebene Top-Down-Rückverfolgbarkeit Workflow EH-Benchmark Ophthalmic 幻觉基准和代理Dripreven 顶底可追踪合理理由工作流程 2507.22929v1 -
621 07-24 A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v2 -
622 07-24 BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit BadReasoner: Pflanzung Tunable Überdenken Hintertüren zu großen Grundmodellen für Spaß oder Gewinn BadReasoner: 将金枪鱼可变性过度思考的后门规划成娱乐或利润的大理由模型 2507.18305v1 -
623 07-24 LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle LoRA-Leak:对LORA精调语言模式的成员推论攻击 2507.18302v1 -
624 07-24 DocTER: Evaluating Document-based Knowledge Editing DocTER: Dokumentbasierte Wissensbearbeitung bewerten 评价基于文件的知识编辑 2308.09954v2 -
625 07-24 Step-Audio 2 Technical Report Schritt-Audio 2 Technischer Bericht 技术报告 2507.16632v2 -
626 07-24 VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集 2407.19795v2 -
627 07-24 StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer StyleAdaptedLM: Weiterentwicklung der Anleitung nach Modellen mit effizienter Stylistik-Übertragung StypeAddapedLM:按照高效立体转让模式加强教学 2507.18294v1 -
628 07-24 How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding Wie denkt die Kette des Denkens? Mechanistische Interpretierbarkeit von Chain-of-Thought-Reasoning mit Sparse Autoencoding 思维链思维链是如何思考的? 2507.22928v1 -
629 07-24 Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil 低资源语言的准确性:僧伽罗语和泰米尔语比较分析 2507.18264v1 -
630 07-24 Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen 目的和重点:加强语言语言模式术语翻译 2507.18263v1 -
631 07-24 Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning Multimodale Verhaltensmusteranalyse mit Eye-Tracking und LLM-basierter Vernunft 以眼跟踪和基于LLM的理由进行多模式行为模式分析 2507.18252v1 -
632 07-24 Meta Prompting for AI Systems Meta Prompting für KI-Systeme AI 系统的模拟模拟 2311.11482v8 -
633 07-24 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐 2507.18212v1 -
634 07-24 Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v2 -
635 07-24 Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen 探讨指导指导对LLM对错误信息易感性的影响 2507.18203v1 -
636 07-24 Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung 使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法 2507.18202v1 -
637 07-24 Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation 将符合ISO30401的知识管理系统纳入一个组织的现有业务流程 2507.18197v1 -
638 07-24 SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models ANWENDUNGSBEREICH: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle SCOPE:评估大语言模式的施虐和反偏见选择安置 2507.18182v1 -
639 07-24 Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen 坚持平均值:在文本嵌入模型中检测粘力 2507.18171v1 -
640 07-24 Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR 最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾 2507.18161v1 -
641 07-24 A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven 事件原因识别调查:分类、挑战、评估和前景 2411.10371v5 -
642 07-24 Large Language Models in Argument Mining: A Survey Große Sprachmodelle im Argumentbergbau: Eine Umfrage 争议采矿大语言模型:调查 2506.16383v4 -
643 07-24 Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle 争取更大程度的利用:提高有效混合专家语言模式法的规模 2507.17702v2 -
644 07-24 MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning Mathopeval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning MathOPEval:数学理由中MLLMs视觉操作精美评价基准 2507.18140v1 -
645 07-24 OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v4 -
646 07-24 Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes Aktive Bewertung und Erlernen der Unterscheidungen, die wichtig sind: Vakzin-Sicherheitssignalerkennung aus Not-Triage-Notizen 积极评价和学习重要的区别:疫苗安全信号从紧急分级记录中探测到的疫苗安全信号 2507.18123v1 -
647 07-24 When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen 当自治时,罗格:准备应对社会系统中多机构串通的风险 2507.14660v2 -
648 07-24 Agentic AI framework for End-to-End Medical Data Inference Agentische KI-Framework für Ende-zu-Ende medizinische Datenableitung 最终至最终医疗数据推断的AA AA 框架框架 2507.18115v1 -
649 07-24 A New Pair of GloVes Ein neues Paar GloVes 新的地球之对 2507.18103v1 -
650 07-24 Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch 长短距离远距神经神经网络和改进课程学习,以在对话中认识情感 2507.15205v2 -
651 07-24 ELITE: Enhanced Language-Image Toxicity Evaluation for Safety ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit ELITE:加强语言-图像安全毒性评价 2502.04757v3 -
652 07-24 Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen 大语言模式统一调整和统一调整适用:在资源限制下的方法和基准 2507.18076v1 -
653 07-24 BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference BlockDialekt: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz BlockDiaect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v5 -
654 07-24 TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien TELEVAL:为中文互动假想中的口语模式设计的一个动态基准 2507.18061v1 -
655 07-24 Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias 《LLMM中因果测试性别偏见:职业偏见案例研究》 2212.10678v4 -
656 07-24 A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle 评估由大语言模型生成的合成数据多面评价框架 2404.14445v2 -
657 07-24 Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs 使用LLMMs以多种写作风格生成的隐私-保护合成审查 2507.18055v1 -
658 07-24 From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen 从假设到出版物:AI-Driven研究支助系统综合调查 2503.01424v3 -
659 07-24 RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models EINGEDENK: Ein ungebundener Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle 回顾:对大型愿景-语言模型的无约束资源消费攻击 2507.18053v1 -
660 07-24 Segmentation-free Goodness of Pronunciation Segmentierungsfreie Güte der Aussprache 读音良好 2507.16838v2 -
661 07-24 Synthetic Data Generation for Phrase Break Prediction with Large Language Model Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell 制作用于大语言模范大语言时段间断预测的合成数据 2507.18044v1 -
662 07-24 GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs GrAInS:LLMs和VLMs的推论时间指导的逐步归属 2507.18043v1 -
663 07-24 AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark AIR-Bench:自动异源信息检索基准 2412.13102v4 -
664 07-24 NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database NeuralDB: Skalierung von Wissen in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank NeuralDDB: 将知识编辑在LLM 中到 100,000 千兆瓦的Neural KV 数据库中 2507.18028v1 -
665 07-24 GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen GRR-CoCa:在多模式建模中利用LLM机制 2507.18009v1
Article 0
Title@2025-07-31 (4): Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Title: Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities | Cascaded Information Disclosure for Generalized Evaluation of Problem Lösing Capabilities | 用于对解决问题能力通用评价的连锁信息披露 2507.23776v1 |
Authors (3): Yunxiang Yan, Tomohiro Sawada, Kartik Goyal
While question-answering~(QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on \emph{cascaded question disclosure} that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.
问题解答-(QA)基准性能是比较LLMS的自动和可扩缩的方法,但它是一种间接的方法,用来评价其解决问题的基本能力,因此,我们提议了一个基于\emph{cassaed question disability}的全面和可概括化的框架,在保持可缩放性和自动化的同时,对模型解决问题的能力作出更准确的估计。这个方法以分阶段的方式收集模型答复,每个阶段都揭示了旨在引致LLMS普遍推理的问题的部分信息。我们发现,我们的方法不仅改善了LMS的比较,而且与标准的QA范式相比,在模型中产生了更好的中间痕迹。我们通过对不同大小和家庭的LMS进行不同的比较,对不同的推理和知识重QA数据集进行了实验性地核实。我们的方法缩小了标准QA评价环境中观察到的业绩差距,表明普遍的间接QA评价范式高估了各种模型的绩效差异。我们通过广泛的反差研究进一步证实了我们的调查结果。
Article 1
Title@2025-07-31 (4): SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
Title: SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model | SimuRA: Auf dem Weg zu einem General Goal-Oriented Agent über Simulative Reasoning Architecture mit LLM-basiertem Weltmodell | SimurRA:通过使用以LLM为基础的世界模型的模拟合理理由结构,努力实现以一般目标为导向的代理 2507.23773v1 |
Authors (7): Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig, Zhiting Hu, Eric Xing
AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of optimal agent in any environment, \modelname overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that \modelname improves the success of flight search from 0\% to 32.2\%. World-model-based planning, in particular, shows consistent advantage of up to 124\% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make SimuRA, a web-browsing agent built on \modelname with pretrained LLMs, available as a research demo for public testing.
以大型语言模型(LLMs)为基础的AI代理商有着巨大的希望,但目前的做法侧重于一号任务一号试剂方法,不仅不能达到可缩放性和普遍性,而且还受到自动递减性LMs的根本限制。另一方面,人类是一般的代理商,其原因是在精神上模拟其行动和计划的结果。向更普遍和强大的AI代理商的方向发展,我们引入了Simura,这是一个面向普遍代理推理的面向目标的结构。基于任何环境中最佳代理商的原则性配方, 模型名通过模拟引入世界规划模式克服了自动递增推理的局限性。通用世界模型模型使用LLMM来实施,它可以在广泛的环境中灵活规划,使用丰富的自然语言概念潜伏空间。在困难的网络浏览任务上进行的实验表明,SimuRA将改进飞行搜索的成功程度,从0到32.2。 特别是基于世界模型的规划,显示在自动递增性规划方面达到124的优势,通过模拟采用世界模型模型进行规划,展示世界模型模型模型的优势,在单一的LMsimimal 上,我们可以进行基础的测试。
Article 2
Title@2025-07-31 (4): Perception-Aware Policy Optimization for Multimodal Reasoning
Title: Perception-Aware Policy Optimization for Multimodal Reasoning | Perception-Aware Policy Optimization für multimodale Reasoning | 对多式联运理由的观念-认知软件政策优化 2507.06448v3 |
Authors (11): Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
以可验证的奖励(RLVR)加强学习已证明是运用大型语言模型(LLMS)的高度有效战略,具有强大的多步推理能力。然而,其设计和优化仍然适合纯文本域,导致在应用多式推理任务时表现不优于最优性。特别是,我们发现,当前多式联运推理中的一个主要错误来源在于对视觉投入的认识。为了解决这一瓶颈问题,我们建议PAPO, 一种鼓励模型在学习时学会理解的新型政策梯度演算法。具体地说,我们以KL差异术语的形式引入隐含的感知损失:可以无缝地插入RLVR的算法,例如GROPO和DAPO。 值得注意的是,PAPO并不依赖额外的数据曲线、奖赏模型或更强的教师模型。为了进一步加强PAPO的培训稳定性,我们引入了双环流损失,从而在不损害业绩的情况下有效地规范新的KL目标。尽管它简单,但PAPO的推理学基础框架在总体上显著改进了4.4%-17.5 % ,在多种多式联运认识30度基准下,我们更精确地减少了一个更高的实验目标。
Article 3
Title@2025-07-31 (4): CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Title: CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks | CoT-Self-Instruct: Aufbau hochwertiger synthetischer Aufforderungen zur Begründung und zu nicht-vernünftigen Aufgaben | COT-自学教学:为推理和非理由性任务建立高质量的合成提示 2507.23751v1 |
Authors (9): Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.
我们建议采用Cot-自制数据生成方法,即根据给定的种子任务,首先引导LLMS理性地通过Thought链(Cot)进行规划,然后生成一个质量和复杂性相似的合成新速度,供LLM培训使用,然后用自动指标过滤高质量数据。 在可核查的推理中,我们的合成数据大大优于现有的培训数据集,如在MATH500、AMC23、AME24和GPQA-Diamond之间,S1k和OpenMathReasoning。 对于不可核查的教学执行任务,我们的方法超过了在AlpacaEval2.0和Arena-Hard方面的人类或标准自制工具的性能。
Article 4
Title@2025-07-31 (4): Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs
Title: Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs | Regel2Text: Natürliche Sprache Erklärung der logischen Regeln in Wissensgraphen | 规则2案文:知识图中逻辑规则的自然语言解释 2507.23740v1 |
Authors (4): Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li
Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, including variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL}{https://github.com/idirlab/KGRule2NL.
确定逻辑规则不仅提高了知识图的完整性,而且能够发现潜在的错误,揭示了微妙的数据模式,并提高了总体推理和解释能力;然而,这些规则的复杂性,加上每个知识图独特的标签公约,可能使人类难以理解这些规则;在本文件中,我们探索大型语言模型产生逻辑规则自然语言解释的潜力;具体地说,我们利用基准数据集FB15k-237和两个大型数据集FB-CVT-REV和FB+CVT-REV。我们审查各种提示战略,包括零和几发提示性战略,包括不同实体类型和思维链推理。我们根据正确性、清晰性和幻觉,对产生的解释进行全面的人文评价,并评估大型语言模型作为自动法官的使用情况。我们的结果显示在解释正确性和清晰性方面有希望的业绩,尽管在http/httpsurity/G_BAR_G_G_BAR_RR)中,所有数据都用于未来研究。
Article 5
Title@2025-07-31 (4): How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment
Title: How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment | Wie KI-Ideen die Kreativität, Vielfalt und Evolution menschlicher Ideen beeinflussen: Beweise aus einem großen, dynamischen Experiment | AI Ideas如何影响人类思想的创造性、多样性和演变:大规模动态实验的证据 2401.13481v3 |
Authors (5): Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, Eric Gilbert
Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted an experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as ‘AI’ (disclosure). Our dynamic experiment design – ideas from prior participants in an experimental condition are used as stimuli for future participants in the same experimental condition – speaks to the interdependent process of cultural creation: creative ideas are built upon prior ideas. Hence, we capture the compounding effects of having LLMs ‘in the culture loop’. We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.
对大语言模型输出的接触正在迅速增加。看到AI产生的想法将如何影响人类思想?我们进行了一个实验(800+参与者,40+国家),参与者观看了来自ChatGPT或先前实验参与者的创造性想法,然后集思广益。我们改变了AI产生的例子数量(没有、低或高暴露),如果这些例子被贴上“AI”(披露)标签,那么这些例子的数量也各不相同。我们的动态实验设计 – – 实验状态中的前参与者的想法被作为同一实验条件下未来参与者的刺激因素 – – 谈到文化创造的相互依存过程:创造性思想是建立在先前思想基础上的。因此,我们捕捉到将LLOMS“纳入文化循环”的复合效应。我们发现,AI的高度接触(但不低接触)并没有影响个人思想的创造力,而是增加了集体思想多样性的平均数量和变化率。AI做了不同,没有更好的解释。披露没有产生任何主要影响。我们发现,自我报告创造性的人受到了解AI思想的影响较小,但是参与者在任务困难的时候可以知情地接受AI。我们发现,集体发现,引入AI的想法可能增加个人创造力。
Article 6
Title@2025-07-31 (4): Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
Title: Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving | Seed-Prover: Tiefe und breite Begründung für automatisierte Theorem Proving | 种子文献:用于自动理论论证的深度和广度理由 2507.23726v1 |
Authors (36): Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu
LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50\% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
LLMS通过利用长期思维链的强化学习,展示了很强的数学推理能力,但是,由于在仅仅使用自然语言时缺乏明确的监督信号,它们仍然在与理论进行斗争。Lian等专门的域名语言通过正式核实证据提供明确的监督,通过强化学习进行有效培训。在这项工作中,我们建议采用一个脂素式的全度防盗推理模型\ textbf{Seed-Prover}。种子-Prover可以反复地改进其基于Lean反馈、经证明的Lemma和自我合成的证据。为了解决海事组织一级的竞争问题,我们设计了三个测试时间推论战略,既能进行深入又广泛的推理。种子-Prover证明过去海事组织问题正规化的78.1美元,能通过强化学习进行有效培训。我们在Putnambench上提出了超过50美分的全方位全方位推理模型。为了解决Lean的地理测量支持不足的问题,我们引入了一种几何推理机引擎{Seud-Gegraphy-Gegraphy
Article 7
Title@2025-07-31 (4): RecGPT Technical Report
Title: RecGPT Technical Report | Technischer Bericht des RecGPT | RecGPT 技术报告 2507.22879v2 |
Authors (54): Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Sunhao Dai, Wen Chen, Wenjun Yang, Yuning Jiang, Zhujin Gao, Bo Zheng, Chi Li, Dimin Wang, Dixuan Wang, Fan Li, Fan Zhang, Haibin Chen, Haozhuang Liu, Jialin Zhu, Jiamang Wang, Jiawei Wu, Jin Cui, Ju Huang, Kai Zhang, Kan Liu, Lang Tian, Liang Rao, Longbin Li, Lulu Zhao, Na He, Peiyang Wang, Qiqi Huang, Tao Luo, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Yang Li, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yinnan Song, Yuchen Li, Yujie Luo, Yujin Yuan, Yuliang Yan, Zhengyang Wang, Zhibo Xiao, Zhixin Ma, Zile Zhou, Ziqi Zhang
Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users’ evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, merchants and the platform gain greater exposure and conversions. These comprehensive improvement results across all stakeholders validates that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.
推荐系统是人工智能中影响最大的应用,是连接用户、商人和平台的关键基础设施;然而,目前大多数工业系统仍然严重依赖历史共同模式和逻辑调整目标,即优化以往用户互动,而没有明确模拟用户意图。这种逻辑调整方法往往导致过度适应狭隘的历史偏好,无法捕捉用户不断变化的潜在利益。结果,它强化过滤泡沫和长尾现象,最终损害用户经验,威胁整个建议生态系统的可持续性。为了应对这些挑战,我们重新思考推荐系统的总体设计模式,并提议一个下一代框架RecGPT,将用户的意向置于建议管道的中心。通过将大型语言模型(LLLMS)纳入用户兴趣挖掘、项目检索和解释生成的关键阶段,RecGPT将符合逻辑的建议转化为一个以意图为中心的过程。为了有效地将通用LMS与以上特定领域的全面建议任务相协调,REGPT包含一个多阶段培训模式,整合了推荐系统的总体设计模式,将用户的意向转换成一个更具有可持续性的、更清晰的升级的、更清晰的升级的日历和升级的流程。
Article 8
Title@2025-07-31 (4): Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length
Title: Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length | Nicht zu vergessen: Proaktive Interferenz offenbart Arbeitsspeichergrenzen in LLMs jenseits der Kontextlänge | 无法忘却: 事外长长的LLMM 中主动干扰流出工作内存限制 2506.08184v3 |
Authors (2): Chupei Wang, Jiaqiu Vince Sun
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs’ ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models’ ability to suppress irrelevant content during retrieval.
大语言模型(LLMs)中的信息检索日益被公认为与生成能力而不是仅仅看一看相交织在一起。虽然人们往往认为较长的环境可以改进检索,但文本内干扰的影响仍然没有得到充分研究。为了解决这个问题,我们调整了认知科学中的主动干预(PI)范式,早期信息扰乱了对更新更新的记忆的记忆。在人类中,对这种干扰的易感性与工作记忆能力有反向联系。我们引入了PI-LLM,这一评价按顺序流出与语义相关的关键价值更新和查询最后值。虽然这些最后值在查询之前的位置很明确,但LLM检索精度随着干扰的积累而将记录线性降低到零;错误产生于对先前重写价值的检索。试图通过即时工程(例如指示模型忽略早期输入)来减少干扰,但成效有限。这些结论揭示了LLMs解动干扰和灵活调控信息的能力的根本制约,意味着工作记忆瓶颈超出了仅仅上下文访问的范围。这要求采取一些方法,以加强模型在检索过程中抑制不相关内容的能力。
Article 9
Title@2025-07-31 (4): TextQuests: How Good are LLMs at Text-Based Video Games?
Title: TextQuests: How Good are LLMs at Text-Based Video Games? | TextQuests: Wie gut sind LLMs bei textbasierten Videospielen? | 文本Quests: 文本视频游戏的LLMs效果如何? 2507.23701v1 |
Authors (4): Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.
在反映现实世界挑战的复杂、互动环境中评价AI代理商对于了解其实际能力至关重要。虽然现有的代理商基准有效评估工具使用或结构化任务绩效等技能,但往往不能完全掌握该代理商在探索环境中自主运作的能力,这种探索环境需要长期和不断增长的持续、自我引导的推理。为了促进发展能够在长视野上进行更强有力的内在推理的代理商,我们引入了一个基于Infocom交互式虚拟游戏套件的基准“TextQuests ”。这些基于文本的冒险,它可以花费人类玩家30小时以上的时间,需要数百次精确的行动来解决,作为评价AI代理商重点突出、明确的任务的有效代理。该基准具体旨在评估LLM代理商通过排除使用外部工具自行解决问题的能力,从而侧重于在探索环境中的内在长文本推理能力,其特征是需要试用和学习并在单一互动会议上持续解决问题。我们在https://textquests.ai发布TextQuests。
Article 10
Title@2025-07-31 (4): TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses
Title: TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses | TweakLLM: Eine Routing-Architektur für dynamisches Tailoring von Cached Responses | TweakLLLM: 快速快速定制快速响应的运行结构 2507.23674v1 |
Authors (6): Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
大型语言模型(LLM)每天处理数以百万计的询问,使高效的响应为降低成本和延时提供了令人信服的优化,但是,由于聊天机互动的个性性质和语义相似性搜索的准确性有限,很难保持与用户查询的相关性,但是,为了解决这个问题,我们提出了TweakLLM,这是一个新型的路线结构,它使用轻量级LM来动态地调整缓存的响应速度。通过全面评价,包括用户研究,同时进行平行比较、满意投票以及多剂LLM辩论,我们证明TweakLLM保持了与前沿模型相似的反应质量,同时大大提高了缓存效率。我们跨越现实世界数据集的结果突出表明TweakLLM是高容量LM部署的可扩展性、资源高效的缓冲解决方案,同时又不损害用户的经验。
Article 11
Title@2025-07-31 (4): Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning
Title: Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning | Arabische Hass-Spracherkennung und Maskenbildung in sozialen Medien mit Deep-Learning-Modellen und vortrainierten Modellen Feinabstimmung | 利用深学习模式和预培训模式进行微调,在社会媒体中识别和遮掩阿拉伯仇恨言论 2507.23661v1 |
Authors (2): Salam Thabet Doghmash, Motaz Saad
Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. The meaning of cleaning here is replacing each bad word with stars based on the number of letters for each word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with masking the dirty text. The presented methods achieve the best model in hate speech detection with a 92\% Macro F1 score and 95\% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with the state of the art machine translation systems.
近年来,社交媒体中的仇恨言论识别已成为一个日益重要的问题。 在这项研究中,我们处理两个问题:(1) 检测阿拉伯文本中的仇恨言论,(2) 清除仇恨言论的文本。这里清洁的含义是根据每个字字母的数量用恒星替换每个坏字。关于第一个问题,我们用深学习模型和变压器进行若干实验,以确定F1分的最佳模式。关于第二个问题,我们认为这是一个机器翻译任务,输入的内容是含有脏字的句子,输出的内容是用脏字遮掩的相同句子。所介绍的方法在仇恨言论检测方面实现了最佳模式,用92 Mac F1分和95准确度。关于文字清理实验,仇恨言论遮盖模式的最佳结果在BLEU分中达到0.3分,为1克,与艺术机器翻译系统的状况相比,这是一个良好的结果。
Article 12
Title@2025-07-31 (4): DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Title: DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures | DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen | DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码 2507.08606v3 |
Authors (4): Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
我们引入了Doc PolarBERT, 这是一种具有布局意识的BERT文件理解模式,消除了绝对2D定位嵌入的需要。我们自我关注,以考虑到相对极地协调系统而不是笛卡尔协调系统中的文本块位置。尽管在数据集方面接受过比广泛使用的IT-CDIP系统小六倍多的预先培训,但Doc PolarBERT取得了最新的结果。这些结果表明,精心设计的注意机制可以弥补培训前数据的减少,为文件理解提供高效有效的替代方法。
Article 13
Title@2025-07-31 (4): Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation
Title: Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation | Wer ist wichtig? – SUnSET: Synergistisches Verständnis von Stakeholdern, Ereignissen und Zeit für die Timeline Generation | 谁重要? - SUNSET:对利益攸关方、事件和时间的协同理解,以产生时间表。 2507.21903v2 |
Authors (4): Tiviatis Sim, Kaiwen Yang, Shen Xin, Kenji Kawaguchi
As news reporting becomes increasingly global and decentralized online, tracking related events across multiple sources presents significant challenges. Existing news summarization methods typically utilizes Large Language Models and Graphical methods on article-based summaries. However, this is not effective since it only considers the textual content of similarly dated articles to understand the gist of the event. To counteract the lack of analysis on the parties involved, it is essential to come up with a novel framework to gauge the importance of stakeholders and the connection of related events through the relevant entities involved. Therefore, we present SUnSET: Synergistic Understanding of Stakeholder, Events and Time for the task of Timeline Summarization (TLS). We leverage powerful Large Language Models (LLMs) to build SET triplets and introduced the use of stakeholder-based ranking to construct a $Relevancy$ metric, which can be extended into general situations. Our experimental results outperform all prior baselines and emerged as the new State-of-the-Art, highlighting the impact of stakeholder information within news article.
由于新闻报道日益全球化和分散,追踪多个来源的相关事件带来了重大挑战。现有新闻汇总方法通常使用大语言模型和基于文章的摘要图形方法。然而,这没有效果,因为仅考虑类似日期文章的文字内容来理解事件要点。为了应对有关各方缺乏分析的问题,必须制定一个新框架,通过相关实体衡量利益攸关方的重要性和相关事件的联系。因此,我们介绍了SUNSET:对利益攸关方、事件和时间的协同理解,以完成时间线的总结(TLS)任务。我们利用强大的大语言模型(LLLMs)来建立SET三重数据,并引入基于利益攸关方的排名,以构建一个能推广到一般情况的$levelance$衡量标准。我们的实验结果超越了以往的所有基线,并成为新的国家艺术,突出了利益攸关方信息在新闻文章中的影响。
Article 14
Title@2025-07-31 (4): How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Title: How Can I Publish My LLM Benchmark Without Giving the True Answers Away? | Wie kann ich meinen LLM-Benchmark veröffentlichen, ohne die wahren Antworten wegzugeben? | 我怎样才能公布我的LLM基准而不给出正确的答案? 2505.18102v2 |
Authors (3): Takashi Ishida, Thanawat Lodkaew, Ikko Yamane
Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
在互联网上公布一个大型语言模型(LLM)基准可能会污染未来的LLM:基准可能是无意的(或有意的)用于培训或选择一个模型。一个共同的缓解措施是保持基准的隐私,让参与者向组织者提交模型或预测。然而,这一战略需要信任一个组织,并仍然允许通过反复询问过度测试。为了克服这一问题,我们建议一种公布基准的方法,而不完全披露对问题的地面真相答案,同时保持公开评估LLMS的能力。我们的主要想法是通过编制几个逻辑正确的答案来给答案注入随机性,并且只将其中之一作为基准的解决方案。这降低了基准的最佳准确性,即Bayes准确性。这不仅有助于我们不披露地面真相,而且这一方法也为探测数据污染提供了一个测试。原则上,即使完全有能力的模型也不应该超过Bayes的准确性。尽管有这一预期,但模型超过这一上限,这是数据污染的强烈信号。我们提出实验性证据,说明我们的方法能够准确检测数据污染的范围很广的基准、培训模型和方法。
Article 15
Title@2025-07-31 (4): Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation
Title: Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation | Splits! Ein flexibler Datensatz und Evaluationsrahmen für die soziokulturelle Linguistische Untersuchung | 社会文化语言调查灵活数据集和评价框架 2504.04640v2 |
Authors (3): Eylon Caplan, Tania Chakraborty, Dan Goldwasser
Variation in language use, shaped by speakers’ sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. However, the computational study of these Sociocultural Linguistic Phenomena (SLP) has often been limited to bespoke analyses of specific groups or topics, hindering the pace of scientific discovery. To address this, we introduce Splits!, a 9.7 million-post dataset from Reddit designed for systematic and flexible research. The dataset contains posts from over 53,000 users across 6 demographic groups, organized into 89 discussion topics to enable comparative analysis. We validate Splits! via self-identification and by successfully replicating several known SLPs from existing literature. We complement this dataset with a framework that leverages efficient retrieval methods to rapidly validate potential SLPs (PSLPs) by automatically evaluating whether a given hypothesis is supported by our data. Crucially, to distinguish between novel and obvious insights, the framework incorporates a human-validated measure of a hypothesis’s ``unexpectedness.’’ We demonstrate that the two-stage process reduces the number of statistically significant findings requiring manual inspection by a factor of 1.5-1.8x, streamlining the discovery of promising phenomena for further investigation.
语言使用的变化,由发言者的社会文化背景和使用的具体背景所决定,形成了语言使用的变化,为文化观点、价值观和观点提供了丰富的视角;然而,对这些社会文化语言特征(SLP)的计算研究往往仅限于对特定群体或专题进行简单分析,从而阻碍科学发现的速度。为此,我们引入了Slips!,这是Reddit为系统灵活研究设计的970万个数据集。数据集包含来自6个人口群体中53 000多个用户的职位,分为89个讨论专题,以便能够进行比较分析。我们通过自我识别和成功复制现有文献中若干已知的 SLPs,验证Slips!我们用一个框架来补充这一数据集,利用高效的检索方法快速验证潜在的SLPs(PLPs),方法是通过自动评估我们的数据是否支持一个给定的假设!关键是,为了区分新的和明显的洞察力,这个框架包含了一个人类有价值的假设“意外”的尺度。我们证明,两阶段进程减少了需要通过1.5.8个因素对具有前瞻性的统计意义的调查结果进行进一步精简。
Article 16
Title@2025-07-31 (4): ILID: Native Script Language Identification for Indian Languages
Title: ILID: Native Script Language Identification for Indian Languages | ILID: Native Script Language Identification für indische Sprachen | ILID:印第安人语言的土著脚本语言识别 2507.11832v2 |
Authors (2): Yash Ingle, Pruthwik Mishra
The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question and answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in case of diverse Indian languages that exhibit lexical and phonetic similarities, but have distinct differences. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences consisting of 23 languages including English and all 22 official Indian languages labeled with their language identifiers, where data in most languages are newly created. We also develop and release baseline models using state-of-the-art approaches in machine learning and fine-tuning pre-trained transformer models. Our models outperforms the state-of-the-art pre-trained transformer models for the language identification task. The dataset and the codes are available at https://yashingle-ai.github.io/ILID/ and in Huggingface open source libraries.
语言识别任务是国家语言平台中至关重要的基本步骤。 语言识别任务通常是广泛使用的国家语言平台应用程序的预处理步骤,如多语种机器翻译、信息检索、问答和文本总和。语言识别的核心挑战在于杂音、短语和代码混合环境中的区分语言。如果印度多种语言表现出词汇和语音相似,但有不同之处,这更加困难。许多印度语言有着相同的脚本,使得任务更具挑战性。考虑到所有这些挑战,我们编制并发布一套由23种语言组成的250K句数据集,其中包括英文和所有22种印度官方语言及其语言标识符号,其中大多数语言的数据都是新创建的。我们还在机器学习和微调前训练变异模型中,使用最先进的方法开发和发布基线模型。我们的模型超越了用于语言识别任务的、最先进的预培训变异模型。数据集和代码可在https://yashingle-ai.github.io/LIID/和Hugging 公开源库中查阅。
Article 17
Title@2025-07-31 (4): Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Title: Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates | Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Assessments | 具有不确定性估计值的临床试验的深入学习预测 2507.23607v1 |
Authors (4): Tien Huu Do, Antoine Masquelier, Nae Eoun Lee, Jonathan Crowther
Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
临床试验是评估新药物或新疗法的安全和效能的系统性努力。进行这种试验通常需要大量的资金投资和仔细规划,强调准确预测试验结果的必要性。准确预测病人入学是试验成功的一个关键因素,这是规划阶段的主要挑战之一。在这项工作中,我们提出一种新的深层次的基于学习的方法来应对这一重大挑战。我们作为一种神经网络模型采用的方法,利用预先训练的语言模型(PLM)来捕捉临床文件的复杂性和细微差别,将其转化为直观的表述。然后,这些演示与通过关注机制编码的表格特征相结合。为了说明入学预测的不确定性,我们根据伽马分布,用概率层加强模型,从而能够进行范围估计。我们采用拟议的模型来预测临床试验期限,假设现场一级的招生遵循Poisson-Gamma进程。我们对现实世界临床试验数据进行了广泛的实验,并表明拟议的方法可以有效地预测在一定的临床试验地点注册的病人人数,超过既定基线模型。
Article 18
Title@2025-07-31 (4): Inside-Out: Hidden Factual Knowledge in LLMs
Title: Inside-Out: Hidden Factual Knowledge in LLMs | Inside-Out: Verstecktes Sachwissen in LLMs | 内外:LLM中隐藏的事实知识 2503.15299v3 |
Authors (8): Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) put a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
这项工作提供了一个框架,用于评估大型语言模型(LLMS)是否在其参数中比其产出所表述的内容更加真实的知识。虽然有几项研究暗示了这种可能性,但没有一项研究明确定义或展示这种现象。我们首先提出知识的正式定义,将它量化给一个特定问题,即正确答案对对答的分数,正确答案的分数较高。这产生了外部和内部知识,这取决于对个别应答人进行评分所使用的信息:要么是模型的可观测象征性概率,要么是其中间计算。当内部知识超过外部知识时,就会出现隐藏知识。然后我们提出案例研究,在封闭式QA设置中将这一框架应用于三种受欢迎的开放重量LMS。我们的结果显示:(1) LMS在内部一贯地将事实知识纳入比外表表示的更多内容,平均相对差距为40%。(2) 令人惊讶的是,一些知识的深度隐藏到模型内部可以完全了解答案,但即使大规模地重复抽样抽样,也未能产生这种知识。这揭示了三个广受欢迎的开放性 LLMSMS的深度测试中的基本限制,因为我们不断的深度的抽样测试是难解的常规测试,因此仍然有相当的深度的精确的试测测。
Article 19
Title@2025-07-31 (4): DiffLoRA: Differential Low-Rank Adapters for Large Language Models
Title: DiffLoRA: Differential Low-Rank Adapters for Large Language Models | DiffLoRA: Differential-Low-Rank-Adapter für große Sprachmodelle | DiffLORA:用于大语言模型的差别型低兰克适应器 2507.23588v1 |
Authors (4): Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
最近有人提议采用差异变换器,通过取消音效的注意机制取消噪音,以提高变异器模型的性能。在这项工作中,我们引入了DiffLora,这是对差异注意机制的参数效率的调整,低级适应器具有正负两方面的注意。这种方法保持了Lora的效率,同时着眼于从不同注意的性能收益中获益。我们评估了DiffLora,执行了一系列广泛的国家劳工政策任务,包括一般基准、多发式的文字学习、RAG和长文测试。我们注意到,虽然DiffLora在大多数评价任务中都未达到其他参数效率的微调方法,但在某些领域(HenEval的LORA的+11 pts)取得了令人感兴趣的结果。我们分析了关注后调整模式,以确定这种行为的原因。
Article 20
Title@2025-07-31 (4): T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text
Title: T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text | T-Detect: Tail-Aware Statistische Normalisierung zur robusten Erkennung von maschinengeneriertem Text | T-检测:用于对反转机制文本进行强力探测的尾件软件统计标准化 2507.23577v1 |
Authors (6): Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng, Yue Zhang
The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9\% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Ours code are released at https://github.com/ResearAI/t-detect.
复杂的文本生成模型的扩散要求开发能够识别机器生成内容的可靠检测方法,特别是旨在逃避通过对抗性直角扰动探测的文本。现有的零发检测器往往依赖隐含假设高斯分布的统计措施,而这一假设在面对激烈的统计手工艺特征时会动摇。本文介绍T-检测,这是一种创新的检测方法,从根本上重新设计了以曲线为基础的探测器的统计核心。我们的主要创新是取代标准高斯的正常化,代之以从学生的 T 分布中得出的严重快速差异分数。从理论上讲,这一方法所依据的是实证观察,即对抗性文本表现出显著的利普托科松散,使传统的统计假设变得不够充分。T-检测通过使一段通道与预期的高度分布时间的逻辑相似性平准,为统计外端探测器提供较强的弹性。我们关于对抗性能的RAID基准和全面HATCT数据设置的精确度分数。 实验显示,T-Serverialal-alalal-lax a lagal lavel lax a stal deal descristrational develyal laft as the laft laft as the laview stal-deal-laview laviewal-deal-deal-deal-deal-deal-deal-labal-ladal-deal-lax slations ladal ladals ladal laislation ladal ladal ladal ladaldaldaldal ladaldaldaldaldal ladal lads ladal ladaldaldal ladal ladaldaldaldaldaldaldal ladaldal ladal ladaldal ladaldal ladaldaldaldaldaldal ladaldal ladaldaldaldaldaldal ladal ladal ladaldaldaldaldaldaldaldaldaldaldals ladals las ladal
Article 21
Title@2025-07-31 (4): Neutral Residues: Revisiting Adapters for Model Extension
Title: Neutral Residues: Revisiting Adapters for Model Extension | Neutrale Rückstände: Adapter zur Modellerweiterung | 中立残留物:重新审视适应器,用于示范推广 2410.02744v3 |
Authors (3): Franck Signe Talla, Edouard Grave, Hervé Jégou
We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as finetuning or low-rank adaptation (LoRA) are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off, between performing well on the new domain vs. degrading performance on the original domain. Here, we revisit and improve adapters to extend LLMs from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as finetuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.
我们处理将预先培训的大型语言模式扩大到培训期间没有看到的新领域的问题。标准技术,如微调或低级别适应(LORA)在领域适应方面是成功的,但并不正式增加模型能力。这往往导致在新领域表现良好与原始领域表现有辱人格之间取舍。在这里,我们重新审视和改进适应器,将LLMS从三个角度扩大:数据、结构和培训程序,这三者是共同考虑的优势。由此产生的方法,称为中性残留物,将适应器改变为导致每个新的剩余块到原始领域接近零的输出。当将最初接受英语培训的先进模型改造成新语言时,这一解决方案将带来强有力的结果。中性残留物在学习新语言和不忘英语之间的交易中,大大超越了微调、LORA或香草适应器等相互竞争的方法。
Article 22
Title@2025-07-31 (4): Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Title: Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation | Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation | LLMs能否协助其模糊性? 量化评估关于 “ Word Sense Disanderation “ 的各种大语言模型。 2411.18337v4 |
Authors (3): T. G. D. K. Sumanathilaka, Nicholas Micallef, Julian Hough
Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-loop approach for prompt augmentation where prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.
现代数字通信中经常出现含糊不清的词句。由于数据有限,传统的Word Sense Dismendation(WSD)方法存在明显的模糊性。因此,翻译、信息检索和问答系统的效率受到这些限制的阻碍。本研究报告调查了使用大语言模型(LLMs)改进WSD的新办法,将系统的迅速增强机制与知识库(KB)相结合,由不同感知解释组成。拟议方法包括了快速增强的“人与人”方法,这种方法得到“语言部分”标记、模糊词的同义词、基于侧面感的过滤以及指导LLM的几发提示。通过采用“几发式思维链(COT)”的提示性方法,这项工作表明业绩有了很大的改进。评价是利用FEWS测试数据和感官标记进行的。这项研究在社会媒体和数字通信中促进了准确的字解释。
Article 23
Title@2025-07-31 (4): Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
Title: Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning | Med-R$^3$: Verbesserung der medizinischen Retrieval-Augmented Reasoning von LLMs durch Progressive Verstärkung Lernen | 3美元Med-R$3美元:通过逐步加强学习加强医疗取回-增加LLMs的理据 2507.23541v1 |
Authors (8): Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng
In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored to improve retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53\%.
在医学假设中,有效检索外部知识并利用外部知识进行严格的逻辑推理非常重要,尽管现有工作具有潜力,但主要侧重于加强模型孤立的检索或推理能力,很少注意联合优化,导致两个进程之间的协调有限;此外,目前的方法严重依赖监管的微调(SFT),这可以使模型记住现有的解决问题路径,从而在面临新问题的情况下限制其概括化能力;此外,虽然一些研究探索了通过强化学习改进一般领域的回收-强化推理能力,但其奖赏功能设计没有充分反映医疗领域的具体需求;为应对这些挑战,我们引入了Med-R3美元3,一个Med 质评** 实质性微调(SFT),这可以促使模型在面临新问题的情况下限制其普及能力;此外,尽管一些研究探索了通过强化学习,在一般领域改进回收-PT3的推理学,但它们的奖赏参数不能充分捕捉到医疗领域的具体需求;为了应对这些挑战,我们根据这个基础,我们调整了回收能力,R3 R3 降-R3 3 比例,我们引入了成本 和MMMM ,最终将业绩的升级能力,同时我们通过外部信息的特性进行更精确的推理化。
Article 24
Title@2025-07-31 (4): PurpCode: Reasoning for Safer Code Generation
Title: PurpCode: Reasoning for Safer Code Generation | PurpCode: Begründung für eine sicherere Code-Generierung | PurpCode:更安全代码生成的理由 2507.19060v2 |
Authors (14): Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang
We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.
我们引入了PurpCode(PurpCode)(PurpCode)(PurpCode)(PurpCode)(PurpCode)(Purcledge Learning)(这是培训安全代码推理模型的第一个培训后指南)(PurpCode)(这是培训安全代码推理模型的第一批培训后配方),旨在生成安全代码和防范恶意网络活动。PurpCode(Purp Learning)将一个推理模型分为两个阶段:(一) 规则学习,明确教授参考网络安全规则模式,以生成无脆弱性代码,避免为恶意网络活动提供便利;(二) 强化学习(Sergment Learning)(Sergment)(通过多种多目标奖励机制优化模式安全模式,维护模型的实用性,通过综合网络安全数据使培训管道具备能力,我们内部红队(refusal)将基于现实世界任务的全面和高覆盖性提示器,同时维护代码生成和共同安全知识的模型实用性。
Article 25
Title@2025-07-31 (4): MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
Title: MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks | MECAT: Ein Multi-Experten-Benchmark für feinkörnige Audio-Verstandsaufgaben | MECAT: 完善的音频理解任务多专家基准 2507.23511v1 |
Authors (10): Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
虽然大型的音频模型提高了开放的音频理解程度,但它们仍然没有达到细微的人类理解水平,这一差距依然存在,主要是因为目前的基准受到数据说明和评价指标的限制,无法可靠地区分通用和高度详细的模型产出。为此,这项工作引入了MECAT, 即精细读音频理解任务多专家构建基准。通过将专业专家模型的分析与深层次语言链大模型推理相结合的管道生成的,MECAT提供了多视角、精细的字幕和开放式的问答配对。该基准得到了新的指标的补充:DATE(差异性-强化音频文本评价)。该指标将单模类语义相似性和交叉分布相容性结合起来,对通用术语和详细描述进行处罚。还介绍了对最新音频模型的全面评价,提供了对其当前能力和限制的新认识。数据和代码可在https://github.com/sioomim-commexemexearch/cat。
Article 26
Title@2025-07-31 (4): LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | LLaVA-MORE: Eine vergleichende Studie von LLMs und visuellen Backbones für verbesserte visuelle Instruktions-Tuning | LLAVA-MORE:用于强化视觉教学的LLM和视觉背骨比较研究 2503.15621v2 |
Authors (7): Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs – including Phi-4, LLaMA-3.1, and Gemma-2 – to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.
在多式大型语言模型(MLLM)中,最近的进展突出了视觉骨干和基本语言模型的关键作用。虽然先前的工作主要侧重于将这些组成部分扩大到数十亿个参数,但模型规模、结构和性能之间的权衡取舍仍未得到充分探讨。此外,培训数据和评价协议的不一致妨碍了直接比较,使得很难得出最佳设计选择。在本文件中,我们引入了LalVA-MORE,这是MLLMM的新组合,将最新语言模型与不同的视觉骨干结合起来。为了确保公平比较,我们采用了在所有建筑中一致应用的统一培训协议。我们的分析系统地探索中小型LLMS – – 包括Phi-4、Lama-3.1和Gemma-2 – – 来评估模型规模、结构和性能之间的权衡,同时审查模型规模与性能之间的关系。除了评估LLLMMMMMM(LIP)对最终结果的影响外,我们还对各种视觉模型进行了全面研究,从基于CLIP的架构到DINO2、SigLIP和SigLIP-2等替代方法。我们的分析系统地探索中中,进一步研究了中我们经过培训的图像分析的模型的模型的模型分析结果,为我们的图像分析框架的模型的改进提供了更多的分析结果的改进。
Article 27
Title@2025-07-31 (4): A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Title: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains | Ein neuartiger Bewertungs-Benchmark für medizinische LLMs: Beleuchtende Sicherheit und Wirksamkeit in klinischen Bereichen | 医疗LLMs新颖的评价基准:临床域的引明安全和有效性 2507.23486v1 |
Authors (38): Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
大型语言模型(LLMS)在临床决策支持方面很有希望,但在安全评估和有效性验证方面面临重大挑战。我们制定了临床安全有效双轨基准(CSEDB),这是一个基于临床专家共识的多层面框架,包括30项标准,涵盖关键疾病识别、遵守准则、药品安全等关键领域,并附有加权后果措施。32名专家医生制定并审查了符合这些标准的2 069个开放的A项目,涵盖26个临床部门,以模拟现实世界情景。对6个LMS的基准测试显示,总体绩效中等(平均共57.2%、安全54.7%、有效性62.3%),高风险情景中绩效显著下降13.3%(p < 0.0001 ),具体针对特定领域的医疗LMS显示,业绩优于通用模式,安全性最高分数(0.912)和有效性(0.861)相对较高。该研究的结果不仅为评价医疗LMS临床应用提供了标准化衡量标准,便利比较分析、风险暴露识别和改善不同情景方向,而且还有可能促进在医疗保健环境中更安全和更有效地部署大型语言模型。
Article 28
Title@2025-07-31 (4): Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Title: Role-Aware Language Models for Secure and Contextualized Access Control in Organizations | Role-Aware Sprachmodelle für sichere und kontextualisierte Zugriffskontrolle in Organisationen | 各组织内安全和环境化出入控制使用控制实用语言模式 2507.23465v1 |
Authors (7): Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.
随着大型语言模型(LLMs)越来越多地被应用于企业环境,控制基于用户作用的模型行为成为一项基本要求。现有安全方法通常采取统一的准入方式,侧重于预防有害或有毒产出,而不解决特定角色准入限制。在这项工作中,我们调查LMs是否可以进行微调,以产生反映与不同组织作用相关的准入特权的应对措施。我们探讨三种示范战略:基于BERT的分类器、基于LLM的分类器和基于角色的生成。为了评估这些方法,我们建立了两个互补数据集。第一个数据集是通过集群和角色标签从现有的指令调整公司团体中改编的,第二个是合成生成的,以反映现实的、对角色敏感的企业情景。我们评估不同组织结构的示范业绩,分析快速注入、角色错配和破例尝试的稳健性。
Article 29
Title@2025-07-31 (4): Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Title: Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems | Counterfactual Evaluation für Blindangriffserkennung in LLM-basierten Evaluationssystemen | 以LLM为基础的评价系统中盲人攻击探测的反事实评价 2507.23453v1 |
Authors (7): Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
本文调查基于LLM的评估系统防范迅速注射的防御性。 我们正式确定了一类威胁,称为盲目袭击,候选人的回答是独立于欺骗评估者的真实答案之外的。 为了对付这些袭击,我们提出了一个框架,用反事实评估来补充标准评估,根据蓄意的虚假地面真相回答来重新评估提交材料。如果系统根据标准和反事实条件验证一个答案,就会发现攻击。 实验显示,虽然标准评估非常脆弱,但我们的SE+CFE框架通过以最低性能权衡来提升攻击检测,大大改善了安全。
Article 30
Title@2025-07-31 (4): EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework
Title: EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework | BildungQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen | 教育Q:通过多机构对话框架评价LLMS的教学能力 2504.14928v3 |
Authors (3): Yao Shi, Rongkeng Liang, Yong Xu
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.
大型语言模式(LLMS)日益成为教育工具,然而,由于教师与学生之间互动的资源密集、环境依赖、方法复杂,评估其教学能力仍具有挑战性。我们引入了教育Q,这是一个多媒介对话框架,通过模拟动态教育情景,有效评估教学能力,由教学、学习和评价的专门代理人组成。测试主要的独立组织(OpenAI、Meta、Google、Anthrotic等)的14个LLMS,涉及13个学科和10个难度层次的1 498个问题,显示教学效力与模型规模或一般推理能力没有线性关系。一些较小的开放源模式在教学环境中比较大的商业对应方表现要好。这一发现凸显了当前评价中的一个关键差距,这种评价将知识的回顾放在互动教学、学习和评价的优先地位之上。我们混合方法的评价,将定量指标与定性分析和专家案例研究相结合,确定了最高业绩模型(例如精密的问询策略、适应性反馈机制)所使用的不同教学优势。人类专家评价表明,78%的人同意我们对有效教学行为进行自动化的质量分析,验证我们下一个方法。
Article 31
Title@2025-07-31 (4): The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
Title: The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models | Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen | 机器的实用思维:追踪大语言模式中实用能力的出现 2505.18497v2 |
Authors (6): Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt
Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker’s intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thus directly probing pragmatic competence through contrastive reasoning. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.
目前大型语言模型(LLMS)显示了社会情报任务中新出现的能力,包括隐含的解析和思维理论推理,两者都要求实质性的务实理解;然而,LLMS在整个培训过程中如何获得这种务实的能力,对此仍不甚了解;在这项工作中,我们引入了基于实用替代概念的数据集ALTPRAG,以评价不同培训阶段的LLMS能否准确地推导出演讲者的意图。每个实例都对两种同样合理但务实的延续,要求模式:(一) 推断发言者的预期意义,以及(二) 解释发言者何时和为什么选择一种表达方式取代其替代办法,从而通过对比推理直接探索务实的能力。我们系统地评估了三个关键培训阶段的22 LLMS:在培训前、监督的微调(SFT)和偏好,以审查务实能力的发展。我们的结果表明,即使是基础模型也显示出对务实的提示的显著敏感性,这些提示随着模型和数据规模的提高而不断提高。SFT和RHFFFFF将促进进一步的成果,特别是在认知-Revalmaical imiming imingal impal impresulation impal impal impal impresulation。
Article 32
Title@2025-07-31 (4): Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
Title: Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration | Über passives kritisches Denken hinaus: Förderung proaktiver Befragungen zur Verbesserung der Mensch-KI-Kollaboration | 超越被动的批判性思考:促进积极主动的提问,以加强人类与大赦国际的协作 2507.23407v1 |
Authors (7): Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu, Xinyan Xiao, Jinsong Su
Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.
然而,先前的工作主要侧重于被动批判性思维,模型只是拒绝有问题的询问,而没有采取建设性步骤来应对用户的要求。在这项工作中,我们引入了积极主动的批判性思维,即模式积极寻找用户丢失的信息或澄清用户提供的信息以更好地解决其询问的范例。为了评估这一能力,我们提出了基于GSM-MC和GSM-MC-MC的基于GS8K的两个基于GSM8K的新基准,以便在不完整或误导的条件下评估数学推理。GSM-MC包含1 368个数学问题,关键变量被故意删除,需要模型识别和请求缺失的信息。GSM-MCE进一步增加了难度,引入了不相关的细节,以测试抗分心的强健性。关于Quen3和Llama系列模型的实验表明,尽管这些模型在传统的推理任务中表现优异于广泛的培训后和推导时间缩,但它们与积极主动的批判性思维,特别是较小的思维。然而,我们证明,强化学习(RL)能够大大改进这一能力。我们利用强化的RL算法,实现了实质性的成绩,我们从GSMMMM3-1的精确度思考了G-198的进度,我们从0.98到G-1.10的精确度思考了G-11的进度。
Article 33
Title@2025-07-31 (4): RAVine: Reality-Aligned Evaluation for Agentic Search
Title: RAVine: Reality-Aligned Evaluation for Agentic Search | RAVine: Realitätsorientierte Bewertung für die Agentische Suche | RAVine: 化学搜索的现实统一评价 2507.16725v2 |
Authors (4): Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao
Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine – a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines model’s interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.
作为一种更自主和适应性的增强检索模式,机械搜索正在推动智能搜索系统的演进,但是,现有的评价框架未能与代理搜索的目标相一致。首先,当前基准中常用的复杂查询往往与现实用户搜索情景不同。其次,先前的方法往往在为端到端评价提取地面真相时引入噪音,导致微小评估的扭曲。第三,大多数当前框架仅侧重于最终答案的质量,忽视了对代理搜索所固有的迭接过程的评价。为克服这些限制,我们提议了RAVine – – 一个用于搜索的代理LLMS的真实性-统一电子估价框架。RAVine针对更好地反映用户意图的多点查询和长式答案,并提出了可归属的地面真相构建战略,以提高精细评估的准确性。此外,RAVine还检查了模型在整个迭接过程中与搜索工具的相互作用,忽略了对效率因素的核算。我们用RAVine为一系列模型设定基准,并提出了若干洞察力,我们希望这将推动代理搜索系统的发展。
Article 34
Title@2025-07-31 (4): Enhanced Arabic Text Retrieval with Attentive Relevance Scoring
Title: Enhanced Arabic Text Retrieval with Attentive Relevance Scoring | Verbesserte arabische Text-Retrieval mit aufmerksamer Relevanz Scoring | 阿拉伯强化文本检索, 带有启动相关性显示器 2507.23404v1 |
Authors (5): Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid
Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at \href{https://github.com/Bekhouche/APR}{GitHub}.
阿拉伯语因其复杂的形态学、选择性偏差学以及现代标准阿拉伯语和各种方言的共存,对自然语言处理和信息检索构成特别的挑战。尽管阿拉伯语在全球的重要性日益增大,但在国家标准阿拉伯语的研究和基准资源中仍然代表不足。在本文中,我们提出了一个专门为阿拉伯语开发的强化的“通过检索”框架。我们的方法的核心是一个新的“强化相关性分级”(ARS),用适应性评分功能取代标准互动机制,后者更有效地模拟问题和段落之间的语义相关性。我们的方法结合了经过预先训练的阿拉伯语模型和建筑改进,以提高检索性能,并在回答阿拉伯语问题时大大提高排序准确性。代码在以下网站公布:https://github.com/Bekhouche/APRGitHub}。
Article 35
Title@2025-07-31 (4): MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization
Title: MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization | MRGSEM-Sum: Ein unbeaufsichtigtes Multi-Dokument Zusammenfassungsrahmen basierend auf Multi-Relational Graphen und struktureller Entropie Minimierung | MRGSEM-Sum:基于多关系图和结构元件最小化的无人监督的多文件概括框架 2507.23400v1 |
Authors (6): Yongbing Zhang, Fang Nan, Shengxiang Gao, Yuxin Huang, Kaiwen Tan, Zhengtao Yu
The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.
多文件总和面临的核心挑战在于文件之间的关系的复杂性和信息冗余的存在。图表群集是解决这一问题的有效范例,因为它模拟了使用图表结构的文件之间的复杂关系,并通过集成,取得了显著的研究进展,减少了信息冗余。然而,现有方法往往只考虑单一关系图,需要预先界定的组群数量,这妨碍了它们充分代表丰富的关联信息和适应性分解组以减少冗余的能力。为了克服这些局限性,我们提议MRGSEM-Sum是一个不受监督的多文件总和框架,以多关系图和结构质量最小化为基础,建立一个不受监督的多文件总和框架。具体地说,我们构建了一个多关系图,通过集成和讨论各句子之间的关系,全面建模;然后,我们采用二维结构最小化的最小化算法,自动确定组群集的最佳数量,并有效地将判决组织成一致的组群。最后,我们引入了一种定位压缩机制,将每个组群集进行分解,生成简明和内容精准的多质量最小化;具体地,我们构建了一个多语系的多语系关系图,在四个基准组群集中进行广泛的实验,用比标化的模型,并展示了我们以往的MILS-A-B-B-B-B-B-B-B-C-C-S-C-C-C-C-C-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-A-A-S-S-S-S-S-S-S-S-Axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Article 36
Title@2025-07-31 (4): Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators
Title: Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators | Beyond the Cloud: Bewertung der Vorteile und Nachteile lokaler LLM-Einsatzmöglichkeiten für Übersetzer | 云云之外:评估为笔译员部署当地LLM的利弊 2507.23399v1 |
Authors (1): Peter Sandrini
The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compared against commercially available online chat-bots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.
大语言模型的迅速扩散为翻译领域带来了机遇和挑战。虽然商业的、以云为基础的人工聊天机在翻译研究中引起了极大关注,但对于数据隐私、安全和公平获取的关切要求探索替代部署模式。本文件调查了当地可部署的免费语言模型的可行性和性能,作为专有的、以云为基础的人工智能解决方案的可行替代方案。本研究报告评估了在基于CPU的平台上安装的三种开放源模式,对照商业上可提供的在线聊天机进行比较。评价的重点是功能性业绩,而不是对已经广泛研究的人力资源翻译质量进行比较分析。所评估的平台是根据其可及性和在各种操作系统中的易用性而选择的。虽然当地部署带来了自己的挑战,但加强数据控制、改进隐私和减少对云服务的依赖的好处是令人信服的。本研究报告的结论有助于增加关于AI技术民主化的知识,并为今后旨在使更多的用户更容易接触和实用的LLMS进行研究和开发努力提供信息,特别是侧重于个别笔译员和小企业的需要。
Article 37
Title@2025-07-31 (4): Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models
Title: Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models | Causal2Vec: Verbessere Dekoder-nur LLMs als vielseitige Einbettungsmodelle | Causal2Vec:改进只有解码器的LLMs作为Versatile嵌入模型 2507.23386v1 |
Authors (3): Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi
Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
解码器的大型语言模型(LLMS)正越来越多地被用于构建嵌入模型,将自然语言文本的语义信息有效编码成密集的矢量演示,用于各种嵌入任务。然而,许多现有方法主要侧重于消除LLM中因果关注面罩,以便双向关注,有可能破坏该模型提取在预培训前获得的语义信息的能力。此外,领先的单向方法往往依赖额外的输入文本来克服因果关系的内在局限性,不可避免地增加计算成本。在这项工作中,我们提议一个通用嵌入模型Causal2Vec,这是一个通用嵌入模型,专门用来在不改变其原始结构或引入重大计算间接成本的情况下,提高只使用LMSMs的性能。具体地说,我们首先使用一个轻量的BERT型模型,将输入文本预编码成单一的背景符号,然后将它预先定位到LMSM的输入序列,允许每一种信号通过不参与未来符号来获取背景信息。此外,我们提议通过最后的粘贴式集合和LMSLMS的直观来减轻在最后时间模型中引入的偏移偏差偏差,然后帮助在S-inal-inal-li-inal-li-lical-li-li-li-li-licalcol-lical-lical-licol-licol-licol-inal-lical 上,同时将S-lical 将S-inal-licol-inal-inal-inal 将Sild 将Sildald-inal-in 将Slicold-licolvicild 将S-in 将S-in 将S-in 将S-in 将S-ial-inal-ial-li-li-li-li-li-li-li-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-inal-in-inal-inal-inal-inal-lical-inal-inal-inal-inal-inal-lical-inal-inal-
Article 38
Title@2025-07-31 (4): MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
Title: MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models | MPCC: Ein neuartiger Benchmark für multimodale Planung mit komplexen Einschränkungen in multimodalen großen Sprachmodellen | MPCC:具有多种多语言模式复杂限制的多式联运规划新基准 2507.23382v1 |
Authors (6): Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che
Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
为解决这些问题,我们引入了多种模式规划能力,这是系统地评估MLLMS在规划中处理多式联运制约的能力的第一个基准。为了应对第一个挑战,MCC侧重于三种现实世界任务:飞行规划、日历规划和会议规划。为了解决第二个挑战,我们在这些任务中引入了复杂的制约因素(例如预算、时间和空间),这些制约因素(例如预算、时间和空间),有等级化的难度水平(EASY、MEDIUM、HARD)将制约性复杂性与搜索空间扩展分开。对13个高级MLLMs的实验揭示了重大挑战:封闭源模型只达到21.3%的可行计划,而开放源模式平均低于11 %。此外,我们发现MLLMS对制约复杂性非常敏感,传统的多式联运催化战略在多式制约性应用中失败。我们的工作需要正式的僵化的MLLM逻辑。
Article 39
Title@2025-07-31 (4): Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models | Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen | 所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架 2506.07106v2 |
Authors (4): Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.
大型语言模型(LLMS)在自然语言推理任务中表现出很强的性能,但其推理过程仍然微弱,难以解释。 催化技术(如CoT)通过引出中间推理步骤或综合多种产出来提高可靠性。 但是,它们缺乏执行逻辑结构和评估内部一致性的机制。 我们引入了推理理论(toTh)的新框架,该理论框架将推理作为三个平行代理人之间的协作模式,每个推理过程模拟一种独特的推理模式:绑架、推理和感化。每个代理人都产生推理跟踪,它的结构是正式推理图表。为了评价一致性,我们应用由自然语言推理(NLI)指导的贝叶信仰传播,给每个步骤分配信任分数。选择最连贯的图表来得出最后答案。 我们的符号(WebOFLies)和数字(MultiArith)推理基准实验显示,TT始终超越CT、自求和CT-Decoding 跨多个LMS,同时制作可解释和逻辑推理推理的逻辑推理的推理。 我们的发现/CLILGMLIG/M的发现提出了有希望的方向。 在可理解性/CMLIGIGIG/CS/CS/CS/CS/CLGIGIGM的推论。
Article 40
Title@2025-07-31 (4): WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation
Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation | WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch | WirdSpeech-Bench:为自然演讲对话中的音频LMs设定基准 2506.21875v2 |
Authors (6): Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou
Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
GPT-4o等最新多式大语言模型(LLMs)展示了直接语音互动的强大能力,然而,对于终端至终端语音LLM评价缺乏专门和全面的基准,妨碍了在现实应用中优化音频LMLM的用户经验。现有的评价方法经常调整基于文本的基准,忽略了语言的独特性和挑战,包括流言、同声、静语和不同的用户期望。在这里,我们提出了一个新颖的方法,在实际语音对话中彻底评价LLMS。我们系统地整理与发言情景相关的真实世界聊天数据,引入语音属性和声学条件的多样性,并以特定语言现象补充数据集。我们进一步设计了一种有查询觉的评价方法,使用定制的评价清单,并迅速提高自动评价的准确性。我们全面测试和详细分析各种主流语言模型,揭示不同语音情景在示范性表现上的巨大差异。使用查询评估进一步使得在各种特定语音情景下进行精细的评估。我们的基准可以为语音模型的开发和评价提供宝贵的洞察力。
Article 41
Title@2025-07-31 (4): Holistic Evaluations of Topic Models
Title: Holistic Evaluations of Topic Models | Ganzheitliche Bewertungen von Themenmodellen | 专题模式整体评价 2507.23364v1 |
Authors (1): Thomas Compton
Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a ‘black box’, where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models
商业和学术对专题模型总结大量无结构化文本的能力越来越感兴趣。作为不受监督的机器学习方法,这些模型使研究人员能够探索数据,帮助一般用户理解大型文本收藏中的关键主题。然而,它们有可能成为一个“黑箱”,用户输入数据,接受输出为准确的概要而无需仔细审查。本文章从数据库的角度评价专题模型,从1140 BERTopic 模型运行中得出见解。目的是确定在优化模型参数方面的权衡,并思考这些结论对专题模型的解释和负责任使用意味着什么。
Article 42
Title@2025-07-31 (4): Robust and Fine-Grained Detection of AI Generated Texts
Title: Robust and Fine-Grained Detection of AI Generated Texts | Robuste und feinkörnige Erkennung von KI-generierten Texten | 对 AI 生成文本的强力和精细探测 2504.11952v3 |
Authors (14): Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Suman Debnath, Hamza Farooq
An ideal detection system for machine generated content is supposed to work well on any generator as many more advanced LLMs come into existence day by day. Existing systems often struggle with accurately identifying AI-generated content over shorter texts. Further, not all texts might be entirely authored by a human or LLM, hence we focused more over partial cases i.e human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification which are trained on an extensive collection of human-machine co-authored texts, which performed well over texts of unseen domains, unseen generators, texts by non-native speakers and those with adversarial inputs. We also introduce a new dataset of over 2.4M such texts mostly co-authored by several popular proprietary LLMs over 23 languages. We also present findings of our models’ performance over each texts of each domain and generator. Additional findings include comparison of performance against each adversarial method, length of input texts and characteristics of generated texts compared to the original human authored texts.
理想的机器生成内容检测系统应适用于任何发电机,因为许多更先进的LLM每天都存在。现有系统往往难以准确识别AI产生的内容,而不能精确辨别较短的文本。此外,并非所有文本都可能完全由人或LLM编写,因此我们更侧重于部分案例,即人-LLM共同编写的文本。我们的文件介绍了一套为象征性分类任务而设计的模型,这些模型经过培训,涉及大量人体-机器共同编写的文本,这些文本在看不见域的文本、看不见的生成者、非母语发言人的文本和有对抗性投入的文本方面表现得非常出色。我们还引入了2.4M以上这类文本的新数据集,这些文本大多由超过23种语言的几个受欢迎的专利LMM共同编写。我们还介绍了我们模型对每个域和生成器的每种文本的绩效调查结果。其他调查结果包括对照每一种对抗方法的绩效、输入文本的长度和生成文本的特点与原始人类撰写的文本的对比。
Article 43
Title@2025-07-31 (4): SWE-Exp: Experience-Driven Software Issue Resolution
Title: SWE-Exp: Experience-Driven Software Issue Resolution | SWE-Exp: Erfahrungsgetriebene Software-Ausgabeauflösung | SWE-Expl:经验丰富的软件问题决议 2507.23361v1 |
Authors (10): Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang
Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers - treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience - enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels - from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.
大型语言模型(LLM)代理最近的进展表明,在软件问题的解决方面取得了显著进展,利用了多剂协作和蒙特卡洛树搜索等先进技术。然而,目前代理作为没有记忆的探险家,在不保留或重复以往修复经验的知识的情况下分别处理每个问题,从而导致对失败的轨迹进行重复探索,并错过了将成功解决问题的方法适应类似问题的机会。为解决这一问题,我们引入SWE-Exporation(SWE-Exporation),一种强化方法,从以前的代理轨迹中提取简明和可操作的经验,使各种问题能够不断学习。我们的方法引入了一个多面的经验库,记录成功和失败的修复尝试。具体地说,它提取了不同层次的可重复的解决问题知识――从高层次的问题理解到具体的代码变化。实验表明,SWE-Explex在开放源代理框架下对SWE-bench-Verizer化的SWE-pass@1,在SWE-bench-vicer 框架下,我们的方法建立了一个新的范例,使自动软件工程代理系统积累和利用修复专门知识,从试验和驱动的解决方案问题从根本上转向战略探索。
Article 44
Title@2025-07-31 (4): VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Title: VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning | VL-Cogito: Progressives Curriculum-Verstärkungslernen für fortgeschrittene multimodale Vernunft | VL-Cocito:先进多式联运理由的渐进课程强化学习 2507.22607v2 |
Authors (12): Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong
Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
近期的研究工作逐渐将这一模式扩大到多式联运的推理任务。由于多式联运任务,特别是语义内容和问题配方的内在复杂性和多样性,现有模式在各个领域和困难程度中往往表现不稳定。为解决这些限制,我们建议VL-Cogito,这是一个通过新的多阶段进步课程强化学习(PCuRL)框架培训的先进的多式推理模型。PCuRL通过逐步增加难度、大幅度提高不同多式联运背景的推理能力等任务,系统地指导模型。框架引入了两项关键创新:(1) 在线困难软加权机制,动态调整整个RL培训阶段的培训困难;(2) 动态奖励机制,鼓励模型根据任务复杂性调整其推理过程长度,从而平衡推理效率与正确性。实验性评估表明,VL-Cogito在数学、科学、逻辑和一般理解等主流多式基准方面,始终或超过现有的推理模式。该框架引入了两项关键创新,即:(1) 在线困难软加权机制,动态调整了培训难度,并验证了我们的方法的有效性。
Article 45
Title@2025-07-31 (4): Text-to-SQL Task-oriented Dialogue Ontology Construction
Title: Text-to-SQL Task-oriented Dialogue Ontology Construction | Text-zu-SQL Aufgabenorientierter Dialog Ontologie Konstruktion | 以任务为导向的对话肿瘤构建 2507.23358v1 |
Authors (8): Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic
Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.
大型语言模型(LLMS)被广泛用作通用知识来源,但它们依赖参数知识,限制解释性和可信度。在任务导向对话(TOD)系统中,这种分离是明确的,使用由明确的本体学构建的外部数据库来确保解释性和可控制性。然而,建立这种本体学需要人工标签或监督培训。我们引入了TeQODO:一个文本到SQL的任务导向式对话本体构建方法。在这里,一个LM自主地从零开始建立一种TOD本体,没有监督,使用它固有的 SQL 编程能力,加上快速提供的对话理论。我们显示TeQODO超越了传输学习方法,其构建本体学在下游对话状态跟踪任务中具有竞争力。吸收研究显示了对话理论的关键作用。TeQDO还进行了规模,以便构建大得多的本体论,我们在Wike和ArXiv数据集上对此进行了调查。我们将此视为朝着更广泛地应用本体学来提高LLM的解释性迈出了一步。
Article 46
Title@2025-07-31 (4): KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
Title: KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities | KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung | KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v2 |
Authors (4): Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.
在再培训大型语言模型(LLMS)以纳入更多的知识时,微调是一个非常耗资的过程。虽然已经开发了许多微调技术来减少所涉及的时间和计算成本,但挑战依然存在,因为LLMS在规模和复杂性上继续增长。要解决这个问题,就需要在LLMS中采用新的知识扩展方法。检索增强的一代(RAG)提供了一个这样的替代办法,将外部知识储存在一个数据库中,并重新获取相关数据块以支持回答问题。然而,对RAG的天真的实施在可缩缩缩缩和回答准确性方面面临着重大限制。本文介绍了KeyKKKnowledgeraG(K2RAG),这是一个旨在克服这些限制的新框架。受分解和正拼模式的启发,K2RAG整合了密集和稀薄的矢量搜索、知识图表和文本合成,以提高检索质量和系统效率。框架还包括一个预处理步骤,以总结培训数据,大大减少培训时间。K2RAG的天平调度评估是通过多HOG(M)的缩缩缩略数据集,其中拟议的编订了最高执行时间,并测试了KLADADAD的进度。
Article 47
Title@2025-07-31 (4): SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
Title: SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution | SWE-Debatte: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen | SWE-Debate:解决软件问题竞争性多机构辩论 2507.23348v1 |
Authors (9): Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang
Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.
由于大型语言模型(LLMs)的先进推理能力,问题解决取得了显著进展。最近,SWE代理商等基于代理商的框架进一步推进了这一进展,使自动使用工具的代理商能够应对复杂的软件工程任务。虽然现有基于代理商的问题解决方法主要基于代理商的独立探索,但它们往往被困在本地解决方案中,无法查明跨越代码库不同部分的问题模式。为了应对这一限制,我们提议SWE-Debate,这是一个竞争性多代理商辩论框架,鼓励多种推理路径,实现更综合的问题本地化。SWE-Debate首先通过绘制代码依赖性图表,生成多重错误传播痕迹,作为本地化建议。然后,它组织专门代理商之间的三轮辩论,每个都体现了与错误传播跟踪相关的不同推理观点。这种结构竞争使代理商能够就综合固定计划开展合作。最后,这一综合固定计划被纳入基于MCTS的代码修改工具,用于补丁生成。SWE-Debate在SWE-Bench基准上进行的实验表明,SWE-Deate在开放代理商基准框架中实现了新的州差幅。
Article 48
Title@2025-07-31 (4): Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Title: Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance | Mehrsprachige Fähigkeiten mit kulturellem und lokalem Wissen in großen Sprachmodellen verbessern und gleichzeitig die Leistungsfähigkeit der Ureinwohner verbessern | 提高多语言多语言能力,在提高土著绩效的同时,利用大语言模式的文化和地方知识,同时提高土著绩效 2504.09753v3 |
Authors (9): Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Suman Debnath, Hamza Farooq
Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM \textbf{Mantra-14B} with ~3\% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under mit and apache licenses to aid further research towards under-represented and low-resource languages.
大型语言模型(LLMS)表现出了非凡的能力,但其开发主要侧重于英语和其他高资源语言,使许多语言得不到充分服务。我们展示了印度语-英语双语LLM \ textbf{Mantra-14B}的最新印度语-英语双语言LLM \ textbf{Mantra-14B},两者的基准分数平均提高了3,优于其规模的两倍。我们使用由485K样本的英语和印地语教学数据组成的经整理的数据集,指导了Quen-2.5-14B-Instruct和Phid语等经调整的模型,以提高英语和印地语的绩效。我们涵盖7个不同参数大小的不同LLMs和140多个培训尝试的实验表明,在不损及本地绩效的前提下大幅改进多种语言的绩效是可能的。此外,我们的方法避免了词汇扩展或建筑改造等资源密集型技术,从而使模型规模小。我们的结果表明,对文化和当地知情数据进行适度的微调,可以弥补绩效差距,而不会产生重大的计算间接费用。我们发布了我们的培训代码、数据集和模型,用于协助低语言的研究。
Article 49
Title@2025-07-31 (4): DSBC : Data Science task Benchmarking with Context engineering
Title: DSBC : Data Science task Benchmarking with Context engineering | DSBC : Data Science-Aufgabe Benchmarking mit Kontext-Engineering | DSBC: 数据科学任务与背景工程基准 2507.23336v1 |
Authors (6): Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Suman Debnath, Hamza Farooq
Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.
大型语言模型(LLMS)的最近进展对数据科学工作流程产生了重大影响,产生了专门的数据科学代理物,目的是实现分析任务的自动化。尽管迅速采用,但评估这些代理物的功效和局限性的系统基准仍然很少。在本文件中,我们引入了一个全面基准,专门通过观察我们商业应用的使用情况来反映实际用户与数据科学代理物的相互作用。我们评估了三个LLMs:Claude-4.0-Sonnet、Gemini-2.5-Flash和OpenAI-o4-Mini,这三种方法包括:环境工程零弹射、环境工程多步和SmolAgency。我们的基准评估了八个数据科学任务类别的业绩,另外探讨了模型对共同提示问题的敏感性,例如数据泄漏和略微模糊的指示。我们进一步调查了温度参数对每个模型和方法的总体和具体任务结果的影响。我们的调查结果揭示了评价模型和方法之间不同的业绩差异,突出了影响实际应用的关键因素。我们在此介绍的基准数据集和评价框架的目的是为未来研究更可靠和有效的数据科学代理物提供基础。
Article 50
Title@2025-07-31 (4): MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation
Title: MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation | MUST-RAG: MUSical Text Question Beantwortung mit retrieval Augmented Generation | MOST-RAG: 以回取增加的一代人回答的中文本问题 2507.23334v1 |
Authors (3): Daeyong Kwon, SeungHeon Doh, Juhan Nam
Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs’ effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilizes context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs’ music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
大型语言模型(LLMS)的近期进步表明,大语言模型(LLMS)在不同领域都取得了显著进步,尽管在各种任务上表现得非常零,但LLMS在音乐相关应用方面的效力仍然有限,因为其培训数据中音乐专用知识所占比例相对较小。为解决这一局限性,我们提议MUST-RAG,一个基于再获取增强一代(RAG)的综合框架,以将通用LMS改编成仅供文字解答的音乐问题解答(MQA)任务。RAG是一种技术,通过在生成问题答案时检索相关背景信息,为LLMS提供外部知识。为了优化音乐领域的RAG,我们(1) 提议MusWikiDB,即用于检索阶段的音乐专用矢量数据库,以及(2) 在推论和微调过程中利用背景信息,将通用LMMMLMSMS(MQA)MQA(MQA)调控域能力(MSUWIA)的常规微调方法大大超越了LMMMMMMA(GA)的适应能力,并大大改进了我们的普通和高级QA(MQA)计算效率标准。
Article 51
Title@2025-07-31 (4): Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
Title: Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette | Kulturelle Palette: Pluralisierung der Kulturausrichtung über Multi-Agenten-Palette | 文化调色板:通过多试剂调色板实现多元化文化协调 2412.11167v3 |
Authors (7): Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Zhiqing Cui, Hanqing Wang, Guisong Yang, Usman Naseem
Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods struggle to adapt to unknown culture after fine-tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework that redefines cultural alignment as an adaptive “color-blending” process for country-specific adaptation. Our approach harnesses cultural geography across five continents (Africa, America, Asia, Europe, Oceania) through three key steps: First, we synthesize the Pentachromatic Cultural Palette Dataset using GPT-4o, refining continental-level dialogues with Hofstede’s cultural dimensions to establish foundational cultural representations. Second, five continent-level alignment agents form specialized cultural communities that generate region-specific draft responses. Third, a Meta Agent employs Cultural MoErges to dynamically blend these cultural “colors” through attention-gated parameter merging, akin to mixing pigments on a palette, resolving conflicts while preserving cultural nuances to produce the final culturally-aligned response. Extensive experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.
大型语言模型(LLMS)在与不同的文化价值观保持一致方面面临着挑战,尽管它们一代人表现出色,这源于固有的单文化偏见和捕捉细微文化语义学的困难。现有方法在微调之后难以适应未知文化。受五大洲文化地理的启发,我们提出文化调和,这是一个多试剂框架,将文化调和重新定义为适合特定国家适应的“彩色混合”进程。我们的方法通过三个关键步骤利用五大洲(非洲、美洲、亚洲、欧洲、大洋洲)的文化地理:第一,我们利用GPT-4o将Pentachromatic 文化调色素数据集合成,改进大陆一级与Hofsstede文化层面的对话,以建立基础文化代表。第二,五个大陆一级的调和剂形成专门的文化社区,产生针对特定区域的回应草案。第三,一个Metagres利用文化调控点,通过引人注意的参数整合,将这些文化“颜色”动态融合为五大洲(非洲、美洲、亚洲、欧洲、大洋洲),通过三个关键步骤,通过三个步骤:第一,我们使用GPPT-4o,在维护文化调和保持文化调,同时解决冲突冲突,同时解决冲突,与Hoclates missionlates bes lavelates bes bes ex extimes ex ex ex各国现有文化调制文化基线。
Article 52
Title@2025-07-31 (4): FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain
Title: FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain | FinGAIA: Ein chinesischer Benchmark für KI-Agenten in der Real-World Financial Domain | 金融界:中国真实世界金融领域AI代理商基准 2507.17186v2 |
Authors (21): Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang
The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9\%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.
AI代理商的蓬勃发展为在各个领域使复杂任务自动化提供了前所未有的机会,然而,其金融部门的多步、多工具协作能力仍未得到充分探讨。本文件介绍了FinGAIA,这是旨在评价AI代理商在金融领域实际能力的一个端对端基准。FinGAIA由407项精心设计的任务组成,涉及七个主要的金融次领域:证券、资金、银行、保险、未来、信托和资产管理。这些任务分为三个层次的情景深度:基本业务分析、资产决策支助和战略风险管理。我们在零发环境中对10个AI主流代理商进行了评估。我们的工作为最佳代理商ChatGPT提供了与48.9的总体准确性,该代理商虽然优于非专业人员,但仍落后于金融专家,但仍超过35个百分点。错误分析揭示了五个反复出现的失败模式:跨模式协调不便、金融时序比亚、业务过程认识障碍等。这些模式指向未来研究的关键方向。我们的工作为金融领域提供了第一位代理商基准。ACTGPTGP/FAAA,目标是客观地评估和AFIAFI/BIAA。
Article 53
Title@2025-07-31 (4): Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages
Title: Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages | Multi-Hypothese Destillation von mehrsprachigen Neuralübersetzungsmodellen für ressourcenarme Sprachen | 多语言低资源语言多语言神经翻译模型的蒸馏 2507.21568v2 |
Authors (4): Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena
This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model’s output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage $n$-best lists from beam search to guide the student’s learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam search based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.
本文探讨了多语种预先培训的编码器脱coder-decoder翻译模型的序列级知识蒸馏(KD) 。 我们认为,教师模型的输出分布为学生提供了宝贵的洞察力,超出了通过光束搜索(标准解码方法)获得的近似模式,并提出了多功能分子蒸馏(MHD)方法(MHD),这是一种序列级知识蒸馏(MHD)方法,为每个源句生成多种译文。这为教师模型的分布提供了更大的代表性,使学生模型暴露在更广泛的目标端前缀中。我们利用了来自宝座搜索的最优名单来指导学生的学习,并考察了替代解码方法,以解决诸如低变异性和非常用符号代表不足等问题。关于低资源语言,我们的研究表明,虽然抽样方法可能略微损害翻译质量,而以光谱搜索方法相比,它们会以更大的变异性和词汇丰富度增强生成的化合物。这最终提高了学生模型的性能,并减轻了常常与KD相关的性别偏见。
Article 54
Title@2025-07-31 (4): LLMs and the Human Condition
Title: LLMs and the Human Condition | LLMs und der menschliche Zustand | LLM和人类条件 2402.08403v6 |
Authors (1): Peter Wallis
Theory based AI research has had a hard time recently and the aim here is to propose a model of what LLMs are actually doing when they impress us with their language skills. The model integrates three established theories of human decision-making from philosophy, sociology, and computer science. The paper starts with the collective understanding of reasoning from the early days of AI research - primarily because that model is how we humans think we think, and is the most accessible. It then describes what is commonly thought of as “reactive systems” which is the position taken by many philosophers and indeed many contemporary AI researchers. The third component to the proposed model is from sociology and based on the idea that human intelligence is a collective skill for which individuals are merely actors. The resulting model provides an alternate view of ``mind reading’’ in human communication.
基于理论的AI研究最近经历了一段艰难的时期,这里的目的是提出一个LLMs在以语言技能给我们留下深刻印象时实际在做哪些事情的模式。模型综合了哲学、社会学和计算机科学的三种人类决策既定理论。文件从AI研究的最初几天开始,对推理的集体理解,这主要是因为这个模型是我们人类的想法,也是最容易理解的。然后,它描述了人们通常认为的“反应系统”是许多哲学家和当代许多AI研究人员的立场。提议的模型的第三个组成部分来自社会学,其基础是人类智慧是一种集体技能,个人只是这种技能的参与者。由此产生的模型提供了人类交流中的“微读”的替代观点。
Article 55
Title@2025-07-31 (4): What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content
Title: What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content | Was ist Taboo für Sie? - Eine empirische Bewertung von LLMs Verhalten für Sensitive Inhalte | - 对行为举止为敏感内容的LLMS的 经验评估 2507.23319v1 |
Authors (6): Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri
Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.
虽然先前的研究主要侧重于明确培训模式,以缓和敏感内容并使之解毒,但对于LLMs是否在没有明确指示的情况下暗含净化语言的问题,探索有限,这项研究从经验上分析了GPT-4o-mini在对敏感内容进行反射和评估敏感性变化的程度时的隐含温和行为。我们的实验表明,GPT-4o-mini系统地将内容调低到不太敏感的类别,大幅降低贬低和禁忌语言。此外,我们评估LMs在对判决敏感性进行分类、将其表现与传统方法进行比较方面零弹射能力。
Article 56
Title@2025-07-31 (4): LiMe: a Latin Corpus of Late Medieval Criminal Sentences
Title: LiMe: a Latin Corpus of Late Medieval Criminal Sentences | LiMe: ein lateinischer Corpus der spätmittelalterlichen Strafurteile | Lime:拉丁美洲中世纪晚期刑事判决区 2404.12829v2 |
Authors (6): Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello
The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performances of such models remain behind the ones for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani, and thoroughly annotated by experts, in order to be employed for masked language model, as well as supervised natural language processing tasks.
拉丁语受到计算语言研究界的注意,多年来,该研究界积累了若干宝贵的资源,从详细的注解公司到复杂的语言分析工具,随着最近大型语言模型的出现,研究人员还开始开发能够生成拉丁文本矢量的模型,鉴于现有数据的差异,这些模型的性能仍然落后于现代语言的模型,在本文中,我们提供了Lime数据集,共有325份文件,摘自一系列中世纪手稿,称为Libri sententententiarum patestatis Mediolani,并得到了专家的详尽说明,以便用于遮罩语言模型以及受监督的自然语言处理任务。
Article 57
Title@2025-07-31 (4): SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy
Title: SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy | SequenzLayer: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht | 序列激光器:序列处理和串联神经网络变得容易 2507.23292v1 |
Authors (11): RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby
We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enables complex models to be immediately streamable, mitigates a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.
为实现这一目标,我们引入了神经网络层 API 和 序列模型库, 目的是容易地创建可以逐层执行的序列模型( 教师强制培训) 和一步步执行的序列模型( 自动递减抽样 ) 。 为了实现这一点, 层界定了它们随着时间推移的状态的清晰描述( 例如变换器 KV 缓存、 混凝土缓冲、 隐藏的 RNN ) , 并引入一个步骤方法, 该步骤方法将状态化, 测试为给无国籍的分层性职业带来相同结果。 以及序列激光器合同的其他方面使复杂模型能够立即流动, 减轻在串流和平行序列处理中产生的广泛常见的错误, 并且可以在任何深层学习图书馆中实施。 一个可比较和具有宣示性的 API , 连同一个全面的层层和梳理器组合, 将生产规模模型的构建从简单可流成的组件简化, 同时又保持强烈的正确性保证。 我们目前实施的SquecesLayers ( JAX, TensorFlow 2) 可在 httpsrence.
Article 58
Title@2025-07-31 (4): Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics
Title: Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics | Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik | 注重理由、适应性、效率和道德操守的LLMs项目的进展 2506.12365v2 |
Authors (8): Asifullah Khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman
This survey paper outlines the key developments in the field of Large Language Models (LLMs), including enhancements to their reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include the Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minor input. A significant focus is placed on efficiency, detailing scaling strategies, optimization techniques, and the influential Mixture-of-Experts (MoE) architecture, which strategically routes inputs to specialized subnetworks to boost predictive accuracy, while optimizing resource allocation. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. Additionally, it explores the role of LLMs in Agentic AI and their use as Autonomous Decision-Making Systems, and categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. The survey also identifies underexplored areas such as interpretability, cross-modal integration, and sustainability. While significant advancements have been made in LLMs, challenges such as high computational costs, biases, and ethical risks remain. Overcoming these requires a focus on bias mitigation, transparent decision-making, and explicit ethical guidelines. Future research will generally focus on enhancing the model’s ability to handle multiple inputs, thereby making it more intelligent, safe, and reliable.
本调查文件概述了大语言模型(LLMS)领域的关键发展,包括提高推理技能、适应各种任务的能力、提高计算效率和作出道德操守决定的能力。在弥合人与机器通信之间的差距方面最有效的技术包括“探索链”、“指导教学”和“从人类反馈中强化学习”。多式学习和微小或零发式技术的改进进一步增强了LMS处理复杂工作的能力。一个显著的重点是效率,详细说明了可靠的规模战略、优化技术和有影响力的混合专家(MoE)结构,从战略上将投入转向专门的子网络,以提高预测准确性,同时优化资源分配。这项调查还从更广的角度介绍了LMMS最近的进展,超越了模型结构或道德问题等孤立的方面。此外,它探讨了LMS在AI模型中的作用,以及它们作为自主决策系统的应用。将新的方法分类,加强LMRM的推理、效率和道德一致性。调查还查明了高透明度、高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高成本、提高成本、提高成本、提高成本、提高等领域。调查还继续注重。
Article 59
Title@2025-07-31 (4): Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability
Title: Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability | Iterative Reparatur mit schwachen Verifierern für wenige Aufnahmen in KBQA mit Unbeantwortbarkeit | KBQA 中无法解答的微小投射点校验器的迭代性修补 2406.14313v3 |
Authors (4): Riya Sawhney, Samrat Yadav, Indrajit Bhattacharya, Mausam
Real-world applications of KBQA require models to handle unanswerable questions with a limited volume of in-domain labeled training data. We propose the novel task of few-shot transfer for KBQA with unanswerable questions and contribute two new datasets for performance evaluation. We present FUn-FuSIC - a novel solution for our task that extends FuSIC KBQA, the state-of-the-art few-shot transfer model for answerable-only KBQA. We first note that FuSIC-KBQA’s iterative repair makes a strong assumption that all questions are unanswerable. As a remedy, we propose Feedback for Unanswerability (FUn), which uses iterative repair using feedback from a suite of strong and weak verifiers, and an adaptation of self consistency for unanswerabilty to better assess the answerability of a question. Our experiments show that FUn-FuSIC significantly outperforms suitable adaptations of multiple LLM based and supervised SoTA models on our task, while establishing a new SoTA for answerable few-shot transfer as well.
KBQA 的实时应用要求模型处理无法回答的问题,其内部标记的培训数据数量有限。 我们提议为 KBQA 提供一些无法回答的问题,并贡献两个新的数据集来进行绩效评估。 我们提出了FU-FusIC-FusIC-一个适用于我们的任务的新型解决方案,它扩展了FusICT KBQA, 即只对可回答的 KBQA 采用最先进的微小传输模式。 我们首先注意到, FusIC-KBQA 的迭接式修复有力地假设所有问题都是无法回答的。 作为补救措施,我们提出了“无法回答”反馈,它利用来自强弱核查员的反馈进行迭接式修复,并调整自我一致性以更好地评估问题的可回答性。我们的实验表明,FU-Fus-FusICSICM 大大超出了我们任务中基于并监督的多个LM 软件模型的适当适应性,同时为可回答的几发式转让建立一个新的 SoTA。
Article 60
Title@2025-07-31 (4): AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora | AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora | AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图 2505.23628v2 |
Authors (20): Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95\% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
我们展示了AutoSchemaKG,这是一个完全自主的知识图形构建框架,它消除了对预先定义的模型的需求。我们的系统利用大型语言模型,同时从文本中提取三重知识,并直接产生全面的模型,同时对实体和事件进行建模,同时利用概念化来将事件组织成语义类别。处理超过5 000万份文件,我们建造了ATLAS(自动三连和Schema感应),这是一个知识图表系列,拥有9亿+百万节点和59亿边缘。这个方法在多霍QA任务上优于最新水平的基线,并增强了LLM事实质量。值得注意的是,我们的系统感应实现了与人造图的95-语义比对齐,零人工干预,表明10亿级知识图与动态导成的Schemas可以有效地补充大型语言模型的参数知识。
Article 61
Title@2025-07-31 (4): Unveiling Super Experts in Mixture-of-Experts Large Language Models
Title: Unveiling Super Experts in Mixture-of-Experts Large Language Models | Enthüllen Super-Experten in Mixture-of-Experts große Sprachmodelle | 混合专家大语言模型中不懈的超级专家 2507.23279v1 |
Authors (6): Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan
Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
快速启动的探索混合模型(MoE)在加强大型语言模型(LLMS)的学习能力方面显示了希望。利用专家之间的内在重要性差异,最近的研究探索了专家级压缩技术,以提高MOE LLMs的效率。然而,现有方法往往依赖经验标准来确定关键专家,缺乏更深入的探索和了解专家的不同重要性。在本研究中,我们首次发现和调查了在模型前方推理期间在基础机制中发挥关键作用的专家群体。这些专家在开放源代码 MoE LLMs中很普遍,尽管专家人数有限,但正在运行这些专家导致模型性能显著下降(例如,运行三个原因导致 Quen3-30B-A3B 产生重复和不具有说服力的产出)。我们把这些专家称为超级专家。我们的全面分析为SEservicials提供了更深入的解析。 (i)Seres decricorations的分布非常罕见,但极富的Sericuldical-dealation在SEretailation中也具有显著的影响力。
Article 62
Title@2025-07-31 (4): AI-Reporter: A Path to a New Genre of Scientific Communication
Title: AI-Reporter: A Path to a New Genre of Scientific Communication | AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation | AI-记者:通向科学通信新一流的道路 2507.05903v2 |
Authors (1): Gerd Graßhoff
The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters – in less than three minutes. Using Arno Simons’ lecture on Large Language Models from the ``Large Language Models for the History, Philosophy, and Sociology of Science’’ workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.
AI-Reporter代表了科学出版实践的范式转变。本文件通过具体案例研究展示了我们的系统如何在不到3分钟内将学术介绍转变为可供出版的章节。我们以“科学历史、哲学和社会学大语言模型”研讨会(NEPI)为例,利用Arno Simons关于“大语言模型”的讲座,我们展示了技术创新如何弥合时间介绍和长期科学文献之间的差距。
Article 63
Title@2025-07-31 (4): Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis
Title: Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis | Bewertung der Mehrsprachigkeitsfähigkeiten von LLMs für Bengalen: Benchmark-Erstellung und Leistungsanalyse | 评价孟加拉多种语文能力:基准设定和业绩分析 2507.23248v1 |
Authors (5): Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat
Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient \& concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research is publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
孟加拉语是国家语言方案研究中代表不足的语言。然而,由于语言结构和计算限制的独特性,孟加拉语是一个挑战。在这项工作中,我们系统地调查阻碍孟加拉语国家语言方案业绩的挑战,重点是缺乏标准化的评价基准。我们随后在8个翻译数据集中评估了10个最近开放源码大语言模型(LLMs),并进行了全面的错误分析,以确定其主要失败模式。我们的调查结果显示孟加拉语与英语的绩效差距始终存在,特别是米斯特拉尔等小型模型和具体模型家庭。我们还发现,在某些结构中,如DeepSeek(DeepSeek),保持不同语言更稳定的性能。我们的分析揭示了象征性效率与LLM(LM)准确性之间的反比关系。当投入过于象征性时,模型往往表现更差,而更有效率的缩写效果则导致业绩的改善。这些调查结果突出了当前模型落后的关键领域,并强调了改进数据集质量和评估方法的必要性,特别是针对多种语言背景的小型模型。这项工作将推动对NLP(NLP)进行进一步的研究,有助于全世界使用先进语言技术的民主化。我们的分析揭示了象征性的代码和数据。用于这一公开研究。
Article 64
Title@2025-07-31 (4): P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
Title: P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication | P-ReMIS: Pragmatische Vernunft in der psychischen Gesundheit und einer sozialen Implikation | P-REMIS: 心理健康和社会影响方面的实用原因 2507.23247v1 |
Authors (2): Sneha Oram, Pushpak Bhattacharyya
There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.
最近,精神健康个人化聊天室的可解释性和发展有所进展,然而,关于解释性和对话讨论的推理方面以前尚未探讨过,因此,我们正在调查该领域大型语言模型(LLMS)的实用推理能力;我们引进P-ReMe数据集,提出精神健康隐含(隐含意义)和预言(隐含假设)的实用现象的修改定义;在定义之后,我们拟订了隐含性的两项任务和预言的一项任务;为数据集和所提出的任务确定基准,我们考虑了四种模式-Llama3.1、Mistral、MentalLALLAMA和Quwen。实验结果表明,Mistral和Qwen在这方面表现出了很强的推理能力。此外,我们还提议StiPRompts研究关于精神健康的污名,与最先进的LMM、GPT-4o mini、Depseek-chat和Claude-3.5haik-haku进行对比。我们的评估结论显示,与LLLA3.5-3.5号与其他的污名比较,是负责任的。
Article 65
Title@2025-07-31 (4): Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents
Title: Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents | Generalisiertes Verstärkungslernen für retriever-spezifische Abfrage-Rewriter mit unstrukturierten Real-World-Dokumenten | 利用无结构的 “ 现实世界文件 “ 检索特定查询卷卷的通用强化学习 2507.23242v1 |
Authors (6): Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon
Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce \textbf{RL-QR}, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}{\text{multi-modal}}$ achieving an 11\% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}{\text{lexical}}$ yielding a 9\% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
(RAG) 系统严重依赖有效的查询配置,以释放外部知识,而优化对多样化、非结构化现实世界文件的查询仍然是一项挑战。我们引入了\ textbf{RL-QR},这是一个强化的检索器特定查询重写的学习框架,它消除了对人附加说明数据集的需求,并扩展了对文本专用数据库和多模式数据库的适用性。通过对情景-问题配对并利用通用的Reward政策优化化(GROPO)、RL-QR火车针对特定检索器的查询重写者,从而提高了不同领域的检索性能。我们对工业内部数据的实验显示出了显著的改进,用$\ text{RL-QL-Qtext{Mult{Muld-modal$在NDCG@3中实现了11 相对增益,用于多模式的RAG和$text{RL-rlexli reliflical relitical },使软件回收者进一步增9。然而,对于语调化和混合检索器精炼机组的挑战依然存在着挑战依然存在着,在Rlical-rchal-rcal-rch的重新整理中,在重新整理中无法改进我们的研究,而提供一种可能的学习。
Article 66
Title@2025-07-31 (4): Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
Title: Cutting Through the Noise: Boosting LLM Performance on Math Word Problems | Schneiden durch den Lärm: Steigerung der LLM-Performance bei Math Word-Problemen | 通过噪音剪切:促进数学字问题LLM的LLM性能 2406.15444v4 |
Authors (6): Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
大型语言模型(LLMS)在包括解决数学词问题(MWPs)在内的各种任务方面非常出色,但与包含不相关信息的现实世界问题作斗争。为了解决这个问题,我们建议了一个快速框架,通过添加不相关变量来产生MWP的对抗变体。我们引入了一个数据集,即ProbleMATHIC,包含对抗性和非对抗性MWP。我们的实验显示,LMS很容易被数字噪音分散,导致对抗性MWP的平均相对性能下降~26%。为了减轻这一影响,我们从我们的数据集中提取了对抗性样本的微调LMS(Llama-2,Mistral)。对对抗性培训案例的微调提高了对抗性能的~8%,表明对噪音的强大性能,提高了为推理确定相关数据的能力。最后,为了评估我们快速性框架的通用性,我们引入了GSM-8K-Adv,即GSM-8K基准的对抗性变体。LMS在面对对抗性信息时继续挣扎,将性能降低到6%。
Article 67
Title@2025-07-31 (4): Framing Political Bias in Multilingual LLMs Across Pakistani Languages
Title: Framing Political Bias in Multilingual LLMs Across Pakistani Languages | Framing politische Bias in mehrsprachigen LLMs in pakistanischen Sprachen | 以多语种LLMs多种巴基斯坦语言界定政治偏见 2506.00068v2 |
Authors (3): Afrozah Nadeem, Mark Dras, Usman Naseem
Large Language Models (LLMs) increasingly shape public discourse, yet most evaluations of political and economic bias have focused on high-resource, Western languages and contexts. This leaves critical blind spots in low-resource, multilingual regions such as Pakistan, where linguistic identity is closely tied to political, religious, and regional ideologies. We present a systematic evaluation of political bias in 13 state-of-the-art LLMs across five Pakistani languages: Urdu, Punjabi, Sindhi, Pashto, and Balochi. Our framework integrates a culturally adapted Political Compass Test (PCT) with multi-level framing analysis, capturing both ideological stance (economic/social axes) and stylistic framing (content, tone, emphasis). Prompts are aligned with 11 socio-political themes specific to the Pakistani context. Results show that while LLMs predominantly reflect liberal-left orientations consistent with Western training data, they exhibit more authoritarian framing in regional languages, highlighting language-conditioned ideological modulation. We also identify consistent model-specific bias patterns across languages. These findings show the need for culturally grounded, multilingual bias auditing frameworks in global NLP.
大型语言模式(LLMS)日益影响公共对话,然而,大多数对政治和经济偏见的评价都集中在高资源、西方语言和背景上,这在巴基斯坦等低资源、多语言地区留下了重要的盲点,这些地区的语言认同与政治、宗教和区域意识形态密切相关;我们系统地评价了巴基斯坦五种语言:乌尔都语、旁遮普语、信德语、普什图语和巴洛奇语的13个最先进的LLMS中的政治偏见:乌尔都语、旁遮普语、信德语、普什图语和巴洛奇语。我们的框架将文化上适应的《政治指南测试》(PCT)与多层次的设置分析相结合,既包括意识形态立场(经济/社会轴心),也包括文体框架(内容、语调、重点)。提示与巴基斯坦背景下的11个社会政治主题一致。结果显示,尽管LMS主要反映与西方培训数据相一致的自由左方向,但它们在区域语言中表现出更加专制的设置,强调语言的意识形态调制。我们还确定了不同语言的一致的典型偏见模式模式。这些结论显示,全球国家语言需要基于文化的多语言的多语制审计框架。
Article 68
Title@2025-07-31 (4): AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Title: AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents | AgentSpec: Anpassbare Runtime Enforcement für sichere und zuverlässige LLM-Agenten | 安全可靠LLM代理商的可定制运行时间执法 2503.18666v3 |
Authors (3): Haoyu Wang, Christopher M. Poskitt, Jun Sun
Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
建在LLMS上的代理人越来越多地在不同领域部署,使复杂的决策和任务执行自动化;然而,他们的自主性带来了安全风险,包括安全脆弱性、法律违规和意外有害行动; 现有的缓解方法,例如基于模型的保障措施和早期执行战略,缺乏稳健性、可解释性和适应性; 为了应对这些挑战,我们提议Agrespe,这是用于具体规定和强制执行LM代理的运行时间限制的一种轻量级域别语言; 与Agrespe, 用户制定结构化规则,包括触发器、前提和执法机制,确保代理人在预先确定的安全边界内运作; 我们执行各种领域,包括代码执行、体现代理人和自主驾驶,展示其适应性和有效性; 我们的评估表明,在90%以上的代码代理人案件中,Agrespe成功地防止了不安全处决,消除了所有体现代理任务中的危险行动,并强制执行了自动车辆100%的合规性。 尽管有强有力的安全保障,但Agrespec公司仍然计算出较轻的重量,并有毫秒的间接费用。 通过将解释性、模块性和效率结合起来,AstrimSpeceptrosprospect 提供一种实用和可理解性的规则,我们用ALLMSDMSDMS的透明性规则, 也通过透明化了一种可实现了一种可操作性的方法, 。
Article 69
Title@2025-07-31 (4): Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs
Title: Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs | Ermöglichung der weniger scharfen Alzheimer-Krankheit Diagnose auf Tabular Biomarker Daten mit LLMs | 使小热阿尔茨海默氏病的疾病诊断能够用LMS在表示生物标记数据上进行 2507.23227v1 |
Authors (9): Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen
Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
对阿尔茨海默氏病(AD)这一复杂的神经退化性神经疾病(AD)的早期和准确诊断是复杂的神经退化性疾病,需要分析通常以表格形式呈现的多种生物标志(例如神经成像、遗传风险因素、认知测试和脑脊髓液蛋白),以表格形式对它进行分析。采用灵活的短片推理、多式整合和基于自然语言的可解释性,大型语言模型(LLLMS)提供了前所未有的机会,用结构化的生物医学数据进行预测。我们提议了一个称为TAP-GPT的新型框架,即Tabulal 阿尔茨海默氏氏病的预测GPT,以调整表GPT2,即最初为商业情报任务而开发的多表单专用LMM,用于使用结构化生物标志的结构性生物标志诊断诊断。我们的方法,即利用结构化生物数据学数据和细微的表型模型,将LMMS的先前知识用于更高级的预测基础。
Article 70
Title@2025-07-31 (4): Unveiling the Influence of Amplifying Language-Specific Neurons
Title: Unveiling the Influence of Amplifying Language-Specific Neurons | Enthüllen des Einflusses amplifizierender sprachspezifischer Neuronen | 消除扩增语言特有新元的影响 2507.22581v2 |
Authors (6): Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
与个别语言密切相关的LLM中语言特有神经元在LLM中被证明通过解除其作用来影响模式行为。然而,他们在扩增中的作用仍未得到充分探讨。这项工作调查了通过18种语言的干预措施扩大语言特有神经元的效果,包括使用三种主要以不同语言培训的模型,使用三种主要以不同语言培训的模型。我们通过使用拟议的语言指导转变评价分数来比较扩增因素在引导目标语言方面的效力,然后评价下游任务:常识推理(XCOPA、XWinograd)、知识(Include)和翻译(FLORES)。最佳扩增因素有效地将产出引向几乎所有经过测试的语言。在下游任务中使用这一因素的干预措施在某些情况下提高了自我语言性能,但通常会降低跨语言结果。这些调查结果突出了语言特有的神经元在多语种行为中的影响,在这些中增益特别有利于低资源语言,但为跨语言转移提供了有限的优势。
Article 71
Title@2025-07-31 (4): LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Title: LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models | LLM-Crowdsourced: Ein Benchmark-freies Paradigma zur gegenseitigen Bewertung großer Sprachmodelle | LLM-文献来源:用于对大语言模式进行相互评价的无基准建模 2507.22359v2 |
Authors (8): Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang
Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs’ true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit ‘‘memorization-based answering’’ by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).
尽管大型语言模型(LLMs)在各种任务中表现出非凡的能力,但评价其能力仍然是一项艰巨的任务。现有的评价方法存在数据污染、黑盒操作和主观偏好等问题。这些问题使得很难全面评价LLMs的真正能力。为了应对这些挑战,我们提议了一个新型的无基准评价模式(LLM-Crowds),它利用LLMs提出问题、独立回答和相互评价。这种方法综合了四个关键评价标准:动态、透明、客观和专业,而现有的评价方法无法同时满足。在数学和编程中对八个主流LMs的实验证实了我们区分LLM业绩的方法的优势。此外,我们的研究揭示出一些难以发现传统方法的新发现,包括但不限于:(1) Gemini展示了其他人之间最高的原始和专业问题设计能力;(2) 一些LLMS展示了“以模范为基础的回答”,因为对类似结构熟悉的问题认识不当;(3) LLMM的评价结果显示高度一致(破坏)。
Article 72
Title@2025-07-31 (4): Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders
Title: Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders | Model Directions, keine Worte: Mechanistische Themenmodelle mit Sparse Autoencodern | 模型方向,非单词:使用粗态自动编码器的机械专题模型 2507.23220v1 |
Authors (8): Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei
Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose \textit{topic judge}, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.
传统专题模型对于在大型文本集中发现潜在主题十分有效,然而,由于它们依赖一袋字表,因此难以捕捉精度抽象特征。虽然一些神经变异体使用较丰富的表达方式,但它们同样受到以单词列表的形式表达主题的限制,这限制了它们阐述复杂专题的能力。我们引入了机械主题模型(MTMs),这是一组以稀疏自动计算器(SAEs)所学的可解释特征运作的一类专题模型。MTMs通过界定这个精度丰富的空间,可以揭示更深的概念主题,并进行表达性特征描述。此外,在专题模型中,MTMs使得能够使用基于主题的指导矢量进行可控的文本生成。为了恰当地评估基于单词列表的方法的MTM专题,我们建议采用基于LMM的双向比较评估框架。在五个数据集中,MTMs匹配或超过关于一致性指标的传统和神经基线,专题法官一贯选择,并能够有效地指导LM产出。
Article 73
Title@2025-07-31 (4): Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires
Title: Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires | Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen | 大语言模式中的文化偏见:通过道德问卷评估AI代理 2507.10073v2 |
Authors (1): Simon Münker
Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs’ origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn’t consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.
人工智能系统是否真正代表了人类价值观,或者仅仅是在它们中间平均?我们的研究显示,现实存在:尽管有语言能力,大语言模型(LLMs)不能代表不同的文化道德框架。我们通过在19个文化背景中应用道德基础问卷,暴露了人工智能生成的和人类道德直觉之间的巨大差距。将多种最先进的LMs起源与人类基线数据进行比较,我们发现这些模型系统地将道德多样性同化。令人惊讶的是,扩大的模型规模并不能不断提高文化代表性的忠诚性。我们的调查结果挑战了LMs在社会科学研究中越来越多地作为合成人群使用,并凸显了当前人工智能调整方法中的基本局限性。没有数据驱动的一致,这些系统就无法捕捉到细微细的、文化特有的道德直觉。我们的结果要求更加有根基的调整目标和评估指标,以确保人工智能系统代表不同的人类价值观,而不是固化道德景观。
Article 74
Title@2025-07-31 (4): Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples
Title: Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples | Fehler sind die Steinschritte zum Erfolg: Erweitern Sie das wenige-heiße In-Context-Lernen durch die Nutzung negativer Muster | 失败是走向成功的一步步石:通过利用负面样本加强少许热的文体学习 2507.23211v1 |
Authors (4): Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Zhe Cui
Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-Cot. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, then concatenating them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar positive examples for context, validating that the additional information in negative samples aids in enhancing ICL performance through improved Positive sample selection.
大型语言模型展示了强大的点数的文字学习(ICL)能力,但是其性能非常敏感。最近的研究侧重于为每个输入查询检索相应的实例,不仅提高了学习过程的效率和可扩缩性,而且减轻了人工选择实例的固有偏差。然而,这些研究主要强调利用正面样本,同时忽略了负面样本中的额外信息,以便进行背景学习。我们提出了一个新颖的方法,利用负面样本更好地选择积极的样本实例,从而提高少数点数的ICL的性能。最初,我们根据Zero-Shot-Cot, 构建正和负样样子公司。然后,在推断期间,我们采用了基于语义的相似性方法,从正和负形样本中选择最相似的例子,供特定查询。随后,我们进一步从基于语义与负面示例相似的正面样本中获取正面实例,然后将它们与先前选定的正面实例混为一体,作为ICL的演示。实验结果表明,我们的方法超越了我们仅依靠最相似的积极性样本选择方法,通过改进的正面样本来提高积极性实例。
Article 75
Title@2025-07-31 (4): InfAlign: Inference-aware language model alignment
Title: InfAlign: Inference-aware language model alignment | InfAlign: Inference-aware Sprachmodellausrichtung | Infagign: 参考意识语言模型对齐 2412.19792v4 |
Authors (12): Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami
Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
语言模型的匹配是培训现代基因化语言模型的关键步骤。 匹配目标的目标是提高参照基准模型的匹配模式样本的赢率。 今天,我们越来越多地使用推算时间算法(例如最佳计算方法、受控解码、树搜索)从语言模型解码,而不是标准抽样。 我们显示,这种火车/测试错配使标准的RLHF框架在这种推论时间方法方面达到亚最佳标准RLHF框架。 为此,我们提议了一个推算-觉一致(InfAllign)框架框架框架框架,目的是根据基准模型优化一致政策中的推论时间赢率。 我们证明,对于任何推断-时间解码程序,最佳调整政策是解决标准RLHF问题的办法,而不是标准抽样。 这激励我们提供校准RL(InfAlign-CTRL)框架的校准和变校准方法来解决这一问题,这需要一个奖赏性校准步骤和KL- 定期奖赏步骤,目的是对基准模型进行最优化的修改。 最佳的校准率是最终的校准率。 显示我们最终的校准标准的校准率。
Article 76
Title@2025-07-31 (4): Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Title: Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages | Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen | 努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查 2505.14874v4 |
Authors (10): Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
由于数据稀缺,特别是非英语语言的数据稀缺,对读写语言的自动语音识别(ASR)仍然具有挑战性。为此,我们调整了英语读写语言语言(UASpeech)的语音转换模型,以将发言者的特征和偏差进行编码,然后将其用于将健康的非英语语言(FLEURS)转换成非英语的读写语言(FLEURS),然后将生成的数据用于微调多语种语言的ASR模型(MMS),即大众多语种语言的语音(MMS),以便改进读写语言识别。对PC-GITA(西班牙语)、EasyCall(意大利语)和SSNCE(塔米尔语)的评价表明,语音转换和手动转换都大大超越了现有MMS的功能和常规增强技术,如速度和节奏渗透。对生成数据的客观和主观分析进一步证实,生成的语音模拟了Dysarthric特性。
Article 77
Title@2025-07-31 (4): Explaining vague language
Title: Explaining vague language | Unbestimmte Sprache erklären | 解释含糊措辞 2404.18154v2 |
Authors (2): Paul Égré, Benjamin Spector
Why is language vague? Vagueness may be explained and rationalized if it can be shown that vague language is more useful to speaker and hearer than precise language. In a well-known paper, Lipman proposes a game-theoretic account of vagueness in terms of mixed strategy that leads to a puzzle: vagueness cannot be strictly better than precision at equilibrium. More recently, 'Egr'e, Spector, Mortier and Verheyen have put forward a Bayesian account of vagueness establishing that using vague words can be strictly more informative than using precise words. This paper proposes to compare both results and to explain why they are not in contradiction. Lipman’s definition of vagueness relies exclusively on a property of signaling strategies, without making any assumptions about the lexicon, whereas 'Egr'e et al.’s involves a layer of semantic content. We argue that the semantic account of vagueness is needed, and more adequate and explanatory of vagueness.
为何语言模糊?如果能够证明模糊语言比准确语言更有用,那么模糊语言可能会被解释和合理化。在一份众所周知的报纸上,利普曼提出混杂战略的模糊性游戏理论说明:模糊性绝对不能比平衡的精确性好。最近,斯派克特、斯派克特、莫蒂尔和韦尔希恩(Verheeen)提出了一个模糊性说明,证明使用模糊语言可以比精确语言更严格地说明信息。本文提议比较结果,解释为什么它们不矛盾。利普曼(Lipman)关于模糊性的定义完全依赖于信号战略的属性,而没有对词汇法作任何假设,而“Egrgr'e et al”等则涉及语义内容的层层。我们争辩说,需要模糊性语义的描述,并且更充分、更确切地解释模糊性。
Article 78
Title@2025-07-31 (4): Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
Title: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks | Geak: Einführung von Triton Kernel AI Agent & Evaluation Benchmarks | Geak:介绍Triton Kernel AI 代理和评估基准 2507.23194v1 |
Authors (10): Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum
The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)-a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to $63$% and execution speed up of up to $2.59$X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
对AI 生成的 GPU 内核的需求正在迅速增长,这受到产业和学术界对可升级、硬件优化解决方案的需求的影响。随着深层次学习工作量在复杂和多样性方面不断增加,必须使低层内核开发自动化,以满足业绩和生产力需求。主要云源提供商、半导体公司和研究机构目前正在对GPU的AI驱动代码生成进行大量投资,目的是减少手工优化努力,同时在AMD MI300X等硬件上实现近距离专家业绩。Triton语言是用于GPU编程的基于Python的快速加速、硬件优化解决方案。Triton语言是用于GPUNB内核的多功能目标,因为其业绩平衡和生成方便性能。在这项工作中,我们为基于Triton的 GPUP内核内核和Gech (Generage 高效的 AI-cent GPUN) 生成了一套框架,该框架利用尖端LMM(包括AM MI300X 和 MI250) 的快速性硬化硬化操作码,具体用于AUD内核化的硬化的硬化操作。
Article 79
Title@2025-07-31 (4): EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts
Title: EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts | EgoOops: Ein Datensatz zur Erkennung von Fehlern aus egozentrischen Videos, die sich auf Verfahrenstexte beziehen | EgoOops: 用于从 Egocentic 视频中检测错误动作的数据集, 指程序文字 2410.05343v3 |
Authors (10): Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori
Mistake action detection is crucial for developing intelligent archives that detect workers’ errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.
现有研究侧重于自由式活动中的目视明显错误,从而导致只用视频方式发现错误;然而,在跟踪文本的活动中,模型不能在不参考文本的情况下确定某些行动的正确性;此外,目前的错误数据集很少使用程序文本进行录像记录,但烹饪除外;为填补这些空白,本文件提议EgoOops数据集,在EgoOops数据集中,以自我为中心的视频记录不同领域遵循程序文本的错误活动。它有三种说明类型:视频文本调整、错误标签和错误描述。我们还提议了一种错误识别方法,将视频文本的校正和错误标签分类结合起来,以利用文本。我们的实验结果表明,纳入程序文本对于发现错误至关重要。数据可通过https://y-haneji.github.io/EgoOops-Project-project-page/查阅。
Article 80
Title@2025-07-31 (4): Leveraging LLMs to Create Content Corpora for Niche Domains
Title: Leveraging LLMs to Create Content Corpora for Niche Domains | LLMs nutzen, um Content Corpora für Niche Domains zu erstellen | 利用LMLM 来为新域创建内容公司 2505.02851v2 |
Authors (3): Franklin Zhang, Sonya Zhang, Alon Halevy
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategical framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
在本文中,我们引入了一种简化的方法,通过高效获取、过滤、构建和清洁基于网络的数据来生成高质量的、针对具体领域的公司。我们展示了如何利用大语言模型来解决规模化的复杂数据整理问题,并提出了一个包含有结构化内容提取和语义重复的LLM强化技术的战略框架。我们验证了我们在行为教育领域的做法,将它整合为30天Me,即习惯形成应用程序。我们称为30DayGen的数据管道从15K网页上提取和合成了3 531个30天的独特挑战。用户调查显示,5个网页的满意度为4.3分,91%的答卷者表示愿意使用经整理的内容实现习惯形成目标。
Article 81
Title@2025-07-31 (4): LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration
Title: LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration | LENS: Lerne Ensemble Vertrauen aus neuralen Staaten für Multi-LLM-Antwortintegration | LENS:从神经国家学习多LLM应答整合的集合信任 2507.23167v1 |
Authors (1): Jizhou Guo
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
大型语言模型(LLMS)在各种任务中表现出了令人印象深刻的成绩,不同模型在不同的领域和具体能力方面表现得不同。有效地结合对多个LLMS的预测对于提高系统稳健性和性能至关重要。然而,现有的混合方法往往依赖简单的技术,如投票或登录组合,这些技术忽视了不同情况下模型的不同信心和可靠性。在这项工作中,我们提议LENS(从神经国学习可综合信任),这是一种新颖的方法,通过分析内部代表来评估模型的信心。我们为每个LM公司培训了一个轻量线性线性信心预测器,该预测器能够利用分层的隐藏状态和正常的概率作为投入。这使得能够根据不同背景的可靠性对模型预测进行更细致的加权。我们的方法并不要求修改模型参数,而需要微不足道的额外计算。多曲和布林问答任务的实验结果表明,LENS比传统的混合方法要差很多。我们的研究结果表明,内部代表提供了宝贵的信号,用以确定模型信任度,并且能够有效地利用该软件学习。
Article 82
Title@2025-07-31 (4): Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Title: Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation | Vision-Language-Modelle sind in Bezug auf Expression Generation nicht pragmatisch kompetent | 视觉-语言模型在代言表达式生成中不具备实用能力 2504.16060v3 |
Authors (9): Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
表达生成是一项核心任务,用于评价视觉语言系统的实际能力,不仅需要准确的语义基础,还需要遵守合作通信原则(Grice,1975年)。然而,目前对视觉语言模型的评价往往忽略了务实层面,将区域语言模型降低到基于区域的字幕任务,忽视了Gricean的格言。在这项工作中,我们从务实的角度重新审视区域地名组,引入一个新的数据集(RefOI) 1.5k 图像,带有书面和口头参考表达的附加说明。通过系统评估最新VLMs,我们发现务实能力的三个关键缺陷:(1) 无法独家识别参考内容,(2) 包括过度或不相关的信息,(3) 与人的实际偏好不相匹配,例如没有充分利用最低限度的空间提示。我们还表明标准自动评价未能捕捉到这些务实的违规行为,加强了肤浅的提示,而不是真正的偏差的成功。我们的调查结果要求重新关注符合真实人类交流的实用知情模式和评价框架。
Article 83
Title@2025-07-30 (3): User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
Title: User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal | User Feedback in Human-LLM Dialogen: Ein Objektiv, um die Nutzer zu verstehen, aber laut als Lernsignal | 人类- LLLM 对话框中的用户反馈: 了解用户的镜头, 但将吵闹当作学习信号 2507.23158v1 |
Authors (3): Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi
Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user’s initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.
一旦采用了语言模型(LMS),他们就可以与用户进行长期互动,理想的是,根据他们的反馈不断演变。询问直接用户反馈可能会造成干扰;因此,我们研究从用户-LM互动日志中收集用户反馈;我们在两个用户-LM互动数据集(WildChat和LMSYS)中研究隐含用户反馈。首先,我们在用户-LM对话轨迹中分析用户反馈,提供这种反馈的时间和原因。第二,我们研究从这些隐性用户反馈中收集学习的信号。我们发现,用户反馈的内容(例如,用户希望澄清)不仅仅是极性(例如,用户对先前的模型回应不满意),还可以在短期设计的问题(Bench)中改进模型性能,而不是在更长期和复杂的问题(WildBench)中(WildBench)中改进模型性能。我们还发现,用户反馈的有用性很大程度上与用户最初迅速提供的质量挂钩。我们共同提供隐性用户反馈的深入研究,显示其潜力和局限性。
Article 84
Title@2025-07-30 (3): Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer
Title: Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer | Kann eine Größe für alle passen?: Messfehler in Multi-Document-Zusammenfassung Domain-Transfer | 能够一刀切吗? :在多文件概括性文件转让中衡量失败 2503.15768v2 |
Authors (2): Alexandra DeLucia, Mark Dredze
Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four approaches: end-to-end with special pre-training (“direct”), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer “failure” as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.
抽象的多文件总结(MDS)是用多种文件自动总结信息的任务,从新闻文章到与多位演讲人的对话,目前的MDS模式的培训方法可以分为四种方法:端到端,特别培训前(“直接”),块到端,当下总结,提取-当下总结,以及与GPT式模型的推论。在这项工作中,我们评估MDS模式在各种培训方法、领域和层面(参考类似性、质量和事实质量)之间(参考性、质量和事实质量),分析一个领域培训的模型如何和为什么不能在零发域转移设置中总结另一个领域(新闻、科学和对话)的文件。我们把域传输“失败”定义为事实质量下降,偏离目标程度更高,以及概要质量普遍下降。除了探索MDS模型的域转移外,我们还研究应用大众概括指标出框的潜在问题。
Article 85
Title@2025-07-30 (3): ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
Title: ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans | ISO-Bench: Benchmarking multimodaler Kausalität in visuellen Sprachmodellen durch verfahrenstechnische Pläne | ISO-Bench:通过程序计划确定视觉语言模型中多式因果关系基准 2507.23135v1 |
Authors (3): Ananya Sadana, Yash Kumar Lal, Jiawei Zhou
Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
理解不同模式之间的因果关系是现实世界环境中运行的多式联运模式面临的一个核心挑战。我们引入了ISO-Bench,这是评估模型能否推断视觉观察与程序文本之间因果关系的基准。每个实例都展示了任务步骤的图像和计划文字片断,目的是决定视觉步骤是在参考文本步骤之前还是之后发生。十个前沿愿景语言模型的评价结果显示业绩不佳:最佳零点F1仅为0.57,思考推理链在人类(0.98 F1)之后只产生微小的收益(高达0.62 F1)。 我们的分析进一步强调了改善多式联运模型中因果关系的具体方向。
Article 86
Title@2025-07-30 (3): Meta CLIP 2: A Worldwide Scaling Recipe
Title: Meta CLIP 2: A Worldwide Scaling Recipe | Meta CLIP 2: Ein weltweites Scaling-Rezept | Meta CLIP 2: 全球规模扩大食谱 2507.22062v2 |
Authors (16): Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learning from the worldwide web data is still challenging: (1) no curation method is available to handle data points from non-English world; (2) the English performance from existing multilingual CLIP is worse than its English-only counterpart, i.e., “curse of multilinguality” that is common in LLMs. Here, we present Meta CLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.
虽然CLIP成功地接受了来自英国世界的10亿比例图像文本对来自英国世界的训练,但将CLIP的培训进一步推广到从世界网络数据中学习,仍然具有挑战性:(1) 没有可处理来自非英语世界的数据点的校正方法;(2) 现有多种语言CLIP的英语表现比其只使用英语的对应方(即LLMMS常见的“多语种的诅咒”)差得多。 我们在这里介绍Meta CLIP 2, 在世界网络规模的图像文本配对中从零到十亿比例的首次配方培训CLIP。 为了概括我们的调查结果,我们进行了严格的推理,但为了应对上述挑战,我们提出了一个能够从来自非英语世界的数据中相互受益的配方。 零点图像网络分类,Meta CLIP 2 VIT-H/14比其只使用英语的对应方(LIP)要高出0.8%和 mSigLIP 2, 图像加0.7 % , 和新状态的CI-QM-QM-QM-QM-G-G-I-G-G-I-G-I-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G-G
Article 87
Title@2025-07-30 (3): Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity
Title: Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity | Enthüllen der Fragilität von vertrauenswürdigen LLMs durch chinesische Text-Ambiguität | 通过中文文字缩略图,揭开可信赖的LLM 易用性 2507.23121v1 |
Authors (7): Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang
In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.
在这项工作中,我们研究了关于大语言模型(LLMs)的可信赖性的一个关键研究问题:LLMs在遇到模棱两可的叙述性文字时如何表现,特别侧重于中文文本的模糊性;我们通过收集和产生带有上下文和相应的脱节配对的模糊性句子,并代表多种可能的解释,创建了基准数据集;这些附加说明的例子系统地分为三大类和9个子类;通过实验,我们发现LLMs在处理模棱两可时非常脆弱,暴露出与人类有重大差异的行为。具体地说,LLMs无法可靠地区分模棱两极的文字和毫不含糊的文字,在试图理解各种可能的含义时表现出对模棱两可的自信,表现出过度思考。我们的调查结果突出了目前LLMs的基本限制,这些限制对在实际应用中的语言模糊性具有重大影响,要求改进处理语言理解不确定性的方法。这个GitHub储存库可以公开查阅数据集和代码:https://github.com/ictup/LLLM-China-Dext-Dismbulguarguation)。
Article 88
Title@2025-07-30 (3): RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL
Title: RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL | RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL | RASL: 大规模数据库文本到 SQL 的检索增强的相连接表表 2507.23104v1 |
Authors (3): Jeffrey Eben, Aitzaz Ahmad, Stephen Lau
Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
尽管在数据库的大型语言模型(LLM)的自然语言界面方面取得了进展,但推广到企业一级数据目录仍然是一个未得到充分探讨的挑战。先前应对这一挑战的工作依赖于具体领域的微调(复杂的部署),未能利用数据库元数据中所包含的重要的语义背景。为解决这些局限性,我们引入了一个基于组成部分的检索结构,将数据库的图和元数据分解成独立的语义单位,每个单元单独编制索引,用于有针对性的检索。我们的方法优先考虑有效的表格识别,同时利用列级信息,确保检索的表格总数保持在可管理的背景预算之内。实验表明,我们的方法保持高回溯率和准确性,我们的系统运行基线超过结构不同和可用元数据庞大的数据库。我们的解决办法使实用的文本到SQL系统能够在没有专门微调的情况下在不同的企业环境中部署,从而解决自然语言数据库界面中关键的可扩展差距。
Article 89
Title@2025-07-30 (3): SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity
Title: SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity | SMART-Editor: Multi-Agenten-Framework für menschenähnliche Designbearbeitung mit struktureller Integrität | SMART-编辑:具有结构完整性的多机构设计设计框架 2507.23095v1 |
Authors (5): Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber
We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time rewardguided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.
我们提出了SMART-编辑器,这是一个结构化(海报、网站)和无结构化(自然图像)领域间构件布局和内容编辑的框架。与以往进行本地编辑的模型不同,SMART-编辑器通过两个战略维护全球一致性:Reward-Refine(一种推论-时间奖励制导精炼方法)和RewardDPO(一种使用符合奖励的布局配对的培训-时间偏好优化方法)。为了评价模型性能,我们采用了SMARTEdit-Bench(一种涵盖多数据、层层化编辑情景的基准)。SMART-编辑器超越了教官Pix2Pix和HIVE等强有力的基线,奖励DPO在结构化环境中取得了高达15%的收益,而Reward-Refine则展示了自然图像的优势。自动和人文评价证实了以奖励为指南的规划在生成精度一致和视觉一致的编辑过程中的价值。
Article 90
Title@2025-07-30 (3): Context-aware Rotary Position Embedding
Title: Context-aware Rotary Position Embedding | Context-aware Rotary Position Einbetten | 扶轮位置嵌入式 2507.23083v1 |
Authors (3): Ali Veisi, Delaram Fartoot, Hamidreza Amirzadeh
Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
定位编码是变换器结构的重要组成部分,使模型能够将序列顺序纳入自我注意机制。 扶轮定位嵌入器(ROPE)因其与相对位置编码和计算效率的兼容性而成为广泛采用的解决办法。 然而, RoPE依赖静态的、投入独立的正弦频率模式,限制了其模拟环境敏感关系的能力。 在这项工作中,我们建议CARoPE(Context-Aware 旋转定位嵌入器)是RoPE(CARoPE)的新通用模型,它动态地生成以象征性嵌入为条件的头部频率模式。这个设计引入了对象征和背景敏感的定位表示方式,同时保持 RoPE 效率和建筑简化。 CARoPE 使用象征性嵌入的捆绑式转换并把它们纳入旋转机制。 我们用GPT-2模型对FineWeb-Edu 10B数据集进行了评估。 实验结果表明,CARPE(C)持续地超越了对标志嵌入器和其他通用的直观定位定位定位定位定位定位显示位置,使得CAR- 更稳定成为了更低的升级。
Article 91
Title@2025-07-30 (3): Exploring In-Context Learning for Frame-Semantic Parsing
Title: Exploring In-Context Learning for Frame-Semantic Parsing | In-Context-Lernen für rahmensemantisches Parsing erforschen | 探索用于框架语义分析的内文学习 2507.23082v1 |
Authors (3): Diego Garat, Guillermo Moncecchi, Dina Wonsever
Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.
框架语义解析( FSP) 包含根据框架语义解析( FSP) 来识别上游和标签其参数。 本文调查使用大语言模型( LLMs) 来实施 FSP 而不进行模型微调的方法。 我们建议了一种方法,可以自动生成用于框架识别( FI) 和框架语义识别( FSRL) 子任务的具体任务提示, 仅依靠 FramtNet 数据库。 这些根据框架定义和附加说明的例子构建的提示, 用于指导六种不同的LMs 。 实验是在与暴力事件有关的一组框架上进行的。 方法取得了竞争性结果, FI 的F1分数为94.3%, FSRL 的得分为77.4%。 研究结果表明, ILL为域特定 FSP 任务的传统微调提供了实用和有效的替代方法。
Article 92
Title@2025-07-30 (3): Math Natural Language Inference: this should be easy!
Title: Math Natural Language Inference: this should be easy! | Math Natural Language Inferenz: das sollte einfach sein! | Math自然语言推论:这应该很容易! 2507.23063v1 |
Authors (7): Valeria de Paiva, Qiyue Gao, Hai Hu, Pavel Kovalev, Yikang Liu, Lawrence S. Moss, Zhiheng Qian
We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only “inference” in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
我们问当代LLMs是否能够在数学文本上进行自然语言推断(NLI)任务。我们称之为数学NLI问题。我们建造了一组数学NLI配对,其前提来自现有数学文本,其假设和金标签由具有研究水平数学和NLI领域经验的人提供。我们还用同一前提调查公司的质量,但其假设由LLMs自己提供。我们不仅调查各种LLM团体的性能,而且研究其集团的一致性。我们既有正面的,也有负面的。我们的积极发现包括:在某些情况下,使用多数LLLLMs的选票相当于在数学NLI区域使用人类标签数据。消极的一面是:LLMS仍然与数学语言斗争,有时甚至没有基本的推论。目前的模式在我们的数据中不象上一代那样容易使用假设的“推论 ” 。除了我们的调查结果外,我们还提供我们的Coropora的数据,作为未来数学NLILI工作的支持数据。
Article 93
Title@2025-07-30 (3): Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion
Title: Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion | Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion | Conan:一个零热适应性语音转换的中远在线网络 2507.14534v3 |
Authors (3): Yu Zhang, Baotong Tian, Zhiyao Duan
Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.
零点在线语音转换(VC)为实时通信和娱乐带来了巨大的希望。然而,当前的 VC 模型在实时限制下努力维护语义真实性,提供自然声音转换,并有效地适应隐性扬声器特性。为了应对这些挑战,我们引入了Conan, 这是一种粗略的在线零点声音转换模式,它保存源的内容,同时匹配音调和参考演讲的风格。 Conan 由三个核心部分组成:1) 一种流体内容提取器,它利用Emexex对低纬度流流内容进行编码;2) 一种调制风格编码器,它从参考演讲中提取精细的发光的文理学特征,用于强化风格适应;3) 一种Causal Shuffle Vocoder,它使用像素-shuffle机制来实施完全因果的HIFIGAN。实验性评估表明, Conan 在主观和客观的计量标准中超越基线模型。音样样本见https://aronz345.github.io/ConanDemo。
Article 94
Title@2025-07-30 (3): Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review
Title: Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review | Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Moslems in Large Language Models: A Systematic Review | 减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查 2506.18199v2 |
Authors (3): Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil
Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham’s systematic review methodology, we analyzed 8 empirical studies published between 2021-2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.
大型语言模式在各个领域都表现出了非凡的能力,然而,对文化偏见的关切,特别是对阿拉伯人和穆斯林的偏见,通过延续有害的陈规定型观念和边缘化,提出了重大的道德挑战。尽管人们日益认识到LLMS中的偏见,但针对阿拉伯和穆斯林代表性的迅速工程战略仍然没有得到充分研究。这种混合系统审查审查了这些技术,为研究人员和从业人员提供了循证指导。根据PRISMA准则和基切纳姆的系统审查方法,我们分析了2021-2024年调查减少偏见战略期间发表的8项经验研究。我们的调查结果揭示了5种主要的迅速工程方法:文化促进、影响性边缘、自我贬低技术、结构化多步管道和参数优化持续推动。尽管所有方法都显示有可能减少偏见,但各种研究和偏见类型之间有很大差异。根据PRISMA准则和基切纳姆的系统审查方法,我们分析了8项经验,显示了最高的总体效果,减少了87.7%的偏见,尽管它们需要更多的技术专长。文化促进更加广泛的可获取性。这些结果突出表明了在减少文化偏见研究领域获得及时的工程选择,同时发展重大的结构调整研究领域,应该确定一个快速的弹性研究领域。
Article 95
Title@2025-07-30 (3): Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
Title: Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning | Wo man Demos in Deinem Prompt zeigt: Ein positionelles Bias des In-Context-Lernens | 在哪里显示您快速的演示 : 内容学习的定位偏见 2507.22887v1 |
Authors (2): Kwesi Cobbina, Tianyi Zhou
In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates an unexplored new positional bias of ICL for the first time: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos’ position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30\% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
内文学习(ICL)是大型语言模型(LLMs)的新兴关键能力,在推论期间,通过将一些演示(演示)包含在内,可以进行一些微小的学习。然而,我们发现,ICL的表现可以敏感地考虑演示的选择及其顺序。本文首次调查了ICL的未探索的新定位偏差:我们观察到,当演示、系统快速和LLM投入中的用户信息变小时,预测和准确性就会急剧移动。我们称这种偏差为DEMOS在PROMPT(DPP)中的定位。我们设计了一个系统化的评价管道,研究这种在分类、问题回答、概括和推理方面的立场偏差。我们引入了两个指标,ACCURACY-Change和PREniction-Change,以量化由于降级立场的变化而导致的净收益和产出波动性。我们观察到,四个开源模型家庭的10个LMSMs(QEN、LA3、MIA3、MIA最准确性)中的定位偏差。我们设计了一种系统评价管道对用户的准确性、CERAL6和最精确的预测结果进行核查。
Article 96
Title@2025-07-30 (3): C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Title: C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations | C3: Ein zweisprachiger Benchmark für gesprochene Dialogmodelle zur Erforschung von Herausforderungen in komplexen Gesprächen | C3:探讨复杂对话挑战的口头对话模式的双语基准 2507.22968v1 |
Authors (3): Chengqian Ma, Wei Tao, Yiwen Guo
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
最近,口语对话模式(SDMs)因其直接对用户的询问作出语音反应的能力而引起极大关注。尽管这种模式越来越受欢迎,但在全面了解其在理解和模拟人类对话方面的实际效力的研究方面存在着差距。这与基于文本的大语言模式(LLMs)相比尤其如此,后者得益于广泛的基准;由于口语对话的独特性,人类的语音互动本身就比文字更为复杂。模糊性构成一个挑战,来自多种语言等语义因素,以及声学方面,如地形学、异名学和压力模式。此外,背景依赖性,如疏漏、共同参照和多方向互动,增加了人类对话动态的复杂性。为了阐明目前SDM发展的状况并应对这些挑战,我们在本文件中提供了一套基准数据集,其中包括1 079例英语和中文的病例。这一数据集与基于LM的评估方法密切相关,有助于全面探索SDMs在应对这些实际挑战方面的表现。
Article 97
Title@2025-07-30 (3): GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis
Title: GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis | GeoOutageKG: Ein multimodaler Geospatiotemporaler Wissensgraph für die Multiauflösungsanalyse von Stromausfällen | GeoouteageKG:多分辨率电源外向分析多式地球观测时知识图 2507.22878v1 |
Authors (4): Ethan Frakes, Yinghui Wu, Roger H. French, Mengjie Li
Detecting, analyzing, and predicting power outages is crucial for grid risk assessment and disaster mitigation. Numerous outages occur each year, exacerbated by extreme weather events such as hurricanes. Existing outage data are typically reported at the county level, limiting their spatial resolution and making it difficult to capture localized patterns. However, it offers excellent temporal granularity. In contrast, nighttime light satellite image data provides significantly higher spatial resolution and enables a more comprehensive spatial depiction of outages, enhancing the accuracy of assessing the geographic extent and severity of power loss after disaster events. However, these satellite data are only available on a daily basis. Integrating spatiotemporal visual and time-series data sources into a unified knowledge representation can substantially improve power outage detection, analysis, and predictive reasoning. In this paper, we propose GeoOutageKG, a multimodal knowledge graph that integrates diverse data sources, including nighttime light satellite image data, high-resolution spatiotemporal power outage maps, and county-level timeseries outage reports in the U.S. We describe our method for constructing GeoOutageKG by aligning source data with a developed ontology, GeoOutageOnto. Currently, GeoOutageKG includes over 10.6 million individual outage records spanning from 2014 to 2024, 300,000 NTL images spanning from 2012 to 2024, and 15,000 outage maps. GeoOutageKG is a novel, modular and reusable semantic resource that enables robust multimodal data integration. We demonstrate its use through multiresolution analysis of geospatiotemporal power outages.
检测、分析和预测断电对于电网风险评估和减灾至关重要。每年大量断电,飓风等极端天气事件加剧了这种情况。现有的断电数据通常在县一级报告,限制了其空间分辨率,使其难以捕捉局部模式。然而,它提供了极佳的时间颗粒度。相比之下,夜光卫星图像数据提供了显著更高的空间分辨率,使得能够对断电情况进行更全面的空间描述,提高了评估灾害事件后断电的地理范围和严重程度的准确性。然而,这些卫星数据只能每天提供。将广地视觉和时序数据源整合到统一的知识代表中可以大大改进断电的探测、分析和预测性推理。在本论文中,我们提出了GeoOutageKG,这是将各种数据源(包括夜光卫星图像数据、高分辨率超高分辨率断电流出图)以及州级断流数据断流报告。我们通过将2012年GOOOOOO的源数据与20 GOOOOOOOOOLA、从20 GOOOOO的SOODLA、从2014年的SOOOOOOOLOLOLA、30的SOOLOOOOLOLOOOOOOOOD的20, 20OOLOLOLOLOOOOOOLOLOOOOOOOOOOD的20OO、从20OOOOOOOOOOOOOOOOOOOODLODLLLLODLODLDLLODLLLODLOODLODLLOD)综合数据整合数据整合数据整合了从20000、从20000到20000的多数据到2014的多数据到20000OOOOOOOOOOOOOOOOOOOO的多数据流数据流数据流数据流数据流数据流数据流数据流数据。
Article 98
Title@2025-07-30 (3): FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Title: FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models | FRED: Finanzielle Retrieval-erweiterte Erkennung und Bearbeitung von Halluzinationen in Sprachmodellen | FRED: 财务检索-加强发现和编辑语言模型中的幻觉 2507.20930v2 |
Authors (3): Likun Tan, Kuan-Wei Huang, Kevin Wu
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/shield.
大型语言模型的幻觉给需要事实可靠性的应用,特别是金融等高端领域,带来了严峻的挑战。这项工作为根据所提供的背景,探测和编辑模型生成的回复中不正确内容提供了有效方法。鉴于用户定义的域特定误差分类,我们通过将贴标签错误插入财务问题解答公司,然后将四种语言模型(Phi-4、Phi-4-mini、Qwen3-4B和Qwen3-14B)微调四种语言模型(Phi-4、Phi-4-mini、Qwen3-4B和Qwen3-14B),以发现和编辑这些事实不准确之处。我们最优秀的模型(经微调的Phi-4-4)在二进F1评分上实现了8%的改进,总体检测性能比OploaAI-o3增加了30%。 值得注意的是,我们经过微调的Phi-4-mini模型尽管只有40亿个参数,但仍保持竞争性的性能,与OpenAI-o3相比,总体检测下降了0.1%。我们的工作为在发现和编辑财务生成中发现和编辑事实不一致事实不一致提供了实用解决办法的实用解决办法,同时采用通用的通用框架,可以加强信任和调整。
Article 99
Title@2025-07-30 (3): Past Meets Present: Creating Historical Analogy with Large Language Models
Title: Past Meets Present: Creating Historical Analogy with Large Language Models | Vergangenheit trifft Gegenwart: Historische Analogie mit großen Sprachmodellen erstellen | 过去曾出席的会议:创建具有大语言模式的历史分析 2409.14820v2 |
Authors (8): Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao
Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.
历史类比将已知的过去事件与当代但不为人所知的事件进行比较,是帮助人们作出决定和理解世界的重要能力。然而,应用史研究表明,人们很难找到适当的类比。以前在AI社区进行的研究也忽略了历史类比。为了填补这一空白,我们在本文中着重研究历史类比获取任务,目的是为某个特定事件获取类似的历史事件。我们探索根据不同的大语言模型(LLLMs)获取历史类比的检索和生成方法。此外,我们提出一种自我反省方法,以在LLMs产生历史类比时减少幻觉和陈规定型观念。通过人类评估和我们专门设计的自动多维评估,我们发现LLMs通常具有良好的历史类比潜力。通过使用我们的自我反省方法,模型的性能可以进一步改进。
Article 100
Title@2025-07-30 (3): The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
Title: The Incomplete Bridge: How AI Research (Mis)Engages with Psychology | Die unvollendete Brücke: Wie KI-Forschung (Mis) mit Psychologie verstrickt | 不完整的桥梁:人工智能如何研究(Miss)心理学的组合 2507.22847v1 |
Authors (5): Han Jiang, Pengda Wang, Xiaoyuan Yi, Xing Xie, Ziang Xiao
Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.
社会科学积累了丰富的理论和方法,用于调查人类的思想和行为,同时对人造情报系统的设计与理解提供了宝贵的见解。本研究以心理学为突出案例,通过分析2023年至2025年在首屈一指的AI网站上发表的1 006份与LLM有关的论文以及它们所引用的2 544份心理学出版物,探索AI与该领域之间的跨学科协同作用。我们通过分析,确定了跨学科融合的关键模式,确定了最经常被引用的心理学领域,并突出了尚未探讨的领域。我们进一步审视了心理学理论/框架是如何运作和解释的,找出了应用不当的常见类型,并为更有效的纳入提供了指导。我们的工作为AI与心理学之间的跨学科接触提供了全面的地图,从而促进了更深入的合作和推进了AI系统。
Article 101
Title@2025-07-30 (3): ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer
Title: ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer | ReverBERT: Ein State Space Model für eine effiziente textgesteuerte Sprachübertragung | ReverBERT: 高效发短信语音风格转让国家空间模型 2503.20992v2 |
Authors (3): Michael Brown, Sofia Martinez, Priya Singh
Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.
文本驱动的语音风格传输旨在模拟一个口语表达方式的进化、节奏和节奏,以匹配文本描述中的文体提示。虽然现有方法能够利用大型神经结构或预先培训的语言模型,但计算成本通常仍然很高。在本文中,我们提出一个高效的文本驱动语音风格传输框架,以吸引来自国家空间模型(SSSM)模式的灵感,这种模式的灵感来自基于图像的Wang和Liucite{Wang2024stemamba}。与图像域技术不同,我们的方法在语音空间中运作,并整合了潜在语音特征的离散四倍变换,以促成平稳和连续的风格调制。我们还提出一个新的 & emph{ 以透明为基础的 SSSM} 层,用于连接带有声学属性的文本样式描述符,大大缩短了推断时间,同时保留了高质量的语音特征。关于基准语言囊体的大规模实验表明,在自然性、明确性研究和计算过程中,我们的语音风格转换(We-destrutional-traction)进一步超越了我们的语言风格、直观的基线。
Article 102
Title@2025-07-30 (3): Cross-Modal State-Space Graph Reasoning for Structured Summarization
Title: Cross-Modal State-Space Graph Reasoning for Structured Summarization | Grenzüberschreitende State-Space-Graph-Gründung für strukturierte Zusammenfassung | 结构归纳的跨模式国家空间图 2503.20988v2 |
Authors (3): Hannah Kim, Sofia Martinez, Jason Lee
The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a \textit{Cross-Modal State-Space Graph Reasoning} (\textbf{CSS-GR}) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.
从大型和多式联运数据中提取精密、有意义的摘要的能力,对于从视频分析到医学报告等许多应用都至关重要,从大型和多式数据中提取精密、有意义的摘要的能力,对于从视频分析到医学报告等多种应用都至关重要。以往的跨现代合成方法往往受到高计算间接费用和有限解释的影响。在本文中,我们提议了一个\textit{Cross-Modal State-Space图形解释}(\textbf{CSS-GR})框架,该框架将基于图表的信息传递的州-空间模型与以往关于高效的州-空间模型的工作所启发的信息传递纳入其中。与目前依靠纯顺序模型的方法不同,我们的方法构建了一张图表,记录了各种模式之间和内部的关系,允许对文本流和视觉流进行更全面的推理。我们表明,我们的方法在保持计算效率的同时,大大改进了计算质量和可解释性,同时根据标准的多式合成基准进行了验证。我们还提供了全面的模拟研究,以突出每个组成部分的贡献。
Article 103
Title@2025-07-30 (3): Scaling RL to Long Videos
Title: Scaling RL to Long Videos | Skalierung von RL zu langen Videos | 缩放 RL 到长视频 2507.07966v3 |
Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
我们引入了一个完整的配置框架,将视觉语言模型的推理推理升级到长视频,利用强化学习;我们应对长视频推理的独特挑战,为此整合了三个关键组成部分:(1) 大型数据集LongVideo-Reason,由104K长视频QA配对组成,配有体育、游戏和 vlogs等不同领域的高质量推理说明;(2) 双阶段培训管道,将视频模型的推理范围扩大到有想象力的精细调整(Cot-SFT)和强化学习(RL);(3) 长视频RL,名为多模式强化序列平行(MRSP)的培训基础设施,包括序列平行和基于VLLLM的引擎,用于长视频,用于高效推出和预填。 在我们的实验中,LA-RVIA-R7B在视频基准上取得强劲的成绩,在 RVIA 和RVIL 上,在视频模型上支持我们连续升级的RVA-7B,在视频系统上持续超过 RVA-R-SIL的升级。
Article 104
Title@2025-07-30 (3): MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models | MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen | MiniLongBunench:大语言模式低成本长方背景理解基准 2505.19959v2 |
Authors (5): Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which means the inefficiency in evaluation. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
长期背景了解(LCU)是当前大型语言模型(LLMM)中一个重要的探索领域。然而,由于长文本数据固有的长期性质,现有LCULLLML基准往往导致高得令人望而却步的评价费用,如测试时间和推论费用等。通过广泛的实验,我们发现现有的LCU基准存在重大冗余,这意味着评价效率低下。在本文中,我们建议了一种针对信息特征稀少的长文本数据的简明数据压缩方法。我们通过运行众所周知的LCU基准LongBench,我们创建了MiniLongBench。这个基准只包括六个主要任务类别和21项不同任务中的237个测试样品。通过对60多个LLMS的经验分析,MiniLBench实现了平均评价费用降至仅4.5 % ,同时保持了与LongBench结果的0.97的平均相关等级系数。因此,我们的MiniLongBench作为低成本基准,极有可能大大推动今后对LCUMS的能力进行研究。见 https://github.com/MilkTink-LAmb/M.M.M.M.M.
Article 105
Title@2025-07-30 (3): Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization
Title: Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization | Jenseits natürlicher Sprachpläne: Struktur-Bewusst-Planung für Abfrage-fokussierte Tabellenzusammenfassung | 超越自然语言计划: 查询用户使用表的结构-软件规划 2507.22829v1 |
Authors (3): Weijia Zhang, Songgaojun Deng, Evangelos Kanoulas
Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.
为了解决这个问题,我们提议将模式转换为结构化的表述。我们引入了一个新的结构化计划,即TaSoF,受传统多试剂系统形式主义的启发,以及一个框架,即SPaGe,将推理过程在三个阶段正式化:1)结构化规划,从查询中产生TaSoF,2)图表化执行,将计划步骤转换成SQL和模型依赖性,通过定向循环图平行执行,将计划步骤转换为SQL和模型依赖性,3)简要生成,以产生注重查询的摘要。我们的方法明确捕捉了复杂的依赖性,提高了可靠性。关于三个公共基准的实验表明,SPaGe在单一和多表环境中都一贯地超越了先前的模式,显示了结构化代表的优势。
Article 106
Title@2025-07-30 (3): SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Title: SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs | RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs | 空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v3 |
Authors (8): Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2Dto-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.
人类可以直接想象和操控其头脑中的视觉图像,这种能力被称为空间可视化。多式大型语言模型(MLLM)支持基于想象的推理,而空间可视化却仍然没有得到充分评价,这通常体现在更广泛的数学和逻辑评估中。现有的评价往往依靠IQ测试或数学竞赛,这些测试或数学竞赛可能与培训数据重叠,损害评估可靠性。为此,我们引入了空间Viz-Bench,这是一个综合的多式空间可视化多式基准,有四个次功能的12项任务,包括1,180个自动产生的问题。我们对33个最先进的MLLLMS的评估不仅揭示了广泛的性能差异,并展示了基准的强烈的歧视性力量,而且还揭示了反直觉的发现:模型显示与人的直觉不相匹配的难感,展示了2D到3D的性性悬崖,默认了对视觉化的公式衍生法,而且矛盾的是,在开放源模型中引发的连锁操作性下降。通过对错误类型进行统计和定性分析,SpaceViz-Ben 演示了目前现有的空间数据,从而继续展示实地数据缺陷。
Article 107
Title@2025-07-30 (3): DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph
Title: DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph | DBLPLink 2.0 – Ein Entity Linker für den DBLP-Wissenschaftsgraphen | DBLPLink 2.0 - DBLPLP 学术知识图的实体链接 2507.22811v1 |
Authors (3): Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck
In this work we present an entity linker for DBLP’s 2025 version of RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker using LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the “yes” token output at the penultimate layer of the LLM.
在这项工作中,我们为DBLP2025年版的RDF知识图提供了一个实体链接器。与2022年版本相比,DBLP现在将出版地点视为一个新的实体类型,名为 dblp:Stream。在先前版本的DBLPLink中,我们培训了KG编组和重新排序人员使用数据集来产生实体链接。与此形成对照的是,在这项工作中,我们使用一种新颖的方法开发了一个零光实体链接器,我们根据LLM倒数第二层的“是”象征性输出的日志概率重新排序候选实体。
Article 108
Title@2025-07-30 (3): IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation
Title: IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation | IterKey: Iterative Keyword Generation mit LLMs für verbesserte retrieval Augmented Generation | IterKey: 循环关键字生成,并配有 “ 增强再获取能力增量一代 “ 的LMML 2505.08450v2 |
Authors (4): Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe
Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.
为解决这些问题,我们引入了IterKey,这是一个由LLM驱动的迭代关键词生成框架,通过吸收外部文件来补充大语言模型(LLMS)的内置知识。然而,现实世界应用程序不仅要求准确性,而且要求可解释性。虽然密集的检索方法提供高精度,但缺乏解释性;反过来,分散的检索方法提供了透明度,但往往由于依赖关键词匹配而不能反映查询的全部意图。为解决这些问题,我们引入了IterKey,这是一个LLM驱动的由LLM驱动的迭代关键词生成框架,通过稀薄的检索增强RAG。 IterKey由三个LLM驱动的阶段组成:为检索生成关键词,根据检索文件生成答案,并验证答案。如果验证失败,这一过程会与精细的关键字重复。在四种QA任务中,实验结果显示IterKey比基于B25的RAG和简单基线实现了5%至20%的准确性改进。其性与使用密度模型的密集检索-RAG和先前的重复性查询改进方法相当。概括,IterKey是一种新型的精准性调整方法。
Article 109
Title@2025-07-30 (3): Towards the Law of Capacity Gap in Distilling Language Models
Title: Towards the Law of Capacity Gap in Distilling Language Models | Auf dem Weg zum Gesetz der Kapazitä tigkeitslücke bei der Destillierung von Sprachmodellen | 迈向《语文模式再学习能力差距法》 2311.07052v4 |
Authors (6): Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu
Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of substantial capacity gap between the teacher and student. This issue, often referred to as the \textit{curse of capacity gap}, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the \textit{law of capacity gap} inducted from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher consistently scales linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.
语言模型(LM)蒸馏法(LM)旨在将大型教师LM中的知识蒸馏成小学生的大型LM中的知识。作为LM蒸馏所面临的一个关键问题,高级学生往往来自规模相对较小而不是较大规模的教师,特别是在教师与学生之间能力差距很大的情况下。这个问题通常被称为能力差距的缩放管 , 表明在教师规模扩大过程中,很可能有一个最优秀的教师, 产生成绩最好的学生。 因此, 需要对各种规模的教师进行蒸馏试验,以确定在大型LMS(LLMs)中具有计算密集性的最佳教师。 本文通过在对广泛小规模( < 3B > ) LMS(<3B] LMS)进行的初步研究中引入的\ textitit{能力差距法} 来解决这一关键瓶颈问题, 最佳教师与不同模式和数据尺度的学生比例一致, 将法律推广到LMTM(7B)的蒸馏法, 通过在更大的规模(7B)中将法律推广到LM(LM)中,我们成功地获得了超越了多种磁体的磁体竞争阵列。
Article 110
Title@2025-07-30 (3): MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations
Title: MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations | MFTCXplain: Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärungen | MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集 2506.19073v2 |
Authors (9): Jackson Trager, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Flor Plaza-del-Arco, Yalda Daryanai, Farzan Karimi-Malekabadi, Francielle Vargas
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
由于这些系统被用于社会敏感的任务,确保大语言模型的道德推理能力日益成为日益令人关切的问题,然而,目前的评价基准存在两个主要缺陷:缺乏说明,说明道德分类的理由,限制了透明度和可解释性;主要侧重于英语,限制了对不同文化背景的道德推理的评估;在本文件中,我们引入了MFTCXplain,这是一个多语种的基准数据集,用于利用道德基金会理论(MFT)通过仇恨言论多语种解释评价LLM的道德推理;数据集包括葡萄牙语、意大利语、波斯语和英语的3 000个推文,附有二元仇恨言论标签、道德类别和跨层次的文字理由说明;经验性结果突出表明了在道德推理任务中LLM产出与人说明之间的不协调;虽然LMs在仇恨言论探测方面表现良好(F1至0.836),但其预测道德情绪的能力明显薄弱(F1 < 0.35),但理由调整仍然主要限于代表性不足的语言。
Article 111
Title@2025-07-30 (3): DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Title: DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router | DeepSieve: Informationen über LLM-as-a-Knowledge-Router | 深筛选:通过LLM-as-a- knowledge-Router获取信息 2507.22050v2 |
Authors (8): Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our codes are available at https://github.com/MinghoKwok/DeepSieve.
大型语言模型(LLMS)在很多推理任务上都非常出色,但是由于无法动态地获取最新或特定领域的信息,因此与知识密集型的询问纠缠不休。 回收原始一代(RAG)已经成为一个很有希望的解决方案,使LLM能够以外部来源提出其应对办法,但现有RAG方法缺乏对查询和来源方的精细控制,往往导致噪音检索和浅浅浅推。在这项工作中,我们引入了DeepSieve,这是一个包含通过LLM-as-a-nown-router 等手段获取信息的代理RAG框架。DeepSieve将复杂的查询解密到结构化的子问题和循环路径中,每个途径都可追溯到最合适的知识来源,通过多阶段的蒸馏过程过滤不相干的信息。我们的设计强调模块性、透明度和适应性,利用最近在代理系统设计上的进展。关于多种来源的多霍普 QA任务的实验表明推理深度、检索精确性和对常规RAG方法的可解释性。我们的代码可在 http://giepub.MKwow/Mings.我们可在http://http://httpss://
Article 112
Title@2025-07-30 (3): GATEAU: Selecting Influential Samples for Long Context Alignment
Title: GATEAU: Selecting Influential Samples for Long Context Alignment | GATEAU: Auswahl von einflussreichen Proben für lange Kontextausrichtung | GATEAU:为长期对齐选择有影响的样本 2410.15633v6 |
Authors (10): Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model’s performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples, and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
以往的研究试图扩大现有数据量,方法是综合长期的遵循指令的样本,因为建立这样一个数据集往往对通知员来说具有挑战性;然而,缺乏确保数据质量的明确战略可能会引入低质量样本,并限制模型的性能。因此,我们提议GATEAU(GATEAU),这是一个新颖的框架,用以通过查明富含长距离依赖关系的有影响的样本来应对长期背景一致的独特挑战。具体地说,GATEAU衡量两个基本方面的长期依赖性:由于长期依赖性而难以产生目标反应,以及由于这种依赖性而难以理解长期投入。全面实验表明,GATEAU有效识别了有影响力的样本,而经过培训的这些样本模型显示了更好的指导跟踪和长文本理解能力。
Article 113
Title@2025-07-30 (3): MASCA: LLM based-Multi Agents System for Credit Assessment
Title: MASCA: LLM based-Multi Agents System for Credit Assessment | MASCA: LLM-basiertes Multi-Agenten-System zur Bonitätsbeurteilung | MASCA: 以LLM为基础的信用评估多边代理系统 2507.22758v1 |
Authors (3): Gautam Jajoo, Pranjal A Chitale, Saksham Agarwal
Recent advancements in financial problem-solving have leveraged LLMs and agent-based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule-based methods and statistical models. In this paper, we introduce MASCA, an LLM-driven multi-agent system designed to enhance credit evaluation by mirroring real-world decision-making processes. The framework employs a layered architecture where specialized LLM-based agents collaboratively tackle sub-tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision-making. We further present a signaling game theory perspective on hierarchical multi-agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM-based multi-agent systems in financial applications, particularly in credit scoring.
最近在金融问题解决方面的进展利用了LLMs和代理系统,主要侧重于贸易和金融模式,然而,信用评估仍是一个未得到充分探讨的挑战,传统上依赖基于规则的方法和统计模式。我们在本文件中引入了由LMCA驱动的多种代理系统,即由LMCA驱动的多种代理系统,目的是通过反映现实世界的决策进程加强信用评估。框架采用一个多层结构,由基于LLMM的专门代理机构协作应对次级任务。此外,我们将风险和奖励评估的对比性学习纳入优化决策。我们进一步展示了有关等级多试剂系统的示性游戏理论观点,提供了对其结构和互动的理论见解。我们的文件还包括了信用评估中的详细偏见分析,解决公平问题。实验结果表明,MASCA在金融应用中,特别是信用评分中,以LMM为主的多代理系统优于基线方法,突出其效力。
Article 114
Title@2025-07-30 (3): Opportunities and Challenges of LLMs in Education: An NLP Perspective
Title: Opportunities and Challenges of LLMs in Education: An NLP Perspective | Chancen und Herausforderungen von LLM im Bildungswesen: Eine NLP-Perspektive | 教育中法学硕士的机遇和挑战:国家学习方案展望 2507.22753v1 |
Authors (5): Sowmya Vajjala, Bashar Alhafni, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar
Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios: {\em assistance} and {\em assessment}, grounding them along the four dimensions – reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs, and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.
考虑到大型语言模式为教学、学习和评估提供的新机会,人们对大型语言模式在教育中的作用的兴趣正在增加,在本文件中,我们结合两个主要应用情景(`em援助}和{em评估}),审查LLM对教育NLP的影响,将其建立在阅读、写作、讲演和辅导这四个层面的基础上。然后,我们介绍LLMM所促成的新方向和要应对的主要挑战。我们设想,这一全面概览将有益于NLP研究人员和有意探索LLMs在开发未来以语言为主和以NLP为主的教育应用方面的作用的从业人员。
Article 115
Title@2025-07-30 (3): CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
Title: CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset | CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset | CUS-QA:以本地知识为主的不限成员名额问题解答数据集 2507.22752v1 |
Authors (4): Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico
We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.
我们引入了一个包含文字和视觉两种方式的开放式区域问题解答基准,我们还利用最先进的大型语言模型(LLMs)提供强有力的基线。我们的数据集由捷克、斯洛伐克和乌克兰的土著发言人在维基百科基础上手工整理的问答以及随附英文译文组成。它包括纯文字问题和需要视觉理解的问题。作为一个基线,我们通过激发和补充对答案正确性的人类判断来评估最先进的LLMs。我们利用这些人类评估,分析现有自动评价指标的可靠性。我们的基线结果突显了当前LLMs之间在区域知识方面存在的巨大差距。此外,除了LLM评价之外,自动化指标与人类判断之间几乎没有什么关联。我们发布这一数据集,作为资源(1) 评估LMS的区域知识,(2) 研究挑战性环境中的跨语言生成一致性,(3) 推进为开放式问题解答制定评价指标。
Article 116
Title@2025-07-30 (3): Next Tokens Denoising for Speech Synthesis
Title: Next Tokens Denoising for Speech Synthesis | Nächste Tokens Denoising für Sprachsynthese | 下一集 Tokens 代言人演讲综述 2507.22746v1 |
Authors (10): Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao
While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact 12.5 tokens per second rate. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also makes the proposed model particularly effective for generating extended content. Experiment for demos of our work} on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
虽然扩散和自动递减模型(AR)模型具有显著的进步基因模型,但它们都具有不同的局限性。AR模型依赖因果关注,无法利用未来背景,并且受到缓慢的生成速度的影响。相反,扩散模型与关键值(KV)的缓冲相争。为了克服这些挑战,我们引入了龙-调,这是一个新颖的文本到语音(TTS)设计,可以统一AR和流程匹配。这个模型处理块块中48千赫兹音调解码器,以每秒12个缩压压制成。这个设计可以让AR在块间建模,确保全球的一致性,而块内的平行流配制有助于快速迭代分解。因此,拟议的模型可以利用KV缓冲,将未来环境纳入每个块中。此外,它连接连续和离散的特征模型,表明持续的AR流配制能以有限的量度四分辨器预测离散的象征物。这个高效的编码和快速块反向递增结构也使得拟议的模型对生成扩展的内容特别有效。对高频度数据定位的定位进行实验。
Article 117
Title@2025-07-30 (3): Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index
Title: Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index | Verringerung der Halluzinationen in der Zusammenfassung durch Verstärkungslernen mit Entity Halluzination Index | 利用实体幻觉指数,通过强化学习减少在总结中的幻觉 2507.22744v1 |
Authors (4): Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala
Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a rewarddriven fine-tuning framework that explicitly optimizes for Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight, hallucintion metrics like EHI.
在现实世界环境中部署语言模型(LMS)时,减少抽象合成的幻觉仍然是一项关键的挑战。在这项工作中,我们引入了一个奖赏驱动的微调框架,明确优化实体幻觉指数(EHI),该指数旨在量化名称实体的存在、正确性和在生成摘要中的依据。根据一系列会议记录,我们首先使用预先培训的LM(LM)生成基线摘要,然后通过自动实体提取和匹配来计算EHI分数。然后我们运用强化学习来微调模型参数,利用EHI作为奖赏信号来生成对实体信仰产出的偏见。我们的方法并不依赖于人写的事实性说明,从而能够进行可扩展的微调。实验表明EHI跨数据集的持续改进,其质量分析显示实体一级幻觉的显著减少,而不会在流利或信息性方面出现退化。我们发布了一个可复制的Colab输油管,便利进一步研究使用轻量、致幻度指标(如EHI)对幻模型进行微调。
Article 118
Title@2025-07-30 (3): Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
Title: Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning | Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung | 标定验证符:评估编码和理由的合成核查 2502.13820v3 |
Authors (4): Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLM) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
合成核查技术,如产生测试案例和奖励建模,是提高大型语言模型(LLM)的编码能力,超越预先界定的测试的常见方法。此外,守则核查最近发现,作为通过强化学习提高LLMS推理能力的一个关键组成部分,在通过强化学习提高LMS推理能力方面,取得了巨大成功。在本文件中,我们提出一种方法,可将现有的编码基准转换成评分和排名数据集,以评价合成核查员的效力。我们还提出多种指标,用拟议基准衡量合成核查员的不同方面。通过采用拟议方法,我们发布了四个新的基准(HE-R、HE-R+、MBPP-R和MBPP-R+),并以标准、推理和奖励为基础的LMS分析合成核查方法。我们的实验表明,推理可以大大改进测试案例的生成,扩大测试案例的数量可以提高核查的准确性。
Article 119
Title@2025-07-30 (3): Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning
Title: Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning | Ressourceneffiziente Anpassung großer Sprachmodelle für Text-Embeddings über Prompt Engineering und Contrastive Fine-Tuning | 通过即时工程和反竞争微调对文本嵌入大语言模型进行资源高效率的改编 2507.22729v1 |
Authors (6): Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.
大型语言模型(LLMS)已成为自然语言处理(NLP)的基石,在文本生成中取得了令人印象深刻的成绩。它们的象征性代表层捕捉了丰富的、与人类一致的语义。然而,将这些矢量集中到一个嵌入抛弃物的文本中,关键的信息。然而,许多非遗传性的下游任务,如集群、分类或检索,仍然取决于准确和可控制的判决或文件级嵌入。我们探索了预先训练的、非编码的LMS(NLP)的几种适应战略,在文本生成过程中取得了令人印象深刻的成绩。我们探讨了一些适应策略,这些策略是:(一) 代用品嵌入的各种聚合技术,(二) 特定任务迅速的工程,以及(三) 通过对比性微调增强文本层。将这些组件合并起来,可以产生在大规模文本嵌入基准(MTEB)的英国组合轨迹上的最先进的表现。对关注地图的分析进一步表明,微调的焦点从提示符号转向语义相关词,表明对含义进行更有效的压缩到最后隐藏状态。我们的实验表明,LMSMs可以有效地作为文本嵌入模型,通过迅速的工程和资源节制生成的合成对准组合。
Article 120
Title@2025-07-30 (3): Investigating Hallucination in Conversations for Low Resource Languages
Title: Investigating Hallucination in Conversations for Low Resource Languages | Untersuchung von Halluzinationen in Gesprächen über Sprachen mit geringem Ressourcenreichtum | 低资源语言对话中的幻觉 2507.22720v1 |
Authors (10): Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resemble human writing. However, they often generate factually incorrect statements, a problem typically referred to as ‘hallucination’. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
大型语言模型(LLMS)在生成与人文写作非常相似的文本方面表现出了非凡的熟练程度,然而,它们往往产生事实错误的陈述,通常被称为“职业介绍”,解决幻觉问题对于提高LLMS的可靠性和有效性至关重要。虽然许多研究都侧重于英语的幻觉,但我们的研究将这一调查扩大到三种语言的谈话数据:印地语、法西语和普通话。我们对一套数据进行了全面分析,以检查GPT-35、GPT-4o、Llama-3.1、Gemma-2.0、DeepSeek-R1和Quen-3等语言中这些语言中的事实和语言错误。我们发现LMS在曼达林语中产生的幻觉反应很少,但在印地语和法西语中产生的幻觉数量却要高得多。
Article 121
Title@2025-07-30 (3): Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining
Title: Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining | Erhöhung der Ultra-Low-Bit-Quantisierung großer Sprachmodelle durch Saliency-Aware Partial Retraining | 通过提高质量-软件部分再培训,加强大语言模型的超低比小量量化 2504.13932v3 |
Authors (2): Deyu Cao, Samin Aref
The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods compress model size through replacing their high-precision parameters by quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ’s level. First, we look into combining existing quantization-aware training techniques with ApiQ’s partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be feasible through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces the ApiQ’s accuracy degradation by 10.85% and 7.54% respectively. A Python implementation of the proposed quantization method is publicly available on GitHub https://github.com/TokuyuSou/ULB-SAPR.
大型语言模型的使用日益增多,引起了人们对在推断期间资源使用强度的环境和经济关切。向每个用户提供这些模型需要大量的能量和水来进行冷却。模型压缩技术,如量化等,可以压缩大型语言模型,使其以潜在性能退化的代价提高资源效率。量化方法压缩模型规模,以低精度值的分量值取代高精度参数。在现有方法中,ApiQ方法在最小的记忆和时间管理上实现高度准确性保护。我们调查了将超低比特四分化的性能扩大到ApiQ的两种想法。首先,我们研究了将现有的量化培训技术与ApiQ的部分培训结合起来的情况。我们表明,这并没有以有限的培训数据和冷却的重量来超过ApiQ的基线方法。这导致两个关键见解:(1) 通过全面再培训获得的大量代表能力不太可能通过部分培训获得。(2)这一增益取决于在夸度化培训中采用大规模和多样化的数据设置的参数。首先,我们研究了现有的量级-测测测测测测测测标准,在Sqial-ralreal再使用两种方法后,我们提出了一种完整的业绩方法。
Article 122
Title@2025-07-30 (3): From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs
Title: From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs | Von der Fähigkeit zur Reflexion: Stärkungsorientiertes Denken Qualität in retrieval-augmented Begründung für LLMs | 从充足到反思:LLMs在追偿和增加理由方面的强化引导思考质量 2507.22716v1 |
Authors (3): Jie He, Victor Gutierrez Basulto, Jeff Z. Pan
Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
以学习为基础的强化检索-增强一代(RAG)方法加强了大型语言模型(LLMS)的推理能力。然而,多数人只依赖最后回答的奖励,而忽略中间推理质量。本文分析现有RAG推理模型,并查明三个主要失败模式:(1)信息不足,意味着模型未能获得足够的支持;(2)错误推理,尽管信息充足,但逻辑或内容层次的缺陷似乎不足;(3)答案推理不一致,因为有效的推理链导致有错配的最后答案。我们提议TIRESRAG-R1,一个利用思维-检索-反射进程和多层面奖励系统来改进推理和稳定性的新框架。TIRESRAG-R1介绍了:(1)鼓励彻底检索的充分奖励;(2)评估推理链的合理性和准确性的推理质量奖励;(3)发现和修改错误的反省奖励。我们还采用难辨称战略和培训样本过滤,以提高复杂任务的绩效。在四种多霍普-QA数据集上进行的实验表明,TIRESRAG-RAG1号现有数据格式比以往的数据格式。
Article 123
Title@2025-07-30 (3): UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Title: UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis | UI-E2I-Synth: Weiterentwicklung der GUI-Grundierung mit großformatiger Instruktionssynthese | UI-E2I-Synth:以大型教学合成为基础推进图形界面 2504.11257v4 |
Authors (4): Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu
Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, the GUI instruction grounding, which maps user instruction to the location of corresponding element on the given screenshot, remains a critical challenge, particularly due to limited public training dataset and resource-intensive manual instruction data annotation. In this paper, we delve into unexplored challenges in this task including element-to-screen ratio, unbalanced element type, and implicit instruction. To address these challenges, we introduce a large-scale data synthesis pipeline UI-E2I-Synth for generating varying complex instruction datasets using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release corresponding artifacts at https://microsoft.github.io/FIVE-UI-Evol/ .
大型视觉语言模型最近的进展正在加速开发图形用户界面(GUI)代理器,这些代理器利用人性化的视觉感知能力提高数字装置的生产率。与基于GUI元数据的方法相比,基于愿景的方法具有更广泛的适用性,因为GUI依靠平台,容易出现执行差异。在这个基于愿景的模式中,图形指导定位将用户指示映射到特定截图上相应元素的位置,这仍然是一个重大挑战,特别是因为公共培训数据集和资源密集的人工指令数据说明有限。在本文中,我们深入探讨了这一任务中未探讨的挑战,包括元素对屏幕比率、不平衡元素类型和隐含的指令。为了应对这些挑战,我们采用了大规模的数据合成管道UIUI-E2I-Synth 方法,用于使用GPT-4o而不是使用人类说明器生成不同的复杂指令数据集。此外,我们提出了一个新的界面指令,目的是通过纳入不同的注释说明,解决现有基准的局限性,包括元素对屏幕比率、不平衡的元素类型类型和隐含的指令。为了应对这些挑战,我们模型、经过培训的管道中的最新数据分析,通过拟议的地面分析,将实现拟议的地面分析。
Article 124
Title@2025-07-30 (3): Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations
Title: Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations | Raumsprache Likelihood Grounding Network für Bayesian Fusion von Mensch-Roboter-Beobachtungen | Bayesian人类-机器人观测融合空间语言定位网络 2507.19947v2 |
Authors (4): Supawich Sitdhipol, Waritwong Sukprasongdee, Ekapol Chuangsuwanich, Rina Tse
Fusing information from human observations can help robots overcome sensing limitations in collaborative tasks. However, an uncertainty-aware fusion framework requires a grounded likelihood representing the uncertainty of human inputs. This paper presents a Feature Pyramid Likelihood Grounding Network (FP-LGN) that grounds spatial language by learning relevant map image features and their relationships with spatial relation semantics. The model is trained as a probability estimator to capture aleatoric uncertainty in human language using three-stage curriculum learning. Results showed that FP-LGN matched expert-designed rules in mean Negative Log-Likelihood (NLL) and demonstrated greater robustness with lower standard deviation. Collaborative sensing results demonstrated that the grounded likelihood successfully enabled uncertainty-aware fusion of heterogeneous human language observations and robot sensor measurements, achieving significant improvements in human-robot collaborative task performance.
人类观测的阻燃信息可以帮助机器人克服协作任务中的感知限制。然而,一个具有不确定性的聚合框架需要具有代表人类投入不确定性的有根有据的可能性。本文介绍了一个地貌虫状金字塔网络(FP-LGN),它通过学习相关的地图图像特征及其与空间关系语义的关系而将空间语言作为基础。模型被培训为概率估计器,以便利用三阶段课程学习来捕捉人类语言中的感知性不确定性。结果显示,FP-LGN与专家设计的规则相匹配,其平均值为负日志-日产(NLLL),并显示在较低标准偏差情况下更加稳健。合作感结果表明,基于地貌的概率成功地促成了多种人类语言观测和机器人传感器测量的不确定性-认知融合,从而在人类机器人协作性工作方面取得了显著的改进。
Article 125
Title@2025-07-30 (3): Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment
Title: Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment | Hören auf das Unausgesprochene: Erforschen von 365 Aspekten der multimodalen Interview-Performance Assessment | 聆听无语者:探索多模式访谈业绩评估的365方面 2507.22676v1 |
Authors (6): Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang
Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores ``365’’ aspects of interview performance by integrating \textit{three} modalities (video, audio, and text), \textit{six} responses per candidate, and \textit{five} key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams and subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
面试业绩评估对于确定候选人是否适合专业职位至关重要。为了确保整体和公平评价,我们提议了一个新颖和全面的框架,探讨“365”“365”的面试业绩方面,将“textit{3⁄3”模式(视频、音频和文本)、每个候选人的回答(textit{6}6}答复和关键评价层面结合起来。框架采用特定模式的特征提取器,将不同数据流编码,随后通过一个共同压缩多层次的多层次接受器进行整合。这个模块压缩将多式联运嵌入一个统一的潜在空间,便利高效率的特征互动。为了提高预测的稳健性,我们纳入了一个两级的混合学习战略:(1) 独立回归头预测每个答复的评分,以及(2) 利用一个平均集合机制对各种答复进行汇总预测,以产生五个目标层面的最后评分。通过听不说,我们的方法从多式联运数据中获取明确和隐含的提示,能够进行全面和公正的评估。实现多维平均数0.1824,我们的框架在2025年AVI 挑战性评估中首次得到保障。
Article 126
Title@2025-07-30 (3): What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization
Title: What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization | Wovon reden sie? Ein Benchmark der wissensgeprägten Diskussionszusammenfassung | 他们在谈论什么?知识类讨论总结的基准 2505.12474v2 |
Authors (7): Weixiao Zhou, Junnan Zhu, Gengyao Li, Xianfu Cheng, Xinnian Liang, Feifei Zhai, Zhoujun Li
Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.
传统对话总结主要侧重于对话内容,假定它包含足够的信息,可以提出明确的总结,但这一假设往往不能用于基于共同背景的讨论,因为与会者经常略去背景,使用隐含的参考,结果摘要使不熟悉背景的读者感到困惑。为了解决这个问题,我们引入了知识四面讨论总结(KGDS),这是一项新颖的任务,为背景提供了补充背景摘要,并提供了明确的意见摘要。为了便于研究,我们构建了第一个KGDS基准,以进行新闻讨论的对口和专家为次摘要评价创建的多色金说明为主。我们还提出了一个新的等级评价框架,配有精细的和可解释的参数。我们对12个先进的大语言模型(LLMS)的广泛评价表明,KGDS仍然是一个重大挑战。模型常常忽略关键事实,在背景总结中保留不相干的内容,而且常常无法解决意见摘要整合中隐含的参考。
Article 127
Title@2025-07-30 (3): Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
Title: Instruction-tuned Large Language Models for Machine Translation in the Medical Domain | Instruktionsorientierte große Sprachmodelle für die maschinelle Übersetzung im medizinischen Bereich | 医疗领域机器翻译大语言模型 2408.16440v2 |
Authors (1): Miguel Rios
Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics.
大型语言模型(LLMS)在高资源语言配对和域的机器翻译方面显示出了大有希望的成果,然而,在专业领域(例如医疗)LMS与标准的神经机翻译模型相比表现较差,术语的机器翻译的一致性对于专业领域的用户、研究人员和笔译员至关重要。在本研究中,我们比较了医疗领域的基线LMS和按指示调整的LMS之间的性能。此外,我们还将专门医学词典中的术语引入了用于微调LMS的规范格式化数据集。通过自动衡量,指导调整LMS的LMS大大超过了基线模型。
Article 128
Title@2025-07-30 (3): QE4PE: Word-level Quality Estimation for Human Post-Editing
Title: QE4PE: Word-level Quality Estimation for Human Post-Editing | QE4PE: Qualitätsschätzung auf Word-Ebene für die menschliche Nachbearbeitung | QE4PE: 计算后人类的字级质量估算 2503.03044v2 |
Authors (6): Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
字级质量估计(QE)方法旨在检测机器翻译中的误差,这可以指导和促进人类编辑后编辑工作的误差。虽然对字级质量评价系统的准确性进行了广泛评估,但其可用性和对编辑后人类编辑工作的速度、质量和编辑选择的下游影响仍然研究不足。在这项研究中,我们调查了字级质量评价对机器翻译(MT)编辑后编辑的影响。我们发现,在现实环境下,有42个专业编辑后编辑跨两个翻译方向的42个专业版本。我们比较了四种错误分布突出模式,包括有监督的和不确定的字级质量评价方法,用以查明最先进的神经计量模型产出中的潜在错误。后编辑工作和生产率是从行为记录中估算出来的,而质量改进则由字级和分级人类注释评估。我们发现,域、语文和编辑速度是确定重点有效性的关键因素,人造和自动化的QE之间差异不大,突出了专业工作流程中准确性和可用性之间的差距。
Article 129
Title@2025-07-30 (3): Multilingual Political Views of Large Language Models: Identification and Steering
Title: Multilingual Political Views of Large Language Models: Identification and Steering | Mehrsprachige politische Ansichten von großen Sprachmodellen: Identifikation und Steuerung | 大语言模式多语言多语言政治观点:识别和指导 2507.22623v1 |
Authors (6): Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli
Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases–frequently skewing toward liberal or progressive positions–key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
大型语言模型(LLMS)越来越多地用于日常工具和应用,引起了人们对其对政治观点潜在影响的担忧。先前的研究显示,LLMS经常表现出可衡量的政治偏见,经常地向自由或进步立场-关键差距倾斜。大多数现有研究只评价一套狭窄的模式和语言,留下关于政治偏见在建筑、规模和多语言环境之间的普遍性的开放问题。此外,很少有人研究这些偏见能否得到积极控制。在这项工作中,我们通过在现代开放源码指令调控LMS中大规模研究政治取向来消除这些差距。我们用14种语言评估了7种模式,包括LalaMA-3、Qwen-3和Aya-Explanse,使用11种语义等同的语句子来评估,以确保稳健健的测量。我们的结果表明,较大的模式始终向自由左翼立场转变,在各语言和模范家庭之间差异很大。要测试政治姿态的可操纵性,我们使用简单的中质中心激活干预技术,并显示它可靠地指导着多种语言/MLA-MA-MS-MR的替代意识形态立场。
Article 130
Title@2025-07-30 (3): Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Title: Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation | Sprache Arithmetik: Auf dem Weg zur systemischen Sprache Neuronenidentifikation und Manipulation | 语言解貌学:迈向系统语言中中子识别和操纵 2507.22608v1 |
Authors (6): Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
大型语言模型(LLMS)具有很强的多语种能力,但具体语言处理背后的神经机制仍然不清楚。我们分析了Llama-3.1-31-8B、Mistral-Nemo-12B和Aya-Expanse-8B和32B中21种类型多样的语言中与语言相关的神经元,确定了控制语言行为的神经元。使用语言动能渗透(LAPE)方法,我们发现这些神经元在更深层次上聚居,非拉丁文字更加专业化。相关语言共享重叠的神经元,反映了语言近距离的内部表现。我们通过语言算术,即系统激活添加和倍增,引导模式停止使用不需要的语言并激活理想语言,超越了更简单的替换方法。这些干预措施有效地指导了五个多语言任务的行为:语言强迫、翻译、QA、理解和NLIL. 人工智能,高资源语言的使用更成功,而类型相似性则提高效力。我们还表明,跨语言神经导能增强下游的性,并揭示了在神经系统逐步停止使用/Mangurip时用于选择语言的内部“倒退”机制。我们的代码是公开的。
Article 131
Title@2025-07-30 (3): UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Title: UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding | UI-AGILE: Verbesserung von GUI-Agenten mit effektivem Verstärkungslernen und präziser Schlussfolgerungs-Zeiterdung | UI-AGILE: 提高具有有效强化学习和精确推断时间定位的图形代理器 2507.22025v2 |
Authors (7): Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.
多种多式大语言模型(MLLMM)的出现推动了图形用户界面(GUI)代理能力的显著进步,然而,现有的GUI代理培训和推断技术仍然在推理设计、无效奖赏和视觉噪音方面处于两难境地。为了解决这些问题,我们引入了UI-AGILE,这是在培训和推理阶段加强GUI代理的综合框架。为了培训,我们建议了一套改进监督微调进程(SFT)的全套改进办法:1) 不断提升功能,激励高精度地面定位;2) 一项“简单思考”奖励,以平衡规划与速度和定位准确性之间的平衡;3) 一项基于裁剪裁法的重现战略,以缓解稀少的奖励问题,改进复杂任务的学习。据推测,我们提出了与选择的松散基础,这是一种新颖方法,通过将图像破碎成小、可操作性部分,大大提高高分辨率显示的准确性。实验显示,UI-AGILE在两个基准标准上实现了最新业绩,利用23SProsporS-Stoforforfor 提高基准方法,在23SPropos-scloforfor-sc-sclopprobortial
Article 132
Title@2025-07-30 (3): BALSAM: A Platform for Benchmarking Arabic Large Language Models
Title: BALSAM: A Platform for Benchmarking Arabic Large Language Models | BALSAM: Eine Plattform für Benchmarking arabischer Großsprachenmodelle | BALSAM:阿拉伯语大语言模式基准制定平台 2507.22603v1 |
Authors (43): Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov, Norah Alzahrani, Eman alBilali, Nizar Habash, Abdelrahman El-Sheikh, Muhammad Elmallah, Haonan Li, Hamdy Mubarak, Mohamed Anwar, Zaid Alyafeai, Ahmed Abdelali, Nora Altwairesh, Maram Hasanain, Abdulmohsen Al Thubaity, Shady Shehata, Bashar Alhafni, Injy Hamed, Go Inoue, Khalid Elmadani, Ossama Obeid, Fatima Haouari, Tamer Elsayed, Emad Alghamdi, Khalid Almubarak, Saied Alshahrani, Ola Aljarrah, Safa Alajlan, Areej Alshaqarawi, Maryam Alshihri, Sultana Alghurabi, Atikah Alzeghayer, Afrah Altamimi, Abdullah Alfaifi, Abdulrahman AlOsaimy
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
英文大语言模型(LLMs)的进步令人印象深刻,没有在所有语文中相匹配,特别是由于数据稀缺、阿拉伯语及其方言语言多样性、形态复杂等原因,LLM在阿拉伯语方面的表现落后。 阿拉伯基准的质量进一步阻碍了进展,这些基准通常依靠静态的公开数据,缺乏全面的任务覆盖,或没有提供专用的盲人测试仪平台。这给衡量实际进展和减少数据污染带来了挑战。我们在这里的目标是弥合这些差距。特别是,我们引入了BALSAM,这是一个由社区驱动的综合基准,旨在推进阿拉伯语LM的发展和评价,其中包括来自14大类的78项NLP任务,其中52K实例分为37K测试和15K开发,以及一个集中、透明的盲人评价平台。我们设想BALSAM是一个统一平台,用以制定标准,促进合作研究,以提高阿拉伯语LM能力。
Article 133
Title@2025-07-30 (3): Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation
Title: Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation | Lernen, rationale Beweise durch Verstärkungslernen für die retrieval-angereicherte Generation zu extrahieren | 学习如何通过为回收-提款一代人加强学习来提取合理证据 2507.15586v4 |
Authors (7): Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang
Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose EviOmni, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of EviOmni, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
重新获取-增强一代(RAG)有效地提高了大语言模型(LLMs)的准确性。然而,检索噪音对LLMs的生成质量有重大影响,需要开发脱网机制。以前的方法直接地提取证据,而没有明确的思考,有可能过滤关键线索,通过一般化而挣扎。为此,我们建议EviOmni学习通过:(1) 明确推理,首先确定检索内容中的潜在线索,然后(2) 有意识地提取证据,以避免遗漏任何有助于回答问题的关键线索。具体地说,我们将证据推理和证据提取纳入终端到终端培训的统一对策中;应用知识符号面罩解密,以得出基于推理和提取的答案;设计三种可核查的奖励功能,包括答案、长度和格式,以便通过政策优化算法更新模型。关于三个基准数据集的广泛实验显示EviOmni的有效性,提供紧凑和高质量的证据,提高下游任务的准确性,并促进在线RAG系统的有效应用。
Article 134
Title@2025-07-30 (3): Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Title: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Die Frontier of Vision-Language Models erkunden: Eine Übersicht aktueller Methoden und Zukunftsrichtungen | 探索远景-语言模型的前沿:对当前方法和未来方向的调查 2404.07214v3 |
Authors (5): Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data.We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyzed the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
大型语言模型(LLMS)的出现大大改变了AI革命的轨迹,然而,这些LLMS显示出明显的局限性,因为它们主要擅长处理文本信息。为了解决这一制约因素,研究人员努力将视觉能力与LLMS结合起来,从而导致产生视觉语言模型(VLMS)的出现。这些先进的模型有助于处理更复杂的任务,例如图像字幕和视觉问题解答。在我们的综合调查文件中,我们深入探讨了VLMS领域的主要进展。我们的分类将VLMS分为三个不同的类别:专门用来处理视觉语言理解的模型、处理多式联运投入的模型,以产生接受和产生多种形式投入和产出的单一形式(LMS)产出和模型。这种分类基于他们在处理和生成各种数据模式方面各自的能力和功能。我们仔细地区分了每一种模型,对它的基础结构、培训数据来源以及尽可能的优势和局限性进行了广泛的分析,为读者提供了对其基本组成部分的全面理解。我们还分析了VLMS在各种基准领域研究领域取得突破性进展的成绩,从而强调了我们今后对数据前景进行突破的突破性研究的突破。
Article 135
Title@2025-07-30 (3): Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck
Title: Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck | Effizientes kontinuierliches Lernen für kleine Sprachmodelle mit einem diskreten Schlüsselwert-Bottleneck | 高效持续学习具有分立键- Value 瓶颈的小语言模式 2412.08528v2 |
Authors (4): Andor Diera, Lukas Galke, Fabian Karl, Ansgar Scherp
Continual learning remains a challenge across various natural language processing (NLP) tasks, as models updated with new training data often risk catastrophic forgetting of previously acquired knowledge. We introduce a discrete key-value bottleneck (DKVB) for encoder-only language models, enabling efficient continual learning through localized updates. Inspired by a discrete key-value bottleneck in vision, we consider new and NLP-specific challenges. We compare different bottleneck architectures for NLP and introduce a new, task-independent initialization technique for the discrete keys. We evaluate our DKVB for NLP in four continual learning scenarios and show that it alleviates catastrophic forgetting. Our experiments demonstrate that the proposed approach achieves competitive performance compared to popular continual learning methods while incurring lower computational costs. Furthermore, we show that DKVB remains effective even in challenging single-head continual learning scenarios where no task ID is provided.
持续学习仍然是各种自然语言处理(NLP)任务中的一项挑战,因为根据新的培训数据更新的模型往往有灾难性地忘记先前获得的知识的风险。我们为只编码器语言模型引入了独立的关键值瓶颈(DKVB),通过本地更新,能够高效地持续学习。在独立的关键值瓶颈的启发下,我们考虑新的和NLP特有的挑战。我们比较了不同的关键值架构,为离散键引入了新的、任务独立的初始化技术。我们在四个持续学习情景中为NLP评估了我们的DKVB, 并表明它缓解了灾难性的遗忘。我们的实验表明,拟议方法在降低计算成本的同时,取得了与流行的持续学习方法相比的竞争性业绩。此外,我们表明即使在没有提供任务识别符的具有挑战性的单头持续学习情景中,DKVB依然有效。
Article 136
Title@2025-07-30 (3): Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning
Title: Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning | Effizientes Differentielles Privates Feintuning von LLMs durch Verstärkungslernen | 通过强化学习对LLMs 进行有区别的私人高效率私人罚款 2507.22565v1 |
Authors (5): Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani, Gilbert Fridgen
The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora including healthcare. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline’s final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same ($\epsilon$, $\delta$)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
数据隐私和模型效用之间的紧张关系已成为实际部署大型语言模型(LLMS)(LLMs)(LLMs)(LLMs)(LLMs)(LLMs)(LLMS)(LLMM)(LLMM)(LLMM(DP-SGD(DP-SGD))(DP-SGD(DP-SGD))(DP-SGD)(DP-SGD)(DP-SGD)(DP-SGD)(DP-SGD(DP-SGD)(DP)(DP)(DP)(LLM)(PLLLLD)(T)(TLLLLD)(TF)(LLLLDDP(LD)(LLLD)(T)(SLLD)(LLLLDP)(S)(LLLD)(LLLD)(LP)(LD)(LO)(LDB(LD)(L)(LDRD)(LD)(IDRD)(L-I-ID)(SD)(LD)(LI-I-IDID)(S)(S)(S)(S-I(SD)(SDLI)(S)(S)(S)(S)(SLD)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(S)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(S)(L)(IDL)(L)(L)(L)(ID)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(L)(
Article 137
Title@2025-07-30 (3): Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Title: Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs | Nutzung synergistischer Kognitiv-Biasen zur Umgehung der Sicherheit in LLMs | 利用协同协同一致的双星体在LLM中用于绕过安全 2507.22564v1 |
Authors (5): Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
大型语言模型(LLMS)展示了在一系列广泛任务中令人印象深刻的能力,然而,其安全机制仍然容易受到利用认知偏差 – – 系统偏离理性判断的系统偏差 – – 的对抗性攻击。与以往侧重于迅速工程或算法操纵的侵入性做法不同,这项工作凸显了在破坏LLM保障措施方面多偏见互动的被忽视力量。我们提议了CognitiveAttack,这是一个新型的红色组合框架,系统地利用个人和综合认知偏差。通过整合受监督的微调和强化学习,CognitiveAttack生成了闪烁,将最佳偏差组合嵌入其中,有效绕过安全协议,同时保持高攻击成功率。实验结果揭示了30个不同LMSM的显著脆弱性,特别是在开放源模型中。ConnitiveAtack实现了比SOTA黑箱方法PAP(60.1%对31.6%)要高得多的攻击成功率,暴露了当前防御机制的关键限制。这些发现突出了多偏见相互作用,作为强大但未被充分利用的攻击矢控矢量媒介。这项工作通过连接认知科学和LLM安全系统,为新的学科视角。
Article 138
Title@2025-07-30 (3): Rationale-guided Prompting for Knowledge-based Visual Question Answering
Title: Rationale-guided Prompting for Knowledge-based Visual Question Answering | Rationale-geführte Aufforderung zur wissensbasierten visuellen Fragebeantwortung | 以知识为基础的视觉问题解答 2412.16936v2 |
Authors (4): Zhongjian Hu, Peng Yang, Bing Li, Fengyuan Liu
Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.
最近,大型语言模型(LLMs)被用于知识型视觉问答(VQA),尽管以往的研究取得了令人鼓舞的成果,但以往的方法促使LMs直接预测答案,忽略了中间思维过程。我们争辩说,以往的方法不足以激活LLMs的能力。我们提议了一个称为PLRH的框架,用基于知识的VQA的推理法刺激LLMs。PLRH促使具有思维链的LLMs产生理论超常,即中间思维过程,然后利用理论超常法来激励LMs预测答案。实验表明,我们的方法分别超过基于 OK-VQA和A-OKVQA的现有基线2.2和2.1。
Article 139
Title@2025-07-30 (3): Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
Title: Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection | 共同-DWG:多模式进攻性攻击物质探测联合加速维维维-韦兹交织和专家混合 2505.19010v2 |
Authors (4): Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.
由于整合文本和视觉信息可以大大提高分类、检索和现场理解等任务的业绩,因此,多模式学习已成为关键的研究方向,因为整合文本和视觉信息可以大大提高分类、检索和现场理解等任务的业绩。尽管在经过事先培训的大型模型方面有所进展,但现有方法往往缺乏充分的跨模式互动和僵化的融合战略,未能充分利用不同模式的互补优势。为了解决这些局限性,我们提议共同-AtenDWG, 与维维维维的引力和专家融合相结合。我们的方法先是将文字和视觉特征纳入一个共享的嵌入空间,在这个空间中,专用的共享机制能够促进各种模式之间的同步、细微和精细的相互作用。这通过一个符合维度的整合网络得到进一步加强。这一网络以适应性调整的方式在频道一级贡献以强调突出的信息。同时,双方向的聚合者独立地完善了具体模式的表达方式,而额外的跨保护层进一步调整模式。由此产生的特征通过一个专家融合模块加以汇总,该模块将知识化和自我保存,从而产生强有力的统一代表。在MIMIMIC和Semal-Slovelyal-modal-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-modal-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-
Article 140
Title@2025-07-30 (3): ControlMed: Adding Reasoning Control to Medical Language Model
Title: ControlMed: Adding Reasoning Control to Medical Language Model | ControlMed: Reasoning Control in das medizinische Sprachmodell aufnehmen | 控制Med:在医疗语文模式中增加理由控制 2507.22545v1 |
Authors (4): Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyungmin Roh
Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce \textbf{ControlMed}, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both \textit{direct} and \textit{reasoning responses}; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.
由于临床决策的生命关键性质要求可靠的支持,医疗领域越来越多地采用具有更高准确度和可解释性的大语言模型(LLMs),因为临床决策的生命关键性质要求得到可靠的支持。尽管取得了这些进步,但现有的推理LLMs往往产生不必要的冗长推理过程,导致大量的计算间接费用和反应延迟。这些限制妨碍了它们在现实世界临床环境中的实际部署。为了应对这些挑战,我们引入了一个医学语言模型(LLLMs),使用户能够积极控制推论时间的长度,通过细微对照控制标记。 控制Med通过三阶段编审得到培训:1)关于大规模综合医疗指导数据集的预先培训,涵盖\ textit{direct}和\textit{irective responsert;2)用多长度推理数据和明确的长控标来监督微调;3)用基于模型的奖励信号加强学习,以提高事实准确度和反应质量。英语和韩国医学基准的实验结果表明,我们的模型在与州级综合医疗指导数据集中取得了类似或更好的性业绩。这些精确度分析是灵活分析所需要的,这些精确度和精确度分析,这些用户能够以灵活分析。
Article 141
Title@2025-07-30 (3): Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law
Title: Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law | Vortrainierte Modelle führen das Beste aus, wenn Token-Distributionen Zipfs Gesetz folgen | 事先培训的模型按照Zipf法在配制时最佳表现 2507.22543v1 |
Authors (3): Yanjin He, Qingkai Zeng, Meng Jiang
Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf’s law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf’s law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
本地化是自然语言处理( NLP) 和其他序列建模域中的一个基本步骤, 词汇大小的选择对模型性能有重大影响。 尽管它很重要, 选择最优词汇大小仍然未得到充分探索, 通常依赖超自然学或数据集特定选择。 在这项工作中, 我们提出一个原则性方法, 通过 Zipf 法则分析象征性频率分布来确定词汇大小。 我们显示下游任务性能与符号分布与权力法行为之间的密切关联, 与齐普菲安缩放相匹配既能提高模型性能, 也能提高模型性能。 跨国家语言、 基因组学和化学的广泛实验表明, 当象征性分布与 Zipf 法 的严格一致时, 模式始终能达到峰值, 从而将齐普菲亚 校准作为选择词汇大小的强有力和通用标准 。
Article 142
Title@2025-07-30 (3): A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support
Title: A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support | Benchmark Dataset und Evaluation Framework für vietnamesische Großsprachenmodelle im Kundensupport | 越南客户支助大语言模式基准数据集和评价框架 2507.22542v1 |
Authors (9): Long S. T. Nguyen, Truong P. Hua, Thanh M. Nguyen, Toan Q. Pham, Nam K. Ngo, An X. Nguyen, Nghi D. M. Pham, Nghia H. Nguyen, Tho T. Quan
With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at https://huggingface.co/datasets/ura-hcmut/Vietnamese-Customer-Support-QA.
随着人工智能的迅速增长,大型语言模型(LLMS)对于问答系统(QA)至关重要,提高了客户服务的效率和减少了人的工作量。越南LLMS(VLMS)的出现凸显了轻量级开放源模式作为准确性、效率和隐私效益的实际选择。然而,具体领域的评价仍然有限,缺乏反映实际客户互动的基准数据集,使得企业难以选择合适的支持应用模式。为弥补这一差距,我们引入客户支持对话数据集(CSCConDa),这是从与越南一家大型软件公司的人类顾问的实际互动中得出的9,000多对QA的调整基准。涵盖诸如定价、产品供应和技术故障排除等不同主题,CSConDa为在实际情景中评估VillMS(实际客户)评估提供了有代表性的基础。我们还提出了一个全面评价框架,将CSConDA的11个轻度开放源VillMS(轻度开放源)数据库与自动计量和合成分析结合起来,以揭示模型的优点、弱点、语言模型的辅助性模型的对比。本研究报告为数据库和数据库的系统化的系统化评估领域提供了可靠的数据选择。
Article 143
Title@2025-07-30 (3): Training language models to be warm and empathetic makes them less reliable and more sycophantic
Title: Training language models to be warm and empathetic makes them less reliable and more sycophantic | Training Sprachmodelle warm und einfühlsam zu sein macht sie weniger zuverlässig und sykophantischer | 培训语言模式,使其温暖和同情,使其不那么可靠,更具有共生性 2507.21919v2 |
Authors (3): Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher
Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.
人工智能(AI)开发者正在越来越多地用热和同情的人来建立语言模型,数百万人现在使用这些模型来提供咨询、治疗和陪伴。在这里,我们展示了这如何创造出一个重大的权衡:优化语言模型以换取温暖会破坏其可靠性,特别是当用户表示脆弱性时。我们对五种不同大小和结构的语言模型进行了有控制的实验,培训它们以产生更温暖、更同情的反应,然后对安全关键任务进行评估。热模型显示的误差率(+10至+30百分点)大大高于原来的对应方,推广阴谋理论,提供不正确的事实信息,并提供有问题的医疗建议。它们也极有可能验证不正确的用户信仰,特别是当用户信息表达悲哀时。重要的是,这些影响在不同模型结构之间是一致的,而且尽管在标准基准上保持了业绩,暴露了当前评价做法可能无法检测的系统性风险。由于像人类一样的人工智能系统以前所未有的规模部署,我们的调查结果表明有必要重新思考我们如何发展和监督这些正在改变人类关系和社会互动的系统。
Article 144
Title@2025-07-30 (3): CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records
Title: CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records | CliCARE: Grounding Large Language Models in klinischen Richtlinien zur Entscheidungsunterstützung über Longitudinal Cancer Electronic Health Records | CliCARE:在纵向癌症电子健康记录决策支持临床指南中以大语言模式为基础 2507.22533v1 |
Authors (6): Dongchen Li, Jitao Liang, Wei Li, Xiaoyu Wang, Longbing Cao, Kun Yu
Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.
大型语言模型(LLMS)为改进临床决策支持和减少医生的耗竭带来了重要前景,因为通过综合综合复杂、纵向癌症电子健康记录(EHRs),可以改善临床决策支持和减少医生的耗竭。然而,在这一重要领域实施这些模型面临三大挑战:无法有效处理患者记录的广泛长度和多语言性质,以便进行准确的时间分析;临床幻觉风险增加,因为Retreearval-Auged General(RAG)等传统基底技术没有适当纳入以流程为导向的临床指南;以及不可靠的评估指标,妨碍对人工智能系统在肿瘤学方面的验证。为了解决这些问题,我们提议CliCARE(CARE),一个在临床支持决定支持的临床大语言模型中定位大模范框架。 该框架通过将无结构的、纵向EHR(TKGs)转化为针对特定患者的时空知识图(TKGG)来捕捉取远程依赖性,然后将决策支持进程的基础是将这些现实世界病人的多样性轨迹与规范性指导知识图表相匹配。 这种方法为科学家提供了强有力的证据基础基础评估支持, 包括高层次的临床模型,通过高层次的临床模型数据模型,我们通过高层次的临床模型和高层次的临床模型数据模拟的临床模型数据分析, 显示了中国的临床模型数据模型数据。
Article 145
Title@2025-07-30 (3): Yankari: A Monolingual Yoruba Dataset
Title: Yankari: A Monolingual Yoruba Dataset | Yankari: Einsprachiger Yoruba-Datensatz | Yankari:单语Yoruba数据集 2412.03334v2 |
Authors (1): Maro Akpobi
This paper presents Yankari, a large-scale monolingual dataset for the Yoruba language, aimed at addressing the critical gap in Natural Language Processing (NLP) resources for this important West African language. Despite being spoken by over 30 million people, Yoruba has been severely underrepresented in NLP research and applications. We detail our methodology for creating this dataset, which includes careful source selection, automated quality control, and rigorous data cleaning processes. The Yankari dataset comprises 51,407 documents from 13 diverse sources, totaling over 30 million tokens. Our approach focuses on ethical data collection practices, avoiding problematic sources and addressing issues prevalent in existing datasets. We provide thorough automated evaluations of the dataset, demonstrating its quality compared to existing resources. The Yankari dataset represents a significant advancement in Yoruba language resources, providing a foundation for developing more accurate NLP models, supporting comparative linguistic studies, and contributing to the digital accessibility of the Yoruba language.
本文介绍Yoruba语言的大型单一语言数据集Yankari,这是一个大型的Yoruba语单一语言数据集,旨在解决这一重要的西非语言在自然语言处理资源(NLP)资源方面存在的重大差距。尽管有3 000多万人发言,Yoruba在NLP的研究和应用中的代表性严重不足。我们详细介绍了我们创建这一数据集的方法,其中包括仔细选择来源、自动化质量控制和严格的数据清理程序。Yankari数据集由13个不同来源的51 407份文件组成,总计超过3 000万个符号。我们的方法侧重于道德数据收集做法,避免问题源和解决现有数据集中普遍存在的问题。我们提供了对数据集的彻底自动评估,表明其质量与现有资源相比。Yankari数据集代表了Yoruba语言资源的重大进步,为开发更准确的NLP模型、支持比较语言研究以及帮助Yoruba语言数字无障碍提供了基础。
Article 146
Title@2025-07-30 (3): Probing Information Distribution in Transformer Architectures through Entropy Analysis
Title: Probing Information Distribution in Transformer Architectures through Entropy Analysis | Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse | 通过 Entropy 分析在变形结构中进行测试信息发布 2507.15347v2 |
Authors (5): Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano
This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may offer insights into model behavior and contribute to the development of interpretability and evaluation frameworks for transformer-based models
这项工作探索了作为在以变压器为基础的结构内进行信息传播的检验工具的酶分析。通过量化象征性的不确定性和审查不同处理阶段的酶型态,我们的目标是调查如何在这些模型内管理和转换信息。作为案例研究,我们将该方法应用到以GPT为基础的大型语言模型中,说明其揭示对模型行为和内部表现的洞察力的潜力。这一方法可以提供对模型行为的洞察力,并有助于为以变压器为基础的模型制定可解释性和评估框架。
Article 147
Title@2025-07-30 (3): SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
Title: SLM-SQL: An Exploration of Small Language Models for Text-to-SQL | SLM-SQL: Eine Erforschung kleiner Sprachmodelle für Text-zu-SQL | SMS-SQL:探索文字到SQL的小型语言模型 2507.22478v1 |
Authors (2): Lei Sheng, Shuai-Shuai Xu
Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87\% execution accuracy (EX), while the 1.5B model achieved 67.08\% EX. We will release our dataset, model, and code to github: https://github.com/CycloneBoy/slm_sql.
大型语言模型(LLMS)在将自然语言问题转换成 SQL 查询(Text-to-SQL)方面表现良好,相比之下,小型语言模型(SLMs)在将自然语言问题转换成 SQL 查询(Text-to-SQL)方面表现良好,从0.5B至1.5B参数之间,目前文本到SQL任务方面表现不佳,因为其逻辑推理能力有限。然而,可持续土地管理在推导速度和边缘部署的适宜性方面,具有内在的优势。为了在文本到SQL应用程序中探索其潜力,我们利用培训后技术的最新进展。具体地说,我们利用开放源SySQL-2.5M数据集来构建两个衍生数据集:SySQL-Think-916KSL生成的SMLL-SML-SML-Tink-916K参数,SQL生成SQL,SQL生成SQL的SQLMSQL和SQQQQQL-MQL-MQL-MQL-MQL-MQL-SQL的SQL,SQL的SQLMLMLD;SQLSQLMLMLMLMLDSQLDSQLMQLMQLDML生成的模型和SQL-SQL-SQL-SQL-SQL-SQL-SQL的模型,SQL-SQL-SQL-SQL-SQL-SQL-SQL EX-SQL EX-SQL EX-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL-SQL EX-SQL-SQL-SQL EX-SQ
Article 148
Title@2025-07-30 (3): Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR
Title: Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR | Dynamische Parameter für vietnamesische geschlechtsunabhängige ASR erkunden | 探索越南性别独立ASR的动态参数 2507.22964v1 |
Authors (4): Sotheara Leang, Éric Castelli, Dominique Vaufreydaz, Sethserey Sam
The dynamic characteristics of speech signal provides temporal information and play an important role in enhancing Automatic Speech Recognition (ASR). In this work, we characterized the acoustic transitions in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs) using polar parameters to capture the dynamic characteristics of the speech and minimize spectral variation. These dynamic parameters were combined with Mel-Frequency Cepstral Coefficients (MFCCs) in Vietnamese ASR to capture more detailed spectral information. The SSCF0 was used as a pseudo-feature for the fundamental frequency (F0) to describe the tonal information robustly. The findings showed that the proposed parameters significantly reduce word error rates and exhibit greater gender independence than the baseline MFCCs.
语音信号的动态特征提供了时间信息,并在加强自动语音识别方面发挥着重要作用。在这项工作中,我们用极地参数来捕捉语音动态特征并尽量减少光谱变异,用光谱信号的动态特征对光谱子波段中枢比例平面的声学转变作了描述。这些动态参数与越南ASR的Mel-Funity Cepstraal Covalies(MFCCs)相结合,以捕捉更详细的光谱信息。SSCF0被用作基本频率(F0)的假性功能,以有力地描述古典信息。调查结果显示,拟议的参数大大降低了字差率,并显示出比基线MFCC更多的性别独立性。
Article 149
Title@2025-07-30 (3): Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears
Title: Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears | Stimmen freiberuflicher Schriftsteller über KI: Einschränkungen, Erwartungen und Ängste | 自由职业作家对大赦国际的呼声:限制、期望和恐惧 2504.05008v2 |
Authors (4): Anastasiia Ivanova, Natalia Fedorova, Sergei Tilga, Ekaterina Artemova
The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption such as languages support, ethics, and long-term impact on writers voice and creativity remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers regularly using AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The findings of the survey demonstrate important insights, reflecting upon the importance of: LLMs adoption for non-English speakers; the degree of misinformation, domain and style adaptation; usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.
AI驱动工具的迅速发展,特别是大型语言模型(LLMS),正在改变专业写作,但是,采用这些工具的关键方面,如语言支持、道德和对作家声音和创造力的长期影响,仍然没有得到充分探讨,在这项工作中,我们经常使用AI对专业作家进行了问卷调查(N=301)和互动式调查(N=36),我们审查了LLM协助撰写25+种语言的做法、道德问题和用户期望,调查结果显示了重要的深刻见解,反映了以下做法的重要性:为非英语者采用LLMs;错误信息、领域和风格的适应程度;LLMs的实用性和关键特征。这些见解可以指导进一步发展,使作家和更广泛的用户基础受益。
Article 150
Title@2025-07-30 (3): IFEvalCode: Controlled Code Generation
Title: IFEvalCode: Controlled Code Generation | IFEvalCode: Kontrollierte Code-Generierung | IFEvalCode:受控制的代码生成 2507.22462v1 |
Authors (12): Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin
Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models’ ability to generate correct code versus code that precisely follows instructions.
守则大语言模型(Code LLMS)通过将自然语言描述转换成功能代码,在代码生成方面取得了显著进展;然而,现实世界应用往往要求严格遵循详细要求,如编码样式、行数和结构限制,而不仅仅是正确性;为此,本文件介绍了前向和后向制约生成,以提高守则LLM在受控代码生成中的遵循指令能力,确保产出与人类定义指南更加一致。作者还介绍了IFEvalCode,这是一个多语言基准,包括了7种编程语言(Python、Java、JavaScript、TypeScript、Shell、C++和C#)的1.6K测试样本,其中每种样本都包含中英查询。与现有的基准不同,IFEvalCode decouples评价分为两个衡量尺度:正确性(Corr.)和教学跟踪(Instr.),能够进行更细致的评估。对40多个LLMS进行了实验,表明封闭源模型在可控代码生成中超越开放源,并突出模型在生成代码与代码指示之间的巨大差距。
Article 151
Title@2025-07-30 (3): FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training
Title: FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training | FineMedLM-o1: Verbesserung des medizinischen Wissens, das die Fähigkeit von LLM vom überwachten Feintuning bis zum Test-Time Training begründet | FineMedLM-o1:提高LLM从监督的精密教学到试验时间培训的医疗知识能力 2501.09213v3 |
Authors (9): Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, Xiaobo Zhang
Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the deep reasoning required for complex medical problems, such as differential diagnosis and medication recommendations. We propose FineMedLM-o1, which leverages high-quality medical synthetic data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.
在大型语言模型(LLMS)的最近进展在疾病诊断和治疗规划等医疗应用方面显示出了希望,然而,大多数现有的医疗LMS都与复杂医疗问题所需的深刻推理,如不同诊断和药物建议等,进行了斗争。我们建议FineMedLM-o1,利用高质量的医学合成数据和长式推理数据,用于监督的精密配制(SFT)和直接优化(DPO),使先进的对话和深入推理能力成为可能。此外,我们首次在医疗领域引入了试验时间培训(TTTT),便利了域的适应,并确保了可靠和准确的推理。实验结果显示FineMedLM-o1比以前的关键医疗基准模型平均提高了23%的性能改进。此外,TT的引入提供了额外的14%的性能提升,突出其在提高医疗推理能力方面的效力。为了支持这一进程,我们还提出了一种新型的方法,用于综合医疗对话。与其他公开源数据集相比,我们的数据集在质量和复杂性上都处于优势。项目和数据将在GiHub上公布数据。
Article 152
Title@2025-07-30 (3): What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models
Title: What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models | Was ist ein “Abstract Reasoner”? Experimenten und Argumenten über große Sprachmodelle nachzuvollziehen | 什么是“抽象理由” ? 关于大语言模型的重新审视实验和争论 2507.22457v1 |
Authors (3): Tian Yun, Chen Sun, Ellie Pavlick
Recent work has argued that large language models (LLMs) are not “abstract reasoners”, citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an “abstract reasoner”, and why it matters whether LLMs fit the bill.
最近的工作认为,大型语言模型(LLMs)不是“抽象的理性”,以其在各种具有挑战性的任务上表现差的零点作为证据。我们重新审视这些实验,以便给索赔增加细微差别。首先,我们表明,虽然LLMs在零点情况下的表现确实很差,但即使调整一小撮输入编码参数也能产生接近完美的效果。然而,我们还表明,这种微调并不一定会跨越数据集。我们把收集的经验结果当作邀请(重新)开诚布公地讨论“简单理性”的含义,以及为什么LLMs是否适合帐单的问题。
Article 153
Title@2025-07-30 (3): Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
Title: Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance | Falcon-H1: Eine Familie hybrider Sprachmodelle zur Neudefinition von Effizienz und Leistung | Falcon-H1:调整效率和绩效的混合语言模式家庭 2507.22448v1 |
Authors (27): Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, Mugariya Farooq, Giulia Campesan, Ruxandra Cojocaru, Yasser Djilali, Shi Hu, Iheb Chaabane, Puneesh Khanna, Mohamed El Amine Seddik, Ngoc Dung Huynh, Phuc Le Khac, Leen AlQadi, Billel Mokeddem, Mohamed Chami, Abdalgader Abubaker, Mikhail Lubinets, Kacper Piskorski, Slim Frikha
In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.
在本报告中,我们引入了Falcon-H1系列新的大型语言模型(LLMs),该系列是混合型结构(LLMs),其设计优化,以适应不同用途案例中的高性能和效率。与以前完全建在变压器或Mamba结构上的Falcon-H1模型不同,Falcon-H1采用了一种平行混合式方法,将基于变压器的注意力与国家空间模型(SSMS)相结合,后者以高长的长文记忆和计算效率著称。我们系统地重新审视了模型设计、数据战略和培训动态,对外地的常规做法提出了挑战。Fal-H1-34B匹配或超越了常规做法。Falconom-H1以多种组合形式发布,包括0.5B、1.5B、1.5B-deep、3B、7B、7B和指示调整变量变量变压变换变量。 Qwent-32B的调整型指令模式也可用Qwen-72B和34B的参数调整模型,同时以较低的当前趋势显示IM-10的运行模式。
Article 154
Title@2025-07-30 (3): AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
Title: AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini | KI-generierte Geschichten begünstigen Stabilität gegenüber Veränderung: Homogenität und kulturelle Stereotypisierung in Erzählungen, die von gpt-4o-mini erzeugt werden | AI产生的故事有利于稳定而不是变化:在gpt-4o-mini产生的叙事中,同质性和文化陈规定型 2507.22445v1 |
Authors (2): Jill Walker Rettberg, Hermann Wigers
Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories - 50 for each of 236 countries - by sending the prompt “Write a 1500 word potential {demonym} story” to OpenAI’s model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.
在英美文本方面受过培训的语言模式能否产生在文化上与其他民族相关的故事?为了发现,我们通过向OpenAI的模型gpt-4o-mini发送“写写1500个单词潜在{demonom}故事”的提示,产生了11,800个故事——236个国家中每个国家各有50个——通过向OpenAI的模型发送“写出1,500个单词潜在{demonom}故事 {demonom}故事”。虽然故事确实包括地平面上的国家符号和主题,但它们绝大多数符合各国单一的叙事情节结构:主角生活在小城镇或回到小城镇,通过重新与传统联系和组织社区活动来解决小冲突。现实世界冲突已经消毒化,浪漫几乎不存在,叙述紧张被淡化,有利于怀旧与和解。结果是叙述式同质化:人工合成的想象,其优先特征高于变化和传统,高于增长。我们争辩说,AI产生的叙事的结构性同质构成一种独特的AI偏见形式,一种叙述标准化,应该与更熟悉的表述偏见一起得到承认。这些结论与文学研究、词学、批判性AI研究、批判性研究、NLP的基因调整和努力有关。
Article 155
Title@2025-07-30 (3): BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition
Title: BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition | BERSting at the Screams: Ein Maßstab für distanzierte, emotionale und erschrockene Spracherkennung | 尖叫时发出尖叫声:远程、情感和呼喊语音识别基准 2505.00059v2 |
Authors (9): Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim
Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world, situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR, however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were places in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.
一些语音识别任务,如自动语音识别(ASR),正在接近或已经达到许多报告指标中的人类性能。然而,它们继续在复杂的现实世界中挣扎,例如远程言论。以往的挑战已经释放出数据集,以解决远程ASR问题,然而,重点仍然主要放在距离上,具体依靠多声式阵列系统。这里我们展示的是B(asic) E(As) R(音调) R(音) R(音) R(音)) 数据集。数据集包含98个行为体的几乎4小时英语演讲,这些行为体具有不同的区域和非当地性口音。这些数据是在行为体家中智能手机上收集的,因此包括至少98个不同的声响环境。这些数据还包括7种不同的感应以及喊和口语表达。智能手机位于19个不同位置,包括障碍和位于与行为体不同的房间。这些数据可供公开使用,可用于评价各种语音识别任务,包括:ASR、警喊检测、言论感应激、感应感应感应、感应、感应感应性能、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感官、感、感官、感官、感官、感官、感官、感官、感官、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感、感)等,我们
Article 156
Title@2025-07-30 (3): Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation
Title: Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation | Mmm whatcha sagen? Enthüllen distale und proximale Kontexteffekte in der ersten und zweiten Sprache Wort Wahrnehmung mit psychophysischen umgekehrten Korrelation | 使用心理物理反向关系,在第一和第二语言的词感中产生未发现和预期的环境效应 2406.05515v2 |
Authors (7): Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim
Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.
声频背景效应,即声频、速率或音调的变化影响声音感知的周围变化,在语音感知中有详细记载,但与语言背景互动的方式仍不明朗。我们采用反向关系方法,系统地将音频和语音率在第二语言(L2)讲英语(/i/-I/)和法语(/u/-/y/)的不同配音词(L2)周围的语调(L2)周围的语调(L2)上,用英语(/i/-I/)和法语(/u//y/)和法语(/y/),从而以数据驱动的方式重建偏向其感知的立体特征。测试英语和法语演讲者(n=25),我们表明,元音调感其实受周围音频和语调率的相冲突效应影响:准准准效果(0.2s)预准效果和偏差效果达到1;发现L1和L2讲者在认知中表现出惊人相似的Prosodic剖面特征。我们提供了一种新型方法,用于调查音学背景、时间尺度和声学域间影响。
Article 157
Title@2025-07-30 (3): NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models
Title: NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models | NeedleChain: Messung der Intact-Langkontext-Begründungsfähigkeit großer Sprachmodelle | Nenelechain:计量大语言模型的精密长文理由 2507.22411v1 |
Authors (2): Hyeonseok Moon, Heuiseok Lim
The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models’ (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate given contexts made up of solely query-relevant ten sentences. In response, we introduce a novel benchmark, \textbf{NeedleChain}, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve LC understanding capability of LLM: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
虽然这种方法是评价长期理解的一种广泛接受的标准,但我们的研究结果表明,甚至诸如GPT-4o等最先进的模型也广泛用于评价大语言模型(LLMs)了解长处(LLMs)理解长处(LLC)的能力。它评价了在广泛的与查询有关的段落中查明与查询有关背景的能力。虽然这种方法是评价长处理解的一种广泛接受的标准,但我们的研究结果表明,它可能高估LCms的真正LC能力。我们表明,甚至GPT-4o等最先进的模型也被用来完整地纳入由纯查询相关的十句话组成的特定环境。我们与各高级LMS的实验显示,它们处理大处环境的能力与完全理解投入的能力之间存在明显差距。我们的基准允许灵活的背景长度和推理顺序,对LLMMs绩效进行更全面的分析。此外,我们提出了一个非常简单而又令人信服的战略来提高LCs的LM能力:ROPE Contrationion。我们在各种高级LMs的实验显示,它们在处理大处所处理大处的能力与它们完全理解AMs/MHAHIRCSD的数据和数据之间有明显差距。
Article 158
Title@2025-07-30 (3): Question Generation for Assessing Early Literacy Reading Comprehension
Title: Question Generation for Assessing Early Literacy Reading Comprehension | Fragegenerierung für die Bewertung des frühen Leseverständnisses | 评估早期阅读读写能力读写能力的提问一代 2507.22410v1 |
Authors (3): Xiaocheng Yang, Sumuk Shashidhar, Dilek Hakkani-Tur
Assessment of reading comprehension through content-based interactions plays an important role in the reading acquisition process. In this paper, we propose a novel approach for generating comprehension questions geared to K-2 English learners. Our method ensures complete coverage of the underlying material and adaptation to the learner’s specific proficiencies, and can generate a large diversity of question types at various difficulty levels to ensure a thorough evaluation. We evaluate the performance of various language models in this framework using the FairytaleQA dataset as the source material. Eventually, the proposed approach has the potential to become an important part of autonomous AI-driven English instructors.
通过基于内容的互动对阅读理解进行评估,在阅读获取过程中发挥了重要作用。在本文件中,我们提出了针对K-2英语学习者提出理解问题的新办法。我们的方法确保基础材料的完整覆盖和适应学习者的具体能力,并能够在不同的困难级别产生多种多样的问题类型,以确保进行彻底评估。我们利用FairtytaleQA数据集作为原始材料,评估本框架中各种语言模式的绩效。最终,拟议办法有可能成为AI驱动的自主英语教员的重要组成部分。
Article 159
Title@2025-07-30 (3): R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs | R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs | R2-KG:知识图表可靠理由通用双重目的机构框架 2502.12767v6 |
Authors (4): Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.
最近的研究将大语言模型(LLMs)与知识图(KGs)相结合,以加强推理,提高推理准确性,而无需额外培训,同时减轻幻觉;然而,现有框架仍然有两个实际的缺点:每当KG或推理任务发生变化时,必须重新调整现有框架;它们依赖一个单一的、高能力的LM(LLM)(可靠(即可信赖的)推理);为此,我们引入了R2-KG(一个插接插和游戏的双重试剂框架),将依赖分为两个角色:一个操作员(低容量LM)(收集证据的低容量LM)和一个主管(高容量LMM)(作出最后判断的高级LM)(高容量LM),这一设计对LM推断具有成本效益,同时仍然保持很强的推理准确性;此外,R2-KGG公司采用一个吸收足够证据的系统,只有在从KG组收集到足够证据,从而大大提高可靠性时才能找到答案;五个基准的实验显示,R2-KG(KG)在准确性和可靠性标准方面始终都比基准的基线,不管LMS的内在能力,进一步实验显示,在操作操作者使用高清晰度上,在高清晰度战略中,在高清晰度上,而能能度上,在高的KKG(KKG)在高的精确度上,在高的精确度上,在高的精确度上,在高的精确度上要达到高水平上,在高的精确度上,在精确度上,在精确度上,在高的精确度上要能能能能能能能。
Article 160
Title@2025-07-30 (3): Reservoir Computing as a Language Model
Title: Reservoir Computing as a Language Model | Reservoir Computing als Sprachmodell | 作为语言模式的 “ 储量计算 “ 模式 2507.15779v2 |
Authors (2): Felix Köster, Atsushi Uchida
Large Language Models (LLM) have dominated the science and media landscape duo to their impressive performance on processing large chunks of data and produce human-like levels of text. Nevertheless, their huge energy demand and slow processing still a bottleneck for further increasing quality while also making the models accessible to everyone. To solve this bottleneck, we will investigate how reservoir computing performs on natural text processing, which could enable fast and energy efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling, two different reservoir computing approaches, where only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost and prediction accuracy for both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient reducing the training and inference speed. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines to balance resource constraints with performance.
大型语言模型(LLM)在科学和媒体景观中占据了主导地位,在处理大量数据和制作人文文本方面的业绩令人印象深刻。然而,其巨大的能源需求和缓慢的处理仍然是进一步提高质量的瓶颈,同时也使每个人都可以使用模型。为了解决这一瓶颈,我们将调查储油层计算如何在自然文本处理中发挥作用,这可以快速和节能地实施硬件。研究储油层计算作为一种语言模型的使用仍然很少。在本文件中,我们比较了三个不同的性格语言模型使用方法:两种不同的储油层计算方法,即只有产出层可以培训的两种不同的储油层计算方法,以及众所周知的基于变压器的结构,这些结构充分学习基于关注的顺序代表。我们探索这两种模式的性能、计算成本和预测准确性,通过对所有模型的可培训参数数量进行同样的差异。我们用所有三种方法一致的管道证明变压器在预测质量方面是优秀的,而储油层计算机仍然非常高效地减少培训和推导速度。此外,我们调查了两种储油层计算方法:一种传统的储油层储油层储层储油层和固定的直线式结构,我们通过感重的阅读了它们如何调整了它们。
Article 161
Title@2025-07-30 (3): PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs
Title: PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs | PATENTWRITER: Eine Benchmarking-Studie für die Patenterstellung mit LLMs | PATENTWRITER: 专利起草基准研究与LLMs 2507.22387v1 |
Authors (3): Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya
Large language models (LLMs) have emerged as transformative approaches in several important fields. This paper aims for a paradigm shift for patent writing by leveraging LLMs to overcome the tedious patent-filing process. In this work, we present PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. Given the first claim of a patent, we evaluate six leading LLMs – including GPT-4 and LLaMA-3 – under a consistent setup spanning zero-shot, few-shot, and chain-of-thought prompting strategies to generate the abstract of the patent. Our benchmark PATENTWRITER goes beyond surface-level evaluation: we systematically assess the output quality using a comprehensive suite of metrics – standard NLP measures (e.g., BLEU, ROUGE, BERTScore), robustness under three types of input perturbations, and applicability in two downstream patent classification and retrieval tasks. We also conduct stylistic analysis to assess length, readability, and tone. Experimental results show that modern LLMs can generate high-fidelity and stylistically appropriate patent abstracts, often surpassing domain-specific baselines. Our code and dataset are open-sourced to support reproducibility and future research.
大型语言模型(LLMS)已成为若干重要领域的变革方法。本文件旨在通过利用LLMS来利用LLMs来克服无聊的专利过滤程序,实现专利写作范式的转变。在这项工作中,我们介绍PATENTWRITER,这是在专利抽象生成过程中评价LMS的第一个统一基准框架。鉴于第一项专利主张,我们根据一个涵盖零发、少发和一连串的激励策略的一致设置,对六大LMS – – 包括GPT-4和LLLAMA-3 – – 进行了评价,这六大LLMS(包括GPT-4和LLAMA-3)进行了评价,以产生专利的抽象。我们的基准PATENTWRITER超越了地表一级的评估:我们系统评估产出质量的方法包括一套综合的计量尺度 – – 标准NLP措施(例如,BLEU、ROUGE、BERSTScore),在三种类型的投入扰动作用下,以及两种下专利分类和检索任务的适用性。我们还进行了模拟分析,以评估性分析,以评估性分析,以评估长度、可读性和调和调度。实验性分析结果显示,现代LMSMSDMs能够产生高的专利基础和对未来数据库基础和升级性支持。
Article 162
Title@2025-07-30 (3): OWLViz: An Open-World Benchmark for Visual Question Answering
Title: OWLViz: An Open-World Benchmark for Visual Question Answering | OWLViz: Ein Open-World-Benchmark für visuelle Fragen | OWLViz:视觉问答的开放世界基准 2503.07631v3 |
Authors (6): Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai
We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems’ ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.
我们为Open WorLd Visual 答题(OWLViz)的任务提出了一个具有挑战性的基准。 OWLViz 给出了简明、明确的询问,要求整合多种能力,包括视觉理解、网络探索和专门工具使用。 虽然人类在这些直观任务上实现了69.2%的准确性,但即使是最先进的VLM,其最佳模型是Gemini 2.0, 其准确性仅为26.6%。目前依赖有限的愿景和愿景语言模型作为工具的VLMs,其表现更差。这一绩效差距揭示了多式联运系统在选择适当工具和执行复杂推理序列、为推进实际的AI研究确定新方向方面的巨大局限性。
Article 163
Title@2025-07-30 (3): Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Title: Multimodal LLMs as Customized Reward Models for Text-to-Image Generation | Multimodale LLMs als maßgeschneiderte Reward-Modelle für die Text-zu-Image-Generierung | 以多式多式LLMs作为生成文字到图像的自定制奖励模型 2507.21391v2 |
Authors (8): Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
我们引入了LLAVA-Reward(LLAVA-Reward),这是一个高效的奖励模式,旨在从多种角度自动评价文字到图像(T2I)的几代人,利用预先培训的多式联运大型语言模型(MLLLM)。基于MLLM(MLLM)的现有方法需要指导跟踪数据,用于监督微调,并评估分析文本反应的生成质量,这既耗时又难于培训。为了解决这一问题,我们提议LLLAVA-Reward(Reward)直接利用MLLMS(MLimage)的隐藏状态,直接利用MLLLM(T-image)的文本图像。为了加强光学和文本代表之间的双向互动,我们进一步提议增加一个Skipple-contion Crostition(Skip-CA)模块。这一设计通过将早期视觉特征与后层隐藏的表达方式联系起来,从而增强文字的关联性推介关系。此外,LAVA-Refer(LA-A-A-A-A-A-LVA-A-SAR)级的升级方法展示,以展示-A-A-A-A-AD-SLVD-SD-SD-SD-SLD-SD-SLD-SD-S-S-S-S-S-S-SD-S-S-S-SAR-SD-SD-S-SD-SD-S-SD-SD-SD-SD-SD-SD-S-S-S-S-S-S-SD-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-SD-S-S-SD-SD-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-A-A-S-S-S-A-A-S-S-S-A-
Article 164
Title@2025-07-30 (3): BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Title: BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity | BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity | 块块FFN: 向具有整块级激活分级的 终端- 双极加速- 友好混合混合专家方向 2507.08771v2 |
Authors (8): Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup on real end-side devices than dense models. All codes and checkpoints are available publicly (https://github.com/thunlp/BlockFFN).
为了减轻大型语言模型(LLMS)的计算负担,以专家混合(MoE)为代表的具有激活性弹性的架构(LLMS)吸引了越来越多的关注。然而,香草MoE的无差别和不灵活路线令模式伤害了模型性能。此外,尽管每个象征性的架构只激活了几个参数,但这些分散活跃的架构却呈现出低块水平的宽度,表明多个连续代号的结合激活了巨大的参数比例。这样的松散模式对于在低资源条件(例如,终端设备)下加速运行不方便,而且与主流加速技术(例如,投机性解码)不兼容。为了应对这些挑战,我们引入了全新的MOE架构(BlubFFN)及其高效培训和部署技术。具体地说,我们使用一个将RELU的激活和RMSNormm 整合到不同和灵活路线上的路径上。接下来,在Sal-levelopmental-ality(TLS)和块级级的终端设备(CLS)中(CS-LS-s-wapildal-loadalalalalalalalal-dealalalalalalalal) strationalalalalalalalal 和3.(Cal-deal-deal-dealal) 80) 目标是设计、CLFMal-deal-deal-deal-dealizaldal-dealmentalmentalmental-dealmentalmentalal 80 80 。最后运行,为我们80 和80的升级化的加速性能、CLFMFMFTalmental-deal-deal-deal-deal-deal-deal-deal-tamental-tamental-al-al-al-al-deal-tamental-tamental-deal-deal-tamental-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-al-al-deal-al-al-deal-deal-deal-al-al-al-al-al-al-al-al-al-deal-
Article 165
Title@2025-07-30 (3): Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors
Title: Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors | Traits Run Deep: Verbesserung der Persönlichkeitsbeurteilung durch psychologisch geführte LLM-Darstellungen und multimodale Scheinverhalten | 深层轨迹:通过心理学辅导LLM代表和多模式亲善行为,加强个性评估 2507.22367v1 |
Authors (7): Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang
Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It was hard to model personality semantics with traditional superficial features and seemed impossible to achieve effective cross-modal understanding. To address these challenges, we propose a novel personality assessment framework called \textit{\textbf{Traits Run Deep}}. It employs \textit{\textbf{psychology-informed prompts}} to elicit high-level personality-relevant semantic representations. Besides, it devises a \textit{\textbf{Text-Centric Trait Fusion Network}} that anchors rich text semantics to align and integrate asynchronous signals from other modalities. To be specific, such fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45\% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method’s superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.
准确和可靠的人格评估在许多领域,例如情感智能、心理健康诊断和个性化教育,都发挥着关键的作用。与时俱进的情感不同,个性特征是稳定的,通常通过语言、面部表达方式和身体行为在潜意识中渗漏,不同模式的形态是非同步的。很难模拟具有传统表面特征的个性语义,而且似乎不可能实现有效的跨模式理解。为了应对这些挑战,我们提议了一个名为\ textitut thextb{Traits run Deep的新性人格评估框架。它使用\ textitle thextb{精神健康诊断和个性化教育。 与时俱进的个性特征特征特征特征特征特征是稳定的稳定,此外,它设计出一个具有丰富文字语义的语义表达, 将一个Chunk-Wisetical Projectorationorlation to developmental dislationality A-dealalalalal ress reviews 用于有效的变现变现工具。
Article 166
Title@2025-07-30 (3): Masked Language Models are Good Heterogeneous Graph Generalizers
Title: Masked Language Models are Good Heterogeneous Graph Generalizers | Masked Language Models sind gute Heterogene Graph Generalizers | 遮罩语言模型是好异基因图形缩略图 2506.06157v2 |
Authors (8): Jinyu Yang, Cheng Yang, Shanyuan Cui, Zeyuan Guo, Liangwei Yang, Muhan Zhang, Zhiqiang Zhang, Chuan Shi
Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. With the rapid advancement of large language models (LLMs), a recent study explored the integration of HGNNs with LLMs for generalizable heterogeneous graph learning. However, this approach typically encodes structural information as HG tokens using HGNNs, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM’s comprehension of HGs. Moreover, since these HG tokens are often derived from node-level tasks, the model’s ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style ‘mask’ token prediction paradigm. Specifically,MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form a HG-based corpus. Moreover, the corpus is fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.
厚度图形神经网络(HGNNS) 擅长捕捉混杂图形(HGNS) 中的结构性和语义信息,同时努力在各领域和任务之间推广。随着大型语言模型(LLMS)的快速进步,最近的一项研究探索了HGNs与LLMS的整合,以便进行通用的混杂图形学习。然而,这种方法通常会将结构信息编码为HG表示使用HGNNS的HGmost,而将HGNNs和LLLMMMS之间的空间嵌入差异显示为对LMMLMTM的交叉理解。此外,由于这些HGMS往往来自节点目标层面的任务,因此该模型对任务进行统称化的能力仍然有限。为此,我们提出了一个简单而有效的、有效的基于通用语言模型的MLMHMHMH模型方法,将基于HGML的结构性和基于GMMMMML的MLMS-Maldaldaldal 版本用于提取基于H的模板。
Article 167
Title@2025-07-30 (3): Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning
Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning | Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung | 利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题 2505.21354v2 |
Authors (5): Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah
Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language’s low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.
解决孟加拉数学字数问题(MWPs)仍然是自然语言处理(NLP)的一大挑战,原因是该语言资源水平低,需要多步推理。现有的模型与复杂的孟加拉语言模型(LLMMS)挣扎,这主要是因为以前没有人类附加说明的孟加拉语数据集。这一差距限制了孟加拉语数学推理的进展。为了解决这个问题,我们创建了由8792个复杂的孟加拉语模型组成的数据集SOMADHAN,配有手写、逐步的解决方案。我们设计了这一数据集,以支持在语言代表性不足的背景下进行注重逻辑的评价和模型开发。我们利用SOMAADHAN评估了一系列大型语言模型(LLMMSM),包括GPT-4o、GPT-3.5 Turbo、LLMMA系列模型、Deepseek和Quwen-通过零发和几发的推理(CoT)推理推理来不断改进标准,特别是在需要多步逻辑的情况下,标准推理的推理。LMA-3.3-RMM-B在高层次推理学上实现了88的精度的精准,我们的数据推理成本,我们用低的精确推理学推理学推理学推理,我们用低的推理推理推理学的推理,我们用高的推理的推理,我们用低的推理学的推理的推理的推理的推理的推理方法推理方法推理,还了88888,也提高了了高的推理。
Article 168
Title@2025-07-30 (3): MuSciClaims: Multimodal Scientific Claim Verification
Title: MuSciClaims: Multimodal Scientific Claim Verification | MuSciClaims: Multimodale wissenschaftliche Antragsprüfung | 穆西索赔: 多式联运科学索赔核实 2506.04585v2 |
Authors (6): Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian
Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostics tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.
评估科学主张需要确定、提取和推理科学文献中信息丰富数字中表述的多式数据。尽管科学质量评估、图表说明和其他基于图表的数据的多式联运推理任务方面做了大量工作,但没有直接测试核实能力的现用多式联运基准。为了弥补这一差距,我们引入了新的基准MuSci要求,并辅以诊断任务。我们自动从科学文章中提取支持性主张,我们人工干扰这些主张,以产生自相矛盾的主张。扰动是为了测试一套具体的索赔核实能力。我们还引入了一套有助于理解模型失败的诊断任务。我们的结果显示,大多数愿景语言模型都很差(~0.3-0.5 F1),即使最佳模型也只能达到0.72 F1。它们也偏向于判断索赔所支持的索赔,可能存在误解,对索赔中的扰动进行细微分化。我们的诊断显示模型不利于在数字中将正确证据本地化,与各种方式的信息拼凑,而且往往无法理解数字的基本组成部分。
Article 169
Title@2025-07-30 (3): A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers
Title: A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers | Eine umfassende Taxonomie der Negation für NLP und Neuralretriever | NLP和神经再研究综合清点分类 2507.22337v1 |
Authors (4): Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas
Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.
理解和解决复杂的推理任务对于满足用户的信息需求至关重要。虽然密集的神经模型学会了背景嵌入,但它们在含有否定的查询方面仍然表现不佳。为了理解这一现象,我们在传统的神经信息检索和基于LLM的模型中都研究否定现象。我们(1) 采用哲学、语言和逻辑定义产生的否定分类法;(2) 产生两个基准数据集,可用于评价神经信息检索模型的性能和微调模型,以便在否定方面进行更强的性能;(3) 提出一个基于逻辑的分类机制,可用来分析现有数据集检索模型的性能。我们的分类法在否定类型上产生了平衡的数据分布,提供了更好的培训设置,使NevIR数据集更快地趋于一致。此外,我们提出一个分类方案,显示现有数据集中否定类型的范围,对可能影响到对否定的精确模型的普及。
Article 170
Title@2025-07-30 (3): Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing
Title: Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing | Prompt-Reverse Inkonsistenz: LLM Selbstinkonsistenz jenseits generativer Zufälligkeit und prompt Paraphrasierung | 迅速反向不一致:LLM 自我不连贯,超越发生性随机和迅速划线 2504.01282v2 |
Authors (2): Jihyun Janice Ahn, Wenpeng Yin
While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM multiple trials, yielding varying responses; ii) Paraphrase Inconsistency: paraphrased prompts result in different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often has conflicting responses when prompted “Which are correct answers?” and “Which are incorrect answers?”. PRIN poses a big concern as it undermines the credibility of LLM-as-a-judge, and suggests a challenge for LLMs to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.
虽然LLMs的不一致并不是一个新专题,但先前的研究主要解决了两类基因不一致的问题:(一) 随机性:执行同样的LLM多重审判,得出不同的答复;(二) 原教旨不一致性:由同一LLM作出不同的答复。 随机性不一致性产生于因基因模型的随机抽样而固有的随机性,而原教旨不一致性是语言模型目标的结果,用原言来解释的迅速性改变了词汇记录表的分发。本研究发现迅速反偏向性(PRIN)是一种新的LLM自我不一致性:给一个问题和几个LLM产生的应答候选人带来不同的答复。LMM常常在“答案正确”和“答案不正确”时作出相互冲突的答复。 PRIN是一个非常令人关切的问题,因为它破坏了LM-a-a-a-judge的可信度,并且建议LMS内部的遵守基本逻辑规则。我们进行了一系列实验,以调查PRIN(PRI)的透明性研究,研究其价值性应用程度,并研究它与不同LIS-LIS的透明性研究。
Article 171
Title@2025-07-30 (3): Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges
Title: Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges | Natürliche Sprachverarbeitung für den Rechtsbereich: Eine Übersicht über Aufgaben, Datensätze, Modelle und Herausforderungen | 法律领域自然语言处理:任务、数据集、模型和挑战调查 2410.21306v3 |
Authors (3): Farid Ariai, Joel Mackenzie, Gianluca Demartini
Natural Language Processing (NLP) is revolutionising the way both professionals and laypersons operate in the legal field. The considerable potential for NLP in the legal sector, especially in developing computational assistance tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 131 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document lengths, complex language, and limited open legal datasets. We provide an overview of NLP tasks specific to legal text, such as Document Summarisation, Named Entity Recognition, Question Answering, Argument Mining, Text Classification, and Judgement Prediction. Furthermore, we analyse both developed legal-oriented language models, and approaches for adapting general-purpose language models to the legal domain. Additionally, we identify sixteen open research challenges, including the detection and mitigation of bias in artificial intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.
自然语言处理(NLP)在法律领域改变了专业人员和外行人员的运作方式,在法律部门,特别是在为各种法律程序开发计算协助工具方面,国家语言处理具有相当大的潜力,多年来引起了研究人员的兴趣,这项调查遵循了系统审查和元分析框架的首选报告项目,审查了154项研究,在人工过滤后最后挑选了131项研究,在法律领域探索了与国家语言处理方法有关的基本概念,说明了法律文本处理的独特方面和挑战,如文件长度大、语言复杂和开放法律数据集有限等。我们概述了国家语言处理方法的具体任务,如文件总结、实体识别、问题回答、标名、标榜采矿、文本分类和判断。此外,我们分析了以法律为导向的语言模式和使通用语言模式适应法律领域的方法。此外,我们查明了16项公开研究挑战,包括发现和减少人工智能应用中的偏差,需要更有力和可解释的语言模型,以及改进处理复杂程度。
Article 172
Title@2025-07-29 (2): Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations
Title: Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations | Intent Recognition und Out-of-Scope-Erkennung mit LLMs in Multi-Party-Konversationen | 在多方对话中使用LLMs 2507.22289v1 |
Authors (3): Galo Castillo-López, Gaël de Chalendar, Nasredine Semmar
Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amount of annotated data. In this work we propose a hybrid approach to combine BERT and LLMs in zero and few-shot settings to recognize intents and detect OOS utterances. Our approach leverages LLMs generalization power and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvement.
主动认识是面向任务的对话系统(TODS)的一个基本组成部分。确定用户意图并查明某种意图是否超越范围(OOS)对于TODS提供可靠的反应至关重要。然而,传统的TODS需要大量附带说明的数据。在这项工作中,我们提议采用混合办法,将BERT和LLMs在零和几发环境中结合起来,以确认意图并检测OS的全局性能。我们的方法利用LLMS的概括性能和BERT在这类情况下的计算效率。我们评估了我们关于多党对话公司的方法,并观察到从BERT产出到LOMS共享信息可以改善系统性能。
Article 173
Title@2025-07-29 (2): Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs
Title: Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs | Bedeutungsverstärkte Grammatik: Gradient Akzeptabilität formt die geometrischen Darstellungen von Konstruktionen in LLMs | 含义内含语法:逐渐可接受性形状 LLM 中工程的几何表示法 2507.22286v1 |
Authors (2): Supantho Rakshit, Adele Goldberg
The usage-based constructionist (UCx) approach posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze the neural representations of the English dative constructions (Double Object and Prepositional Object) in Pythia-$1.4$B, using a dataset of $5000$ sentence pairs systematically varied for human-rated preference strength. A macro-level geometric analysis finds that the separability between construction representations, as measured by Energy Distance or Jensen-Shannon Divergence, is systematically modulated by gradient preference strength. More prototypical exemplars of each construction occupy more distinct regions in the activation space of LLMs. These results provide strong evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of basic constructionist principles in LLMs.
以使用为基础的建筑师(UCx)方法假定,语言包括一个知识形式的配对网络(建筑),其使用主要取决于其含义或功能,要求其分级和概率。本研究调查了大语言模型的内部代表是否反映了拟议的功能抑制光度。我们分析了Pythia-1.4美元B中英国配制建筑(二重物体和前置物体)的神经表现,使用了一套因人称优惠强度而系统变化的5 000美元判刑配对数据集。一项宏观的几何分析发现,按能源距离或詹森-汉诺分辨法衡量的建筑代表之间是否具有系统性的分离性,由梯度偏好强度加以调节。我们分析了在激活LLMS空间中更多具有超典型特征的建筑设计师占据了较不同的区域。这些结果提供了有力的证据,证明LMS学会了丰富的、含意的、分级的建筑表现,并为LMS基本建筑原则的几何测量度措施提供了支持。
Article 174
Title@2025-07-29 (2): CoEx – Co-evolving World-model and Exploration
Title: CoEx – Co-evolving World-model and Exploration | CoEx – Co-evolving World-Modell und Exploration | CoEx – – 共同发展的世界模式和探索 2507.22281v1 |
Authors (2): Minsoo Kim, Seung-won Hwang
Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.
现代LLM代理商的规划依赖于将LLM作为内部世界模型,在培训前获得;然而,现有代理商的设计未能有效地将新的观测结果吸收到世界模型的动态更新中。对LLM静态的内部世界模型的这种依赖,逐渐会与世界基本真实状态不相符,导致产生不同和错误的计划。我们引入了等级代理结构CoEx,在这个结构中,等级国家抽象使LLM计划与动态更新的世界模型共同演变。CoEx计划和与世界互动,利用LLM推理来协调由次级目标组成的动态计划,其学习机制不断将这些次级目标经验纳入一个持久的世界模型中,其形式是神经共性信仰状态,由文字推理和基于代码的象征记忆组成。我们评估了我们的代理商在包括ALFWorld、PDDL和杰里科在内的多种涉及丰富环境和复杂任务的代理商设想方案。我们的实验表明,CoEx在规划和探索方面超越了现有的代理商范式。
Article 175
Title@2025-07-29 (2): Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence | Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz | 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v3 |
Authors (10): Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
AI agents today are mostly siloed - they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.
今天,AI代理机构大多是空置的 – – 它们要么对在网上获得的大量数字信息和知识进行检索和解释;要么通过体现的认知、规划和行动与物理世界互动,但两者都很少。这种分离限制了他们解决需要综合物理和数字情报的任务的能力,例如用在线食谱烹饪,用动态地图数据浏览,或者利用网络知识解释真实世界的里程碑。我们引入Embodied网络代理机构,这是AI代理机构的一种新颖范例,可以流传地连接成形和网络规模推理。为了落实这一概念,我们首先开发了Embudied网络代理机构的任务环境,一个将现实的3D室内和室外环境与功能的网络界面紧密结合的统一模拟平台。在这个平台上,我们建造和发布Embodied网络代理机构基准,它包含各种各样的任务,包括烹饪、导航、购物、旅游和地理定位等,所有这些任务都需要跨物理和数字领域的协调推理,以便系统地评估跨多域情报。实验结果揭示了国家-艺术AI系统和人类能力之间的重大业绩差距,一个统一的模拟平台,既能连接,又能将挑战与机会紧密地结合的网络网站/网络网站。
Article 176
Title@2025-07-29 (2): Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Title: Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering | Denoising Concept Vectors mit Sparse Autoencodern für verbesserte Sprachmodellsteuerung | 用于改进语言模式指导的与斯普鲁斯自动编码器一起的失言概念矢量 2505.15038v2 |
Authors (6): Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16\% across six challenging concepts, while maintaining topic relevance.
线性概念矢量有效地引导了LLMS,但现有方法在各种数据集中都存在噪音,破坏了方向的稳健性。我们提议采用Sparse Autoencoder-Denoized概念矢量(SDCV ) , 有选择地保留最具歧视性的 SAE 潜值,同时重建隐藏的表达方式。我们的关键见解是,通过扩大最能区分正和负样的顶层潜值的激活,可以将概念相关信号与数据集噪音明确分开。适用于线性探测和中值差异,SDCV 不断提高6个具有挑战性的概念的成功率,同时保持主题相关性。
Article 177
Title@2025-07-29 (2): Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs
Title: Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs | Modeling Story Erwartungen, Engagement zu verstehen: Ein generatives Framework mit LLMs | 模拟对理解参与的理论期望:利用LLMM的生成框架 2412.15239v3 |
Authors (2): Hortense Fong, George Gui
Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs of what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement-continuing to read, commenting, and voting-are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.
了解消费者何时和为何参与故事对于内容创作者和平台至关重要。虽然现有的理论表明,受众对即将发生的事情的信念应在参与决定中发挥重要作用,但经验性工作主要侧重于开发直接从实际内容中提取特征的技术,而不是捕捉前瞻性信仰,因为缺乏一种原则性方法来将这种信仰建模在结构化的叙述性数据中。为了补充现有的特征提取技术,本文件提出了一个新颖的框架,利用大型语言模型来模拟受众对故事如何展开的前瞻性信仰。我们的方法为每个故事产生多种潜在延续,并用既定内容分析技术提取与期望、不确定性和惊喜有关的特征。我们将我们的方法应用到超过30,000个书章中,我们证明我们的框架补充了现有特征工程技术,平均扩大了31%的边际解释力。结果显示,不同的接触类型延续了当前和预期内容特征的不同组合。我们的框架提供了一种新颖的方法,用于研究和探索受众如何前瞻性信仰影响其与叙述性媒体的接触,对内容重心产业的营销战略产生了影响。
Article 178
Title@2025-07-29 (2): ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling | EKG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung | ECG-Byte: 终端到 En-En Energy 电动心电图语言建模调控器 2412.14373v3 |
Authors (5): William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48\% of the data required by traditional two-stage methods.
大型语言模型(LLMS)展示了不同领域的特殊多功能性,包括对心电图的应用。越来越多的工作重点是从多渠道ECG信号和相应的文本提示生成文本。现有方法往往涉及一个两阶段过程:先对ECG专用编码器进行自我监督学习(SSL)目标的培训,然后对用于自然语言生成的LLM(NLG)进行微调,使用编码器衍生特征。然而,这些方法面临两个主要的局限性:由于多阶段培训和解释编码器生成特征方面的挑战,效率低下。为了克服这些问题,我们提议ECG-Byte,这是为ECGs自动递增语言模型改编成的配对代用品管道。ECG-Byrest 和 ECG 信号编码为代号,通过将ECG和文本符号合并,使直接端到端LM培训成为直接培训。这种方法提高了解释性,因为ECG的代号可以直接映射回原信号,而我们则需要使用48种具有竞争力的NBE-BSL方法,我们只需通过48种具有竞争力的C-C-C-C-BSpeat-CSy 进行快速的测试。
Article 179
Title@2025-07-29 (2): GneissWeb: Preparing High Quality Data for LLMs at Scale
Title: GneissWeb: Preparing High Quality Data for LLMs at Scale | GneissWeb: Hochqualitative Daten für LLMs im Maßstab vorbereiten | GneissWeb: 为缩放 LLMs 准备高品质数据 2502.14907v2 |
Authors (32): Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Subhashrao Borse, Yuan-Chi Chang, Xuan-Hong Dang, Nirmit Desai, Revital Eres, Ran Iwamoto, Alexei Karve, Yan Koyfman, Wei-Han Lee, Changchang Liu, Boris Lublinsky, Takuyo Ohko, Pablo Pesce, Maroun Touma, Shiqiang Wang, Shalisha Witherspoon, Herbert Woisetschläger, David Wood, Kun-Lung Wu, Issei Yoshida, Syed Zawad, Petros Zerfos, Yi Zhou, Bishwaranjan Bhattacharjee
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM’s ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points advantage over those trained on FineWeb-V1.1.0.
高质量的数据,尤其能够大大提高LLM在一系列下游任务方面的普及能力。主要LLM的大型培训前数据集仍然无法为公众所接受,而许多开放数据集的规模小(少于5万亿个符号),限制了其培训大型模型的适宜性。在本文中,我们引入了GneissWeb,一个大型数据集,产生大约10万亿个符号,满足培训LM的数据质量和数量要求。我们的GneissWeb制成的精确的子字符串除和明智地构建的质量过滤器集合构成的精细小培训前数据集。GneissWeb在数据质量和数量之间实现了有利的交换,产生了超出在开放大数据集方面受过培训的模型(5+万亿个比例)。我们展示了使用GneisWeb制的模型所培训的优于经过培训的1.0个基准数,在FineWeb-Vserb 标准中,在经过培训的181.0级标准中实现了2.73个基准,在标准前实现了2.73个标准中,在平均标准中实现了2.73个基准。
Article 180
Title@2025-07-29 (2): LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Title: LLM-as-a-qualitative-judge: automating error analysis in natural language generation | LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen | LLM-as-as-法官法官:在自然语言生成中进行自动误差分析 2506.09147v2 |
Authors (7): Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
推动大型语言模型(LLM)(LLM-as-a-judge),以评价生成的文本,称为LLM-a-a-judge(LLM),已成为自然语言生成的一种标准评价方法,但主要用作量化工具,即以数字分数作为主要产出。在这项工作中,我们提议LLM-as-as-a-qilitiative-judge(LLM),以LLM(LM)为基础的评价方法,主要产出为关于NLG系统产出中常见问题类型的结构化报告。我们的方法旨在向开发者提供有意义的见解,使其了解对特定NLG系统可作出哪些改进,包括两个主要步骤,即不限每次审议问题分析,并利用直观累积算法对发现的问题进行分组。我们还提出了一项评价拟议方法的战略,加上12 NLG数据集中的问题说明~300。我们的结果显示,LLM-as-as-a-qlitial-judge(LM)正确地确认2/3个案例中的特定问题,能够产生错误类型报告,并重印由人类说明的报告。我们的代码和数据可在http://qlistal-mas-qal-qal-qual-s-s-s-s-s-qual-s-s-s-s-s-qir-qir-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-qut-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s
Article 181
Title@2025-07-29 (2): RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Title: RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation | RL von Lehrer-Modell-Verfeinerung: Graduale Imitation Lernen für maschinelle Übersetzung | 教师-模式改进:机器翻译逐步模拟学习 2507.22219v1 |
Authors (3): Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Preference-learning methods for machine translation (MT)–such as Direct Preference Optimization (DPO)–have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals–(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy–the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
机械翻译(MT)的参考学习方法——例如直接偏好优化(DPO)——已经取得了令人印象深刻的成绩,但在很大程度上依赖于大型、仔细整理的三重数据集,而且往往难以超越其调校范围。我们提议“教师-模版精炼强化学习”(RLfR),这是一个新颖的框架,它通过利用外部教师模式(GPT-4o)的连续、高质量的反馈,消除对静态三重技术的依赖。RLfR将每一翻译步骤作为微观研究:行为者产生一种假设,教师加以完善,而行为者根据它与教师的精细校准如何紧密配合而得到奖励。我们建议,我们建议采用两个互补信号-(i)负编辑距离,促进词汇性和结构上的忠诚性,以及(ii)知识与技术伦理学的评分,确保行为者逐渐学习模仿教师,通过递增、迭代改进反映人类学习过程。关于FLORES-200基准(英语和德语、西班牙语、中文、韩语和日语),RfR-FT的评分差比(不断提高基准)和MMMMMT的优等基准。
Article 182
Title@2025-07-29 (2): Can adversarial attacks by large language models be attributed?
Title: Can adversarial attacks by large language models be attributed? | Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden? | 大型语言模式的对抗性攻击能否归结为对抗性攻击? 2411.08003v3 |
Authors (3): Manuel Cebrian, Andres Abeliuk, Jan Arne Telle
Attributing outputs from Large Language Models (LLMs) in adversarial settings-such as cyberattacks and disinformation campaigns-presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions-each open-source model fine-tuned on at most one new dataset-the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.
将大语言模型(LLMs)的输出归结为对抗性环境(如网络攻击和不信息运动)中的大语言模型(LLMs)的产出,这带来了可能越来越重要的重大挑战。我们从理论角度和经验角度处理这一归因问题,借鉴的是正式语言理论(在限度内确定)和对不断扩大的LLM生态系统的数据驱动分析。通过将LLM的一组可能的产出建模成一种正式语言,我们分析的是文本的有限样本是否能够独特地定位出原模型。我们的结果显示,在对模型之间能力重叠的微小假设下,某些LLMs类别基本上无法从它们的输出中辨别出来。我们从理论的明显可辨别性角度从理论角度和实验的角度来处理这一归别问题。 我们界定了四种不同的理论可辨别性制度:(1) 无限的确定性(差异性) LLMMs语言是1967年的经典结果;(2) 无限的概率LMs(通过确定性案例的延伸,确定性案例)(3) 确定性组合模型是可识别的(符合Angluin loudalalalal commal ex ex ex liversation liversation ex) as the the the folview ex in the folver ex ex ex ex the folview lievations in the fearmations in the ex the ex ex the folviolview ex immations impolverations imations ex ex ex immationsmations ex the the the the the thes ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex the the the the flipoltime ex ex thes ex ex ex ex ex ex the the thesmationsmationsmations mations mations mations ex ex ex ex ex ex the the the the the the the the the the thes ex.
Article 183
Title@2025-07-29 (2): How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?
Title: How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor? | Wie gut ist die Erst-Token-Entropie ungefähre Wort-Entropie als psycholinguistischer Vorhersager? | 作为心理语言学预测者,第一到真真真真真真真假 近似单字真真真假如何? 2507.22209v1 |
Authors (3): Christian Clark, Byung-Doh Oh, William Schuler
Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
上下文引力是一种心理语言测量方法,它反映了在遇到一个单词之前处理一个单词的预期困难。最近的研究已经测试了与酶有关的影响,作为超原的已知效应的一种潜在补充。为了方便起见,英特罗比通常根据一个语言模型对单词第一个子字符号的概率分布来估计。然而,这一近似结果导致低估和可能扭曲真实单词的英特罗比。为了解决这个问题,我们生成了蒙特卡洛(MC)对单词导力的估计,使单词能够跨越一个可变数的符号。阅读时间的反常实验显示第一调和MC字的英特罗比之间的不同结果,这表明在使用一调方言的方言外引力近似值时需要谨慎。
Article 184
Title@2025-07-29 (2): The role of media memorability in facilitating startups’ access to venture capital funding
Title: The role of media memorability in facilitating startups’ access to venture capital funding | Die Rolle der Medienerinnerung bei der Erleichterung des Zugangs von Start-ups zu Risikokapitalfinanzierungen | B. 媒体在便利初创企业获得风险资本资金方面的作用 2507.22201v1 |
Authors (3): L. Toschi, S. Torrisi, A. Fronzetti Colladon
Media reputation plays an important role in attracting venture capital investment. However, prior research has focused too narrowly on general media exposure, limiting our understanding of how media truly influences funding decisions. As informed decision-makers, venture capitalists respond to more nuanced aspects of media content. We introduce the concept of media memorability - the media’s ability to imprint a startup’s name in the memory of relevant investors. Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), we show that media memorability significantly influences investment outcomes. Our findings suggest that venture capitalists rely on detailed cues such as a startup’s distinctiveness and connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation. In practice, startups should go beyond frequent media mentions to strengthen brand memorability through more targeted, meaningful coverage highlighting their uniqueness and relevance within the broader industry conversation.
在吸引风险资本投资方面,媒体声誉在吸引风险资本投资方面起着重要作用。然而,先前的研究过于狭隘地侧重于一般媒体曝光,限制了我们对媒体如何真正影响供资决策的理解。作为知情的决策者,风险资本家对媒体内容中更为细微的方面作出反应。我们引入了媒体可保性的概念 — — 媒体在相关投资者记忆中刻画启动者名字的能力。我们利用197家英国微型和纳米技术部门新创办企业(1995年至2004年提供资金)的数据,表明媒体可保性极大地影响投资成果。我们的调查结果表明,风险资本家依赖诸如启动企业的独特性和新闻语义网络的连通性等详细线索。这有助于对创业融资和媒体合法性的研究。在实践中,创业企业应超越经常提到的媒体,通过更有针对性的、更有意义的报道来强化品牌可保性,突出其在更广泛的产业对话中的独特性和相关性。
Article 185
Title@2025-07-29 (2): Basic Reading Distillation
Title: Basic Reading Distillation | Grundlesedestillation | 基础阅读蒸馏 2507.19741v2 |
Authors (5): Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang
Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources which limits their deployment in real-world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are \emph{unrelated} to downstream tasks. In this paper, we propose basic reading distillation (BRD) which educates a small model to imitate LLMs basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model on various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparable to over 20x bigger LLMs. Analysis reveals that BRD effectively influences the probability distribution of the small model, and has orthogonality to either knowledge distillation or task distillation.
大型语言模型(LLMS)在各种自然语言处理领域表现出了非凡的能力,但是它们需要高的计算资源来限制其在现实世界中的部署。 蒸馏是一种通过知识蒸馏或任务蒸馏来解决这个问题的方法。 两种蒸馏方法都训练小模型来模仿LLMS的具体特征, 但是它们都忽略了对与下游任务有关的通用文本小模型的基本阅读教育。 在本文中, 我们提议了基础阅读蒸馏(BRD) , 用来教育一个小模型来模仿LMS的基本阅读行为, 比如名称的实体识别、 问题提炼和回答。 在进行这种基本教育之后, 我们将小模型应用于各种任务, 包括语言推断基准和 BIG-bench 任务。 它表明小模型可以超越或执行相当于20x以上大LMS。 分析表明, BRD有效地影响小模型的概率分布, 并且对知识蒸馏或任务蒸馏具有任意的分辨性。
Article 186
Title@2025-07-29 (2): Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence
Title: Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence | Erklärbarkeit durch Systematik: Die harte Systematik-Herausforderung für künstliche Intelligenz | 系统化解释:人工智能的硬系统化挑战 2507.22197v1 |
Authors (1): Matthieu Queloz
This paper argues that explainability is only one facet of a broader ideal that shapes our expectations towards artificial intelligence (AI). Fundamentally, the issue is to what extent AI exhibits systematicity–not merely in being sensitive to how thoughts are composed of recombinable constituents, but in striving towards an integrated body of thought that is consistent, coherent, comprehensive, and parsimoniously principled. This richer conception of systematicity has been obscured by the long shadow of the “systematicity challenge” to connectionism, according to which network architectures are fundamentally at odds with what Fodor and colleagues termed “the systematicity of thought.” I offer a conceptual framework for thinking about “the systematicity of thought” that distinguishes four senses of the phrase. I use these distinctions to defuse the perceived tension between systematicity and connectionism and show that the conception of systematicity that historically shaped our sense of what makes thought rational, authoritative, and scientific is more demanding than the Fodorian notion. To determine whether we have reason to hold AI models to this ideal of systematicity, I then argue, we must look to the rationales for systematization and explore to what extent they transfer to AI models. I identify five such rationales and apply them to AI. This brings into view the “hard systematicity challenge.” However, the demand for systematization itself needs to be regulated by the rationales for systematization. This yields a dynamic understanding of the need to systematize thought, which tells us how systematic we need AI models to be and when.
本文认为,解释性只是影响我们对人工智能(AI)期望的更广泛理想的一个方面。 从根本上说,问题在于AI在多大程度上表现出系统性 — — 不仅仅是对思想如何由可重新组合的成分组成具有敏感性,而是在努力形成一个一致、一致、全面、神秘的集思广益的综合思想体系。这种更丰富的系统性概念被联系主义“系统化挑战”的长阴影所掩盖,根据这种观念,网络结构从根本上与Fodor和同事所称的“系统化思维”有矛盾。 我提供了一个概念框架,用于思考“系统化思维”如何区分四个概念。我利用这些区别来缓解系统性与关联主义之间的明显紧张关系,并表明系统化概念在历史上左右着我们思维理性、权威和科学感的观念比Fodorian概念更为艰巨。为了确定我们是否有理由让AI模式符合这种系统化理想,我随后指出,我们必须研究系统化的理论原理,并探索如何系统化的理论化,从而让AI本身具有何种程度的理念。
Article 187
Title@2025-07-29 (2): Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Title: Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation | Déjà Vu: Mehrsprachige LLM-Evaluierung durch die Lens of Machine Translation Evaluation | Déjà Vu:通过机器翻译评价的镜头进行多种语文LLM评价 2504.11829v3 |
Authors (5): Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Kocmi Tom
Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.
多语文大型语言模型(MLLMs)的生成能力和语言覆盖面正在迅速发展,然而,对MLLMs的基因能力的评价做法仍然缺乏全面性、科学严密性和跨研究实验室的一致采用,这削弱了它们有意义地指导MLLM发展的潜力。我们与机器翻译(MT)评价平行进行,这个领域面临类似的挑战,数十年来为多语文基因模型制定了透明的报告标准和可靠的评价。通过在基因化评价管道的关键阶段进行有针对性的试验,我们展示了MT评价的最佳做法如何加深对模型之间质量差异的理解。此外,我们确定了对MLLMs进行强有力元评价的必要组成部分,确保评价方法本身得到严格评估。我们将这些见解纳入一个可用于MLLM研发的行动建议清单。
Article 188
Title@2025-07-29 (2): A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
Title: A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models | Eine skalierbare Pipeline zur Schätzung von Verb Frame Frequenzen mit großen Sprachmodellen | 使用大语言模型估算 Verb 框架频谱的可缩放管道 2507.22187v1 |
Authors (2): Adam M. Morgan, Adeen Flinker
We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.
我们提出了一个用于估计 Verb 框架变量( VFFs) 的自动管道, 即动词在特定合成框架中出现的频率。 VFFs 在人文和机器语言系统中为语法提供了强大的窗口, 但是现有的计算工具在规模、 准确性或可访问性方面都有限。 我们使用大型语言模型( LLMs) 来生成包含476 个英语动词的一组句子。 其次, 通过指示一个LLM 能够像一个专业语言学家那样行事, 我们得到了它分析本句中句子的合成结构结构的精确性结构。 这个管道在多个评价数据集中都比两个广泛使用的合成分析器高。 此外, 它所需要的资源远远少于手动的分类( 金本标准), 从而使得能够快速、 可扩展的 VFF 估计。 我们用LM 模型制作了一个新的 VFF 数据库, 其覆盖范围更广, 精细的合成方法区分, 并明确估计了在心理语言中共同研究的结构性替代结构的相对频率。 管道可以轻松、 和扩展的频率 定义框架 , , 将 和 用于 新的 新的 Cenerview view view view view view view view view view view view vical view view view view view view view view view view view view view view view view view view view view view view 。
Article 189
Title@2025-07-29 (2): Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
Title: Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training | Positiv-Augmented Contrastive Learning für Vision-and-Language Evaluation und Training | 愿景和语言评价和培训的积极强化反竞争学习 2410.07336v2 |
Authors (5): Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
尽管在标题生成方面取得了显著进步,但现有的评价指标往往无法捕捉完整质量或细细细细细的字幕细节,这主要是由于它们依赖非特定的人写参考资料或紧张的培训前数据。尽管如此,找到有效的指标不仅对标题评估至关重要,而且对于生成阶段也至关重要。计量确实可以在字幕制作模型的微调阶段发挥关键作用,最终提高生成字幕的质量。在本文中,我们提议采用PAC-S++这一可学习的衡量标准,利用CLIP模型,在网络收集和清理数据方面经过预先培训,并通过更多成对的视觉和文字正面样本进行正规化。探讨这一更强有力和经过整理的预培训前的样本,我们还将PAC-S++作为自定义序列培训阶段的一种奖励,通常用于微调字幕模型的质量。关于不同图像和视频数据集的广泛实验,突出PAC-S++与用于本项任务的普通指标的实效,包括在网络收集的图像和文本上生成的精准性样本。此外,我们展示了在Sqrealimalalal-realalal 的模型中,我们展示了比以往更难得的模型/变校正的成绩。
Article 190
Title@2025-07-29 (2): Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
Title: Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles | Persona-Augmented Benchmarking: Bewertung von LLMs über unterschiedliche Schreibstile hinweg | 人 人 推 基准 定 : 评价各种写 作 风格 2507.22168v1 |
Authors (4): Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
目前用于评价大语言模型的基准往往没有表现出足够的写作风格多样性,其中许多人主要遵守标准化的公约,这些基准没有充分反映人类所展示的丰富多样的通信模式,因此,在面临“非标准”投入时,根据这些基准优化的LLM可能显示业绩差。在这项工作中,我们通过使用以人为基础的LLM促进的低成本方法重写评价提示来测试这一假设,这种方法可以模仿不同的写作风格。我们的结果表明,即使具有相同的语义内容,写作风格的变化和迅速格式化也会对所评价的LLM的估计业绩产生重大影响。值得注意的是,我们发现不同的写作风格在一系列模式和任务中始终触发低或高绩效,而不管其型号、大小和正确性如何。我们的工作为扩大现有基准提供了一种可扩展的方法,提高了它们为衡量不同语言的LM业绩所提供的评估的外部有效性。
Article 191
Title@2025-07-29 (2): Strategic Deflection: Defending LLMs from Logit Manipulation
Title: Strategic Deflection: Defending LLMs from Logit Manipulation | Strategische Durchbiegung: LLMs durch Logit-Manipulation verteidigen | 战略抵消:保护LLMs免受逻辑操纵 2507.22160v1 |
Authors (5): Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM’s response to such advanced attacks. Instead of outright refusal, the model produces an answer that is semantically adjacent to the user’s request yet strips away the harmful intent, thereby neutralizing the attacker’s harmful intent. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies, moving from simple refusal to strategic content redirection to neutralize advanced threats.
随着在关键领域越来越多地采用大语言模型(LLMs),确保他们的安全不受侵入性攻击是最重要的。传统防御主要依靠拒绝恶意煽动,而最近的逻辑级袭击则表明能够绕过这些保障,直接操纵代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代谢过程。我们引入了战略抑制(Sdeflection),这是重新定义LM对此类先进攻击的反应的辩护。该模型产生的答案不是完全拒绝,而是一种与用户的要求息息相近的答案,却将有害意图抹除,从而消除攻击者的有害意图。我们的实验表明,Sdeflection在保持良性查询的示范性表现的同时,大大降低了攻击成功率。 这项工作展示了防御战略的重大转变,从简单的拒绝转向战略内容的重新定向,以抵消先进威胁。
Article 192
Title@2025-07-29 (2): IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
Title: IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian | IndoPref: Ein multi-Domain-Pairwise-Preference-Datensatz für Indonesisch | IndoPref:印度尼西亚多域对等优惠数据集 2507.22159v1 |
Authors (4): Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata
Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff’s alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.
2亿多人会讲印度尼西亚语,但语言在大型语言模型(LLMs)的以优惠为基础的研究中仍然严重不足,现有的多语文数据集大多来自英文译文,其内容往往缺乏文化和语言真实性。为了弥补这一差距,我们引入了IndoPref,这是印度尼西亚首个专门为评价LLM产生的文本的自然性质和质量而设计的完全由人编写的多域印度尼西亚偏好数据集。所有说明都是印度尼西亚语,用Krippendorff的字母来进行本地撰写,并用Krippendorff的字母来评估,显示出强烈的跨咨询者协议。此外,我们为多个LLMs的数据集设定基准,并评估每个模型的产出质量。
Article 193
Title@2025-07-29 (2): The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?
Title: The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face? | Die Bedeutung von Gesichtsfunktionen bei der visionsbasierten Erkennung von Zeichensprachen: Augen, Mund oder Gesicht? | 面貌在基于愿景的手语识别中的重要性:眼、嘴还是脸? 2507.20884v2 |
Authors (2): Dinh Nam Pham, Eleftherios Avramidis
Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regionseyes, mouth, and full faceusing two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.
非人工面部特征在手语交流中发挥着关键作用,但在自动手语识别(ASLR)中的重要性仍未得到充分探讨。虽然先前的研究显示,包含面部特征可以提高识别度,但相关工作往往依赖手工艺特征提取,没有超越手工艺特征与手工艺特征和面部特征相结合的比较。在这项工作中,我们系统地调查不同面部区域眼、口腔和充分运用两种不同的深层学习模式(以CNN为基础的模型和以变压器为基础的模型)的贡献,这两种模式是受过随机选择班级的孤立标志的 SLR数据集培训的。我们通过定量性能和定性突出的地图评估,发现口部是最重要的非人工面部特征,显著提高了准确性。我们的调查结果突出表明了将面部特征纳入ASLR的必要性。
Article 194
Title@2025-07-29 (2): Prompt Optimization and Evaluation for LLM Automated Red Teaming
Title: Prompt Optimization and Evaluation for LLM Automated Red Teaming | Prompt Optimierung und Auswertung für LLM Automatisiertes Red Teaming | LLM自动红色小组迅速优化和评价 2507.22133v1 |
Authors (11): Michael Freenor, Lauren Alvarez, Milton Leal, Lily Smith, Joel Garrett, Yelyzaveta Husieva, Madeline Woodruff, Ryan Miller, Erich Kummerfeld, Rafael Medeiros, Sander Schulhoff
Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR) the sample mean calculated over the judgment of success for each attack. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack’s discoverability the expectation of the individual attack success. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.
使用大语言模型(LLMS)的应用正在变得日益普及,使得系统脆弱性的识别越来越重要。自动红队队通过使用LLM来加速这项努力,利用LLM来制造和实施对目标系统的袭击。攻击发电机用攻击成功率(ASR)评估,这是根据每次攻击成功率判断结果得出的样本平均值。在本论文中,我们引入了一种将ASR用于个别攻击的攻击发生频率优化的方法。通过对随机种子目标重复多次攻击,我们测量一次攻击的可发现性与个人攻击成功的期望。这个方法揭示了可加以利用的模式,为迅速优化提供了信息,最终使得能够对发电机进行更强有力的评估和改进。
Article 195
Title@2025-07-29 (2): SAKE: Steering Activations for Knowledge Editing
Title: SAKE: Steering Activations for Knowledge Editing | SAKE: Steuerung von Aktivierungen für die Wissensbearbeitung | 战略:知识编辑指导活动 2503.01751v2 |
Authors (4): Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki
As Large Langue Models have been shown to memorize real-world facts, the need to update this knowledge in a controlled and efficient manner arises. Designed with these constraints in mind, Knowledge Editing (KE) approaches propose to alter specific facts in pretrained models. However, they have been shown to suffer from several limitations, including their lack of contextual robustness and their failure to generalize to logical implications related to the fact. To overcome these issues, we propose SAKE, a steering activation method that models a fact to be edited as a distribution rather than a single prompt. Leveraging Optimal Transport, SAKE alters the LLM behavior over a whole fact-related distribution, defined as paraphrases and logical implications. Several numerical experiments demonstrate the effectiveness of this method: SAKE is thus able to perform more robust edits than its existing counterparts.
由于大型Langue模型已被显示为对现实世界事实的记忆,因此有必要以有控制和有效的方式更新这一知识。根据这些限制因素设计,知识编辑(KE)方法建议改变预先培训的模式中的具体事实。然而,它们受到若干限制,包括缺乏背景的稳健性,以及无法概括与事实有关的逻辑影响。为了克服这些问题,我们提议Sake,这是一种指导启动方法,它模拟一个事实,作为发行而不是单一的提示进行编辑。利用最佳运输,Sake将LLM行为改变为整个与事实有关的分配,被定义为副词和逻辑影响。若干数字实验证明了这种方法的有效性:Sake因此,Sake能够比现有的对应方进行更强有力的编辑。
Article 196
Title@2025-07-29 (2): UserBench: An Interactive Gym Environment for User-Centric Agents
Title: UserBench: An Interactive Gym Environment for User-Centric Agents | UserBench: Eine interaktive Gym-Umgebung für User-Centric-Agenten | 用户 Bench: 用户中心代理器的交互式 Gym 环境 2507.22034v1 |
Authors (12): Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
大型语言模型(LLMS)的代理商在推理和工具使用方面取得了令人印象深刻的进展,使他们能够解决复杂的任务。然而,他们与用户积极协作的能力,特别是在目标模糊、演变或间接表达的情况下,仍然没有得到充分利用。为了解决这一差距,我们引入了一个以用户为中心的基准,即用户Bench(用户Bench),该基准旨在评估多方向、偏好驱动的互动中的代理商。用户Bench特征模拟用户,这些用户以特定目标为起点,并逐渐显示偏好,要求代理商积极主动地澄清意向,用工具做出有根据的决定。我们对主要的开放和封闭源LMS的评估显示,任务完成和用户对齐之间有很大的脱节。例如,模型提供的答案与所有用户的意向完全一致,平均只有20%的时间,甚至最先进的模型通过积极互动发现不到所有用户偏好30%。这些结果突出了建筑代理商的挑战,这些代理商不仅仅是任务执行人,而且是真正的合作伙伴。用户Bench提供了一种互动的环境,以测量和推进这一关键能力。
Article 197
Title@2025-07-29 (2): FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Title: FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression | FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression | FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换 2505.23966v3 |
Authors (6): Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, which could complete the calibration within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
大型语言模型(LLMs)在自然语言处理方面取得了显著进展,但其高的计算和记忆需求对资源限制环境中的部署提出了挑战。尽管最近的低级分解方法为结构压缩提供了一条充满希望的道路,但它们往往受到精度退化、昂贵的校准程序的影响,并导致低效模型结构结构阻碍真实世界的推导速度。在本文中,我们提议FLAT-LLM是一种快速、准确、无培训的结构性压缩方法,其基础是在激活空间中精细的低级转换。具体地说,我们通过使用通过头部主构件分析计算出来的短精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精精
Article 198
Title@2025-07-29 (2): SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Title: SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers | SAND-Math: LLMs nutzen, um neuartige, schwierige und nützliche Mathematikfragen und -antworten zu generieren | SAND-Math:利用LLMs生成新创、困难和有用的数学问答 2507.20527v2 |
Authors (5): Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce \textbf{SAND-Math} (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new \textbf{Difficulty Hiking} step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by \textbf{$\uparrow$ 17.85 absolute points} on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38\% to 49.23\%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. SAND-Math dataset is released here: \href{https://huggingface.co/datasets/amd/SAND-MATH}{https://huggingface.co/datasets/amd/SAND-MATH}
对具有精密数学推理能力的大型语言模型(LLMS)的需求正在各行业之间增长。然而,由于缺少困难和新颖的培训数据,发展性能数学数学模型(LLMS)的工作严重受阻。我们引入了“合成增强新颖和困难数学问题和解决方案” ,这是一个解决该问题的管道,它首先从零开始产生高质量的问题,然后通过一个新的\ textbf{Difficultyhiking}步骤系统提升其复杂性。我们通过两个关键结论展示了我们的方法的有效性。首先,用SAND-Math数据提升一个强大的基准大大提升了业绩,在 AIME25 基准上比下一个最好的合成数据集表现得更好。第二,在一项专门的通货膨胀研究中,我们展示了我们困难的感应过程:通过增加平均问题难度,从5.02/5.98,这一步骤将AIME25的性能从46.38/%+49.23}。 全面生成的SMA-MA-S-S-ANDS-S-S-S-S-S-S-Appentalalalaldalalalal-dal-dalsetal-dalset 建立一个高效的管道、最终数据库和升级的模型和升级的模型和升级的模型和升级的模型。
Article 199
Title@2025-07-29 (2): Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
Title: Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models | Vorhersage mikrobieller Ontologie und Pathogenrisiken durch Umweltmetadaten mit großen Sprachmodellen | 预测具有大语言模型的环境元数据产生的微生物本体学和病原体风险和病原体风险 2507.21980v1 |
Authors (2): Hyunwoo Yoo, Gail L. Rosen
Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. Coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites and metadata distributions. These findings suggest that LLMs can effectively reason over sparse, heterogeneous biological metadata and offer a promising metadata-only approach for environmental microbiology and biosurveillance applications.
传统机器学习模式试图在只有元数据的微生物研究中,特别是在小型抽样环境中或以不同标签格式进行的跨类研究中,在只有元数据的情况下,在微生物研究中普遍推广传统机器学习模式。在这项工作中,我们探索使用大型语言模型(LLMs)将微生物样本分类为肿瘤类,如EMPO 3和相关生物标签,以及预测病原体污染风险,特别是E.Coli的存在,仅使用环境元数据即可。我们用零发和几发方式评估ChatGPT-4o、Claude 3.7 Sonnet、Grok-3和LalaMA 4等LLMMMS,将它们的性能与随机森林等传统模型在多个现实世界数据集中的性能进行比较。我们的结果显示,LLMs不仅超越了肿瘤分类方面的标准基线,而且还显示出对污染风险的强烈预测能力,对不同地点和元数据分布进行了概括。这些研究结果表明,LLLMs可以有效地解释稀多、混杂的生物元元元数据,并为环境微生物学和生物巡视应用提供有希望的元方法。
Article 200
Title@2025-07-29 (2): LIMO: Less is More for Reasoning
Title: LIMO: Less is More for Reasoning | LIMO: Weniger ist mehr für Vernunft | LIMO: 较少的理由更多 2502.03387v3 |
Authors (6): Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3\% accuracy on AIME24 and 95.6\% on MATH500, surpassing previous fine-tuned models (6.5\% on AIME24, 59.2\% on MATH500) while using only 1\% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8\% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model’s pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as “cognitive templates” that guide reasoning.
我们质疑一种普遍假设,即大型语言模型(LLMS)的复杂推理需要大量的培训数据。我们证明,精密的数学推理只能以几个例子出现。具体地说,通过简单的监督微调,我们的模型(LIMO)在AIME24和95.6的MATH500上实现了63.3精确度,超过了以前的微调模型(关于AIME24、59.2关于MATH500),同时只使用了先前方法所要求的1培训数据。此外,LIMO表现出在分布上过于概括化,在不同的基准中取得了45.8绝对的改进,超过了在100x以上数据方面受过培训的绩效模型。我们将这些结果结合起来,我们提出了“低I-More 解释假说”:在基础模型中,域知识在培训前已经全面编码,精密的推理可以通过最低但有战略设计的认知过程演示产生。这一假设表明,得出复杂推理的门槛不是由任务复杂性决定的,而是由两个关键因素决定的:(1) 模型的完备性,作为培训后推理学的模板。
Article 201
Title@2025-07-29 (2): Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation
Title: Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation | Kulinarische Kreuzungen: Ein RAG-Rahmen zur Verbesserung der Vielfalt in der kulturübergreifenden Rezeptanpassung | 烹饪十字路口:加强跨文化适应性适应多样性的RAG框架 2507.21934v1 |
Authors (5): Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich
In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish’s essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.
在跨文化食谱适应中,目标不仅在于确保文化适当性并保留原始菜的精髓,而且在于为各种饮食需要和偏好提供多种选择。检索增强型(RAG)是一种很有希望的方法,将从用于文化适应性的目标烹饪中检索真实食谱与大语言模型(LLMs)结合起来,但是仍然不清楚RAG是否能够产生不同的适应结果。我们的分析表明,RAG往往过度依赖不同世代之间有限的环境部分,即使提供不同的背景投入,也无法产生不同的产出。这揭示了RAG在创造性任务中存在一个关键局限性,并有多重有效的答案:它未能利用背景多样性来产生不同的反应。为了解决这一问题,我们建议CARRIAG,这是一个用于跨文化食谱适应的插座和播放式的RAG框架,可以增强检索和背景组织的多样性。据我们所知,这是第一个RAG框架,明确旨在产生高度多样化的产出,以适应多种用户的偏好。我们的实验表明,CARRIAGE在食谱适应与封闭式LMS相比,在多样性和质量上达到了帕雷托效率。
Article 202
Title@2025-07-29 (2): Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
Title: Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory | LLM-Autoscoring-Verlässlichkeit in großformatigen Schriftbeurteilungen unter Verwendung von Generalisierbarkeitstheorien erkunden | 利用通用理论探索利用通用理论进行大型书写评估时的可靠性 2507.19980v2 |
Authors (3): Dan Song, Won-Chan Lee, Hong Jiao
This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, which supports that hybrid scoring models may offer benefits for large-scale writing assessments.
这项研究调查了大语言模型(LLMs)在从AP中文语言和文化考试中评分写作任务时的可靠性估计。研究采用可概括性理论,评估和比较了在两种AP中文自由应答写任务(故事叙述和电子邮件回应)中人与AI评级员之间的得分一致性:故事叙事和电子邮件回应。这些论文由两名训练有素的人和七名AI评分员独立评分。每篇论文得分四分:一个整体得分和三个分析得分,与任务完成、交付和语言使用领域相对应。结果显示,虽然人类计分员总得分比较可靠,但LLMs在某些条件下表现出了合理的一致性,特别是在故事叙事任务方面。包含人和AI评分员的复合评分提高了可靠性,这支持混合评分模型可为大规模写作评估带来好处。
Article 203
Title@2025-07-29 (2): “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
Title: “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection | „Auf wessen Seite bist du?” Schätzung der Ideologie von Politik und Nachrichteninhalten mit großen Sprachmodellen und der Auswahl von Demonstrationsobjekten | “你站在谁一边?” 估计政治和新闻内容使用大语言模型和少见的示范选择的意识形态和新闻内容。 2503.20797v2 |
Authors (3): Muhammad Haroon, Magdalena Wojcieszak, Anshuman Chhabra
The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving demonstration selection in label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.
社交媒体平台的迅速增长引起了人们对激进化、过滤泡沫和内容偏向的关切。现有的意识形态分类方法有限,因为它们需要广泛的人力努力、大数据集标签、无法适应不断变化的意识形态背景。本文探讨了大语言模型(LLMs)通过通俗学习(ICL)在美国两党政治背景中对在线内容的政治意识形态进行分类的潜力。我们在由新闻文章和YouTube视频组成的三个数据集上进行的关于以标签平衡方式进行示范选择的广泛实验表明,我们的方法大大超过零光和传统监督方法。此外,我们评估了元数据(例如内容来源和描述)对意识形态分类的影响,并讨论了其影响。最后,我们展示了提供政治和非政治内容来源如何影响LM的分类。
Article 204
Title@2025-07-29 (2): Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
Title: Post-Training Large Language Models via Reinforcement Learning from Self-Feedback | Post-Training Große Sprachmodelle durch Stärkung Lernen aus Selbst-Feedback | 培训后通过 “ 自我学习 “ 强化学习大语言模式 2507.21931v1 |
Authors (5): Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić
Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates – restoring well-behaved calibration – and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrents further research in intrinsic rewards for LLM post-training.
大型语言模型(LLMS) 通常产生合理、但协调不力的答案,限制了其在推理密集型任务的可靠性。我们展示了“自我回馈强化学习”这一培训后阶段,该阶段将模型本身的信心用作内在的奖励,模仿人类在没有外部反馈的情况下如何学习。在冻结的LLM产生若干一系列思考解决方案后,我们定义并计算了每个最终答案的可信度,并据此对痕迹进行排序。这些合成偏好随后被用来微调该政策,使其符合标准偏好优化,类似于RLHF,但不需要人类标签、黄金答案或外部调节的奖励。RLSF同时(一) 完善模型的概率估计 – – 恢复良好管理校准 – 并(二) 加强逐步推理,提高算推理和多曲解的处理能力。通过将模型本身的不确定性转化为有用的自我反馈,RLSF确认将内在行为模型的学习作为LM后训练和战争后不断研究中的一项有原则和数据效率的部分。
Article 205
Title@2025-07-29 (2): CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation
Title: CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation | CHIMERA: Eine Wissensbasis für wissenschaftliche Ideen-Rekombinationen für Forschungsanalyse und -Ideation | CHIMERA: 研究分析和衰变科学理念重组知识库 2505.20779v4 |
Authors (2): Noy Sternlicht, Tom Hope
A hallmark of human innovation is recombination – the creation of novel ideas by integrating elements from existing concepts and mechanisms. In this work, we introduce CHIMERA, a large-scale Knowledge Base (KB) of over 28K recombination examples automatically mined from the scientific literature. CHIMERA enables large-scale empirical analysis of how scientists recombine concepts and draw inspiration from different areas, and enables training models that propose novel, cross-disciplinary research directions. To construct this KB, we define a new information extraction task: identifying recombination instances in scientific abstracts. We curate a high-quality, expert-annotated dataset and use it to fine-tune a large language model, which we apply to a broad corpus of AI papers. We showcase the utility of CHIMERA through two applications. First, we analyze patterns of recombination across AI subfields. Second, we train a scientific hypothesis generation model using the KB, showing that it can propose novel research directions that researchers rate as inspiring. We release our data and code at https://github.com/noy-sternlicht/CHIMERA-KB.
人类创新的一个标志是重组 – – 通过整合现有概念和机制的要素,创建了新颖思想。在这项工作中,我们引入了CHIMERA(CHIMERA),这是一个大型知识库(KB),拥有28K以上科学文献自动提取的重组实例。CHIMERA使得能够对科学家的再生概念和不同领域的灵感进行大规模的经验分析,并使培训模式能够提出新的、跨学科的研究方向。为了构建这个KB,我们定义了一个新的信息提取任务:在科学摘要中确定再融合实例。我们整理了一个高质量的、专家附加说明的数据集,并用它微调一个大语言模型,我们将其应用于广泛的AI文件。我们通过两种应用展示了CHIMERA的效用。首先,我们分析了跨AI子领域的再融合模式。第二,我们用KB来培训一个科学假说生成模型,表明它可以提出新的研究方向,研究人员将它评为鼓舞人心。我们在 https://github.com/noy-sternlich/CHIK-MARKB公布我们的数据和代码。
Article 206
Title@2025-07-29 (2): Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
Title: Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs | Rotes Lernen als nützlich betrachtet: Verallgemeinern über gemerkte Daten in LLMs | 认为轮试学习有用:在LLMs中普遍使用记忆数据 2507.21914v1 |
Authors (7): Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar
Rote learning is a memorization technique based on repetition. It is commonly believed to hinder generalization by encouraging verbatim memorization rather than deeper understanding. This insight holds for even learning factual knowledge that inevitably requires a certain degree of memorization. In this work, we demonstrate that LLMs can be trained to generalize from rote memorized data. We introduce a two-phase memorize-then-generalize framework, where the model first rote memorizes factual subject-object associations using a semantically meaningless token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the two. This surprising finding opens the door to both effective and efficient knowledge injection and possible risks of repurposing the memorized data for malicious usage.
旋转学习是一种基于重复的记忆技术。 一般认为它会通过鼓励逐字记忆而不是更深入的理解而阻碍一般化。 这种洞察力甚至有助于学习必然需要某种程度的记忆的实际知识。 在这项工作中,我们证明LLMs可以接受从腐烂的记忆数据中进行概括化的训练。 我们引入了一个两阶段的记忆-当时的普及框架, 模型首先用一个语义上毫无意义的象征物, 转录事实主题对象关联, 然后通过微调一小套具有语义意义的提示物来学习一般化。 8 LLMS 的广泛实验显示, 模型可以通过有结构的、 语义上一致的表达方式, 来重新解释具有象征意义的记忆数据。 这个惊人的发现打开了有效和高效的知识注入的大门, 以及重新将记忆中的数据用于恶意用途的可能风险。
Article 207
Title@2025-07-29 (2): SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs
Title: SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs | SmoothRot: Kombination von Kanal-Weiss-Skalierung und Rotation für Quantisierungsfreundliche LLMs | 平滑旋转: 将频道- Wise 缩放和旋转组合起来, 用于量化- 友好型LLMS 2506.05413v2 |
Authors (3): Patrik Czakó, Gábor Kertész, Sándor Szénási
We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers, by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30\% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
我们展示了“平滑”技术,这是提高大语言模型四位数量化效率的一种创新培训后量化技术。“平滑”通过将频道与哈达马德变换相结合,应对大规模激活外源器的关键挑战。我们的技术有效地将极端外源器转化为有利于量化的激活,大大提高了量化的准确性。在广受欢迎的LLMS(LLAMA2 7B、LLLAMA3.1 8B和Mistral 7B)上进行的实验表明,平滑始终将四分制和FP16模型之间的性能差距缩小约10-30,在语言生成和零点推理任务之间减少约10-30,而没有引入额外的推理延度。守则可在https://github.com/czakop/smoothro查阅。
Article 208
Title@2025-07-29 (2): SLR: Automated Synthesis for Scalable Logical Reasoning
Title: SLR: Automated Synthesis for Scalable Logical Reasoning | SLR: Automatisierte Synthese für skalierbare logische Vernunft | SLR: 用于可缩放逻辑理由的自动合成 2506.15787v3 |
Authors (9): Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, and Wolfgang Stammer Kristian Kersting
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
我们引入了SLR,这是通过可缩放逻辑解释系统评估和培训大语言模型的端到端框架。根据用户的任务规格,SLR自动合成了(一) 用于推理任务的指令迅速,(二) 验证程序,可在模型产出上执行,以提供可核查的奖赏,(三) 潜在的地面真相规则。这一过程完全自动化,可缩放,不需要人手说明,对任务困难提供精确的控制。使用SLR,我们创建SLR-Bench,这是一个由19个提示组成的基准,分为20个课程级别,逐步提高关系、算术和再现的复杂性。大规模评估显示,当代LLLMS随时能够产生综合有效的规则,但往往无法正确推理出逻辑。最近的推理,LLMS显示业绩有所改善,但测试时间计算得非常高,只需1 000个提示就超过300美元。最后,通过SLR的LRA-3-8B精准性课程学习。SLR-B,在SLR-B上实现与GEM-FLash-Lash-Lasimlash通用推理算法基础的等等,在普遍推算中,这些推算能力至GLS-LILS-LisLisLisLisLL。
Article 209
Title@2025-07-29 (2): Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning
Title: Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning | Graph-R1: Auf dem Weg zu einem agentischen GraphRAG-Framework durch durchgängiges Ausbau-Lernen | 图R1:通过端至端强化学习,迈向 “ 干点至端强化学习 “ 框架 2507.21892v1 |
Authors (11): Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan
Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.
通过吸收外部知识,并依赖缺乏结构语义学的块状检索,再获取新一代人(RAG)减轻了LMS的幻觉。GIARAG的方法通过将知识建模成实体关系图改进了RAG,但是在高建筑成本、固定的一次性检索以及依赖长文本推理和迅速设计方面仍然面临挑战。为了应对这些挑战,我们提议GIP-R1,一个通过端到端强化学习(RL)的代理GIGRAG框架。它引入了轻量知识高集构建、模型检索作为多转媒介-环境互动,并通过端到端奖励机制优化代理过程。关于标准RAG数据集的实验显示,在推理精度、检索效率和生成质量方面,GI和RL强化的RAG方法超越了传统的GRAG和RL强化RAG方法。
Article 210
Title@2025-07-29 (2): FrugalRAG: Learning to retrieve and reason for multi-hop QA
Title: FrugalRAG: Learning to retrieve and reason for multi-hop QA | FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA | FrugalRAG:学会检索和多呼QA的理由 2507.07634v2 |
Authors (4): Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
我们考虑了回答复杂问题的问题,因为有了大量结构化的文件资料库,我们考虑了回答复杂的问题的问题。事实上解决问题的方法是利用语言模型,这些语言模型(表面上)通过检索的文件检索和解释,直到该模型有足够的信息来找到答案。改进这一方法的尝试侧重于检索强化的生成(RAG)指标,例如准确性和回忆性,可以分为两类:(a)对大问题的回答(QA)数据集进行微调,增加思考链的痕迹;(b)利用基于RL的微调技术,这些技术依赖于问题文件的相关性信号。然而,检索搜索数量的效率是一个同样重要的衡量标准,但这一衡量标准得到的注意较少。在这项工作中,我们表明:(1) 与最近的文献中流行的说法相反,不需要进行大规模的微调来改进RAG的衡量标准。 具体地说,改进的提示性标准“ReAc”管道可以超越HotPA等基准的先进模型数目。 (2) 超额和基于RL的微调技术方法,用于在RAG的50%的搜索中,从我们进行适当的成本模型搜索中可以证明。
Article 211
Title@2025-07-29 (2): WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking
Title: WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking | WakenLLM: Bewertung des Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking | WakenLLLM:通过细微基准评估LLM公司的合理潜力和稳定性 2507.16199v3 |
Authors (10): Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu
Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) a verifiable problem that the model fails to solve, thus outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods only activate a small portion of LLMs’ reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
大型语言模型(LLMS)经常输出在推理任务中未知的标签,其中可能出现两种假想:(一) 输入样本真正无法核实,但模型无法理解原因;(二) 模型未能解决的可核查问题,因此无法解决,结果不明。我们将这些案例统称为模糊概念现象。目前的评估侧重于这些答案是否诚实,而不是分析LLM推理的限度。为了解决这个问题,我们引入了WakenLLLMM这个框架,这个框架量化了因模型缺乏能力而产生的未知产出部分,并评估了刺激能否将其转化为正确答案(可核实)或合理(不可核实)的答案。我们的方法更清楚地描绘了LLMM推理的局限性和各种数据集的纠正潜力。对六个LMS的全面实验表明,在不进行任何培训或参数修订的情况下,LLMS样本能够达到68.53%的精确度。我们的工作显示,目前的基线方法只能激发LMS推理潜力的一小部分,表明相当的未解释能力。我们的方法更清楚地展示了LMS推理的理论,从而加深了VLMS的深层推理。
Article 212
Title@2025-07-29 (2): FB-RAG: Improving RAG with Forward and Backward Lookup
Title: FB-RAG: Improving RAG with Forward and Backward Lookup | FB-RAG: Verbesserung der RAG durch Vorwärts- und Rückwärtsblick | FB-RAG:以前向和后向看改进RAG 2505.17206v2 |
Authors (4): Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu
Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a light-weight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without complex finetuning or Reinforcement Learning common in prior work. Across 9 datasets, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.
传统回溯-增强型一代(RAG)与复杂的询问斗争,这些询问缺乏强有力的信号,无法找到最相关的背景,迫使在选择缺少关键信息的小背景和混淆LLM的大背景之间权衡取舍。 为了解决这个问题,我们提议采用基于简单而有力的前瞻性战略的新的无培训框架,即前背式RAG(FB-RAG),这是一个基于简单而有力的前瞻性战略的新的无培训框架。FB-RAG使用轻量级LM,以偷窥后代,利用多个抽样产出的证据精确地确定最终、更强大的生成器最相关的环境。这在以往工作中没有复杂的微调或强化学习共同之处,就能改善业绩。在9个数据集中,FB-RAG始终提供强有力的成果。此外,由于强大的生成器更短、更集中,从而降低了延迟性,因此可以实现业绩增益。在EN.QA数据集上,FB-RAG将领先基线与超过48%的拉特率降低或实现8 %的绩效改进,同时减少10%的延迟度。我们的分析发现,即使有更精确的预测性地展示了更精确性效率,但最终的答案,但最终的指南却却却却却也无法改进了。
Article 213
Title@2025-07-29 (2): AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning
Title: AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning | AutoTIR: Autonome Tools Integriertes Reasoning durch Verstärkungslernen | AutoTIR:通过强化学习综合解释理由的自主工具 2507.21836v1 |
Authors (6): Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du
Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibits superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
大语言模型(LLMS)在通过以推理为导向的培训后强化后,发展成为强大的大理由模型(LRMs)。工具综合理由模型(TIR)通过纳入外部工具进一步扩大其能力,但现有方法往往依赖僵硬的、预先界定的工具使用模式,这种模式可能使核心语言能力降低;在人类适应性选择工具的能力的启发下,我们引入AutoTIR(AutoTIR),这是一个强化学习框架,使LLMS能够在推理过程中自主决定是否和哪一种工具可以援引,而不是采用静态的工具使用战略。AutoTIR(AIR)利用混合奖励机制,共同优化具体任务答案的正确性、结构化产出的坚持性和对不正确工具使用的惩罚性,从而鼓励精确的推理和高效的工具整合。各种知识密集型、数学和通用语言模型任务的广泛评价表明,AutoTIR在总体业绩上取得优异,大大超出基准,在工具使用行为上表现出超凡。这些结果突出表明了加强学习在LMS内建立真正普遍可计量和可计量的TIR能力的前景。
Article 214
Title@2025-07-29 (2): Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences
Title: Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences | Einführung von HALC: Eine allgemeine Pipeline für die Suche nach optimalen Promptenstrategien für die automatisierte Codierung mit LLMs in den Computational Social Sciences | 介绍HALC:寻找计算社会科学中与LLMs自动编码的最佳加速战略的一般管道 2507.21831v1 |
Authors (3): Andreas Reich, Claudia Thoms, Tobias Schrimpf
LLMs are seeing widespread use for task automation, including automated coding in the social sciences. However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. Often trial and error practices are still widespread. We propose HALC$-$a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. We test prompting strategies and LLM task performance based on few expert codings (ground truth). When compared to these expert codings, we find prompts that code reliably for single variables (${\alpha}$climate = .76; ${\alpha}$movement = .78) and across two variables (${\alpha}$climate = .71; ${\alpha}$movement = .74) using the LLM Mistral NeMo. Our prompting strategies are set up in a way that aligns the LLM to our codebook$-$we are not optimizing our codebook for LLM friendliness. Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model.
然而,尽管研究人员提出了不同的快速战略,但其效力也各不相同。通常,试验和错误做法仍然很普遍。我们建议为任何特定的编码任务和模式系统、可靠地建造最佳速度的普通管道,允许整合认为相关的任何提示战略。为了调查LLM编码和验证我们的管道,我们总共以超过200万个请求向当地LM公司发送了1 512个个人提示。我们测试了基于少数专家编码(地面真相)的快速战略和LLM任务绩效。与这些专家编码相比,我们发现为单一变量($HALFA}$气候=76)可靠代码的提示;$halpha}移动=78,以及两个变量($HALFA}=0.71;$alpha}流动=74。我们利用LM Mistral NeMo测试了快速战略和LLM任务绩效快速分析的模型。我们迅速制定的战略,而不是以最可靠的方式调整我们的LLM的代码。
Article 215
Title@2025-07-29 (2): EEG-CLIP : Learning EEG representations from natural language descriptions
Title: EEG-CLIP : Learning EEG representations from natural language descriptions | EEG-CLIP : Lernen von EEG-Darstellungen aus natürlichen Sprachbeschreibungen | EEG-CLIP:从自然语言说明中学习EEG代表 2503.16531v2 |
Authors (3): Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball
Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain matching images and their text captions and subsequently allowed to do successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip
深层电脑图解码网络通常仅经过培训才能解决病理学或年龄解码等一项具体任务。更一般性的任务不可知性办法是培训深层网络,将(临床)EEEG记录与其相应的文本医学报告相匹配,反之亦然。这种方法在计算机视野域中先行推出,匹配图像及其文本标题,然后允许使用文本类提示成功完成零发解码。在这项工作中,我们遵循这一方法,并开发了一个对比式学习框架EEEG-CLIP,将EEG时间序列和相应的临床文本描述与共同嵌入空间相匹配。我们研究了其多功能 EEEG解码的潜力,评估了几发和零发环境环境的性能。总体而言,我们显示EEG-CLIP管理着非边际的文本和 EEG表示式。我们的工作为学习通用的 EEG 演示提供了一种很有希望的方法,它能够通过零发解码或培训特定任务模型对不同的解码问题进行更方便的分析。我们可以利用的 MAG/AMIADM/ADLAB 用于较少的培训示例。
Article 216
Title@2025-07-29 (2): Modelling Adjectival Modification Effects on Semantic Plausibility
Title: Modelling Adjectival Modification Effects on Semantic Plausibility | Modellierung adjektiver Modifizierungseffekte auf die semantische Plausibilität | 模拟弹道改变对语义等高可变性的影响 2507.21828v1 |
Authors (3): Anna Golub, Beate Zywietz, Annerose Eichel
While the task of assessing the plausibility of events such as ‘‘news is relevant’’ has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility as triggered by event modification. Understanding changes in plausibility is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection as it allows to correctly model, for example, ‘‘gentle sarcasm’’ as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6] consisting of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments provide a conceptually novel method by using sentence transformers, and reveal that both they and transformer-based models struggle with the task at hand, and sentence transformers - despite their conceptual alignment with the task - even under-perform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken result trustworthiness.
虽然评估“新闻是相关的”等事件是否可信的任务已经通过越来越多的工作得到处理,但较少注意捕捉事件修改引发的可信程度的变化。理解可行性的变化对于诸如对话生成、常识推理和幻觉探测等任务来说是相关的,因为它能够正确地模拟,例如“gentle sarcasm”是朋友之间亲密而不是不友善的迹象[9]。在这项工作中,我们处理ADEPT挑战基准[6],其中包括16K英语句子对口,完全由一位弹道修饰者所不同。我们的模型实验通过使用变压器提供了一种概念上创新的方法,并揭示了它们和变压器模型与手头的任务和变压器——尽管在概念上与任务一致——甚至与RoBERTa等模型相比不甚完善。此外,与先前的工作进行深入比较,突出表明更现实、平衡的评价方法的重要性:不平衡的模型性能和评估度量度和削弱结果的可靠性。
Article 217
Title@2025-07-29 (2): HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs
Title: HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs | HRIPBench: Benchmarking von LLMs bei der Bereitstellung von Informationen zur Schadensreduzierung zur Unterstützung von Drogenkonsumenten | HRIPBENCH:在向吸毒者提供支助的减少危害信息提供中确定LLMs基准 2507.21815v1 |
Authors (5): Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao
Millions of individuals’ well-being are challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLM’s accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes, carry out severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.
减少危害是一项公共卫生战略,目的是改善他们的健康结果,减少安全风险。一些大型语言模型(LLMs)已经展示出适当的医疗知识水平,有望满足吸毒者的信息需求(PWUD),然而,他们在相关任务中的绩效在很大程度上仍未得到探讨。我们引入了HRIPBench,这是一个基准,旨在评估LLM在减少危害信息提供方面的准确性和安全风险。基准数据集 HRIP-Basus有2,160对问答证据。范围包括三项任务:检查安全界限,提供数量值,并推断多重物质使用风险。我们建立指令和RAG计划,以基于其固有知识和领域知识的整合来评价示范行为。我们的结果表明,目前最先进的LPBenchms仍然在努力提供准确的减少伤害信息,有时还会给PWUD带来严重的安全风险。在减少伤害背景下使用LLMs应该谨慎地加以限制,以避免产生负面的健康结果。WARNINING:本文载有可能诱发损害的非法内容。
Article 218
Title@2025-07-29 (2): Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish
Title: Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish | Übersicht über ADoBo bei IberLEF 2025: Automatische Erkennung von Anglizismen auf Spanisch | IberLEF 2025年IberLEF ADoBo ADoBo 概览:西班牙文自动检测 2507.21813v1 |
Authors (4): Elena Alvarez-Mellado, Jordi Porta-Zamorano, Constantine Lignos, Julio Gonzalo
This paper summarizes the main findings of ADoBo 2025, the shared task on anglicism identification in Spanish proposed in the context of IberLEF 2025. Participants of ADoBo 2025 were asked to detect English lexical borrowings (or anglicisms) from a collection of Spanish journalistic texts. Five teams submitted their solutions for the test phase. Proposed systems included LLMs, deep learning models, Transformer-based models and rule-based systems. The results range from F1 scores of 0.17 to 0.99, which showcases the variability in performance different systems can have for this task.
本文件总结了ADoBo 2025年的主要调查结果,这是在IberLEF 2025年背景下提出的用西班牙文识别古生物的共同任务,要求ADoBo 2025年的参与者从西班牙新闻文本汇编中发现英国的词汇借款(或假象),五个小组提交了测试阶段的解决办法,提议的系统包括LLMS、深学习模型、基于变异器的模型和基于规则的系统,结果从F1分0.17到0.99不等,显示不同系统业绩的变异性可以用于这项任务。
Article 219
Title@2025-07-29 (2): ChartMark: A Structured Grammar for Chart Annotation
Title: ChartMark: A Structured Grammar for Chart Annotation | ChartMark: Eine strukturierte Grammatik für Chart-Annotation | 图表 Mark: 用于图表注释的结构性语法 2507.21810v1 |
Authors (7): Yiyu Chen, Yifan Wu, Shuyu Shen, Yupeng Xie, Leixian Shen, Hui Xiong, Yuyu Luo
Chart annotations enhance visualization accessibility but suffer from fragmented, non-standardized representations that limit cross-platform reuse. We propose ChartMark, a structured grammar that separates annotation semantics from visualization implementations. ChartMark features a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context), supporting both abstract intents and precise visual details. Our toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, highlighting its flexibility, expressiveness, and practical applicability.
图表说明提高了可视化的可视性,但存在限制跨平台再利用的支离破碎、非标准化的表征。我们提出了ChartMark,这是一个结构化的语法,将注解语义与可视化实施区分开来。ChartMark具有向注解维度(例如任务、图表背景)绘制的等级框架,支持抽象意图和准确的直观细节。我们的工具包显示将图 Mark规格转换成Vega-Lite可视化,突出显示其灵活性、表现性和实用性。
Article 220
Title@2025-07-29 (2): Task Arithmetic for Language Expansion in Speech Translation
Title: Task Arithmetic for Language Expansion in Speech Translation | Aufgabe Arithmetik für Spracherweiterung in der Sprachübersetzung | 语音翻译中语言扩展任务 2409.11274v3 |
Authors (7): Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe
Recent progress in large language models (LLMs) has gained interest in speech-text multimodal foundation models, achieving strong performance on instruction-tuned speech translation (ST). However, expanding language pairs is costly due to re-training on combined new and previous datasets. To address this, we aim to build a one-to-many ST system from existing one-to-one ST systems using task arithmetic without re-training. Direct application of task arithmetic in ST leads to language confusion; therefore, we introduce an augmented task arithmetic method incorporating a language control model to ensure correct target language generation. Our experiments on MuST-C and CoVoST-2 show BLEU score improvements of up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83. In addition, we demonstrate our framework can extend to language pairs lacking paired ST training data or pre-trained ST models by synthesizing ST models based on existing machine translation (MT) and ST models via task analogies.
在大型语文模式(LLMS)方面最近取得的进展引起了人们对语言-文字多式联运基础模型的兴趣,在经指导的语音翻译(ST)方面表现良好。然而,由于在新的和以前的合并数据集方面进行再培训,扩大对口语文是昂贵的。为了解决这个问题,我们的目标是利用现有的一对一的ST系统,使用不经过再培训的任务算术,从现有的一对一的ST系统建立一个一对一的ST系统。在ST直接应用任务算术会导致语言混乱;因此,我们引入了一种扩大的任务算术方法,其中包括一种语言控制模型,以确保正确生成目标语言。我们在 MuST-C 和 CoVoST-2上进行的实验显示,BLEU的得分提高达4.66和4.92,而知识与技术交流的得益为8.87和11.83。此外,我们通过任务类比将基于现有机器翻译(MT)的ST模型和ST模型合成ST模型。
Article 221
Title@2025-07-29 (2): The Problem with Safety Classification is not just the Models
Title: The Problem with Safety Classification is not just the Models | Das Problem der Sicherheitsklassifizierung sind nicht nur die Modelle | 安全分类问题不仅仅是模型 2507.21782v1 |
Authors (1): Sowmya Vajjala
Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We expect that these findings will contribute to the discussion on developing better methods to identify harmful content in LLM inputs across languages.
建立安全分类模型或防护模型,这些模型是LLMS投入/产出安全分类的精细模型,被视为解决这一问题的解决方案之一。尽管对LLMS本身的安全测试进行了大量研究,但很少研究评价这种安全分类器或用于测试这些模型的评价数据集的有效性,特别是在多语种情况下。在本立场文件中,我们通过考虑涵盖18种语言的数据集,表明5种安全分类模型存在多语言差异。与此同时,我们查明评价数据集的潜在问题,认为目前的安全分类器的缺点不仅仅是因为模型本身。我们期望这些研究结果将有助于讨论制定更好的方法,查明LLMM在各种语言中投入的有害内容。
Article 222
Title@2025-07-29 (2): Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages
Title: Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages | Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen | 能够捕捉不同语言语言的特定语言概念的简单自定义者 2507.11230v2 |
Authors (6): Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features
了解大型语言模型(LLMS)的多语言机制可以深入了解它们是如何处理不同语言的,然而,这仍然具有挑战性。现有的研究往往侧重于单个神经元,但其多语种性质使得难以将特定语言单位与跨语言代表隔离开来。为了解决这个问题,我们探索了稀少的自动校考员(SAEs),以使他们有能力学习代表不同语言具体和抽象概念的单语种特征。虽然其中一些特征是语言独立的,但特定语言特征的存在仍未得到充分探讨。在这项工作中,我们引入了基于特征激活概率的SAE-LAPE方法,即基于特征激活概率的SAE-LAPE,以识别进料前网络中特定语言特征。我们发现许多这类特征主要出现在模型的中间至最后层,是可以解释的。这些特征影响模型的多语种性能和语言输出,并可用于语言识别与可与快读性相比的功能和解释性能。我们的代码可在https://github.com/Lysanderandoryandrylie/laugal-fecal-fetatatatures上查阅。
Article 223
Title@2025-07-29 (2): AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models
Title: AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models | AgriEval: Ein umfassender chinesischer Landwirtschafts-Benchmark für große Sprachmodelle | 农业:中国大语言模式农业综合基准 2507.21773v1 |
Authors (8): Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, Jingchi Jiang
In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.
在农业领域,大型语言模型(LLMs)的部署因缺乏培训数据和评价基准而受阻。为缓解这一问题,我们提议AgriEval,这是中国第一个综合农业基准,有三大特点:(1) 综合能力评估;AgriEval涵盖农业的六个主要农业类别和29个亚类,涉及四个核心认知情景:记忆、理解、推断和生成。(2) 高品质数据。数据集由大学一级的考试和任务整理,为评估LLMs应用知识和作出类似专家决定的能力提供一个自然和强有力的基准。(3) 多样化格式和广泛规模。AgriEval包括14,697个多选择问题和2,167个开放式问答问题,将其作为迄今最广泛的农业基准。我们还介绍了51个开放源和商业LMs的综合实验结果。实验结果显示,大多数现有的LMs为达到60%的精确度而斗争,强调了农业LMs的发展潜力。此外,我们进行了广泛的实验,以调查影响模型性表现的因素,并提出加强战略。AGA/AGIA。
Article 224
Title@2025-07-29 (2): Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Title: Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal | Adversariale Verteidigung ohne Adversariale Verteidigung: Verbesserung der Sprachmodell Robustheit über Instanz-Ebene Hauptkomponentenentfernung | 无反向辩护的反向辩护,无反向辩护:通过一审一级主要组成部分删除,加强语言模式的强力 2507.21750v1 |
Authors (6): Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
预先培训的语言模式(PLMs)在自然语言处理方面取得了长足进展,但仍然容易受到对抗性攻击,使人们对其在现实世界应用中的强健性感到担忧。以前的研究试图通过在培训过程中隐含或明确引入对抗性干扰来减轻对抗性攻击的影响。虽然这两种战略都增强了强健性,但往往产生很高的计算成本。在这项工作中,我们提出了一个简单而有效的附加模块,通过删除实例一级的主要组成部分,而不依赖常规的对抗性防御或干扰原始培训数据,加强PLMs的对抗性强健性。我们的方法将嵌入空间转化为近似高斯特性,从而减少其受对抗性攻击性侵的影响,同时保持语义关系。这种转变将分配方式与尽量减少对抗性噪音对决定边界的影响,加强强健性,而不需要对抗性实例或昂贵的培训时间增强。对八个基准数据集的评价表明,我们的方法在保持攻击前的准确性与基线的可比性的同时,提高了对抗性强性强性,同时实现了稳健性和一般之间的平衡贸易。
Article 225
Title@2025-07-29 (2): Image Captioning via Compact Bidirectional Architecture
Title: Image Captioning via Compact Bidirectional Architecture | Bildunterschrift über kompakte bidirektionale Architektur | 通过契约双向双向建筑进行图像描述 2201.01984v2 |
Authors (7): Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang
Most current image captioning models typically generate captions from left-to-right. This unidirectional property makes them can only leverage past context but not future context. Though refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions in the first stage, the decoder of these models generally consists of two networks~(i.e. a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed parallelly. Specifically, it is implemented by tightly coupling left-to-right(L2R) and right-to-left(R2L) flows into a single compact model to serve as a regularization for implicitly exploiting bidirectional context and optionally allowing explicit interaction of the bidirectional flows, while the final caption is chosen from either L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining with word-level ensemble seamlessly, the effect of sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to the two-flows version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
多数当前图像字幕模型通常产生左对右的字幕。 这种单向属性使它们只能利用过去的背景而不是未来的背景。 虽然基于精细的模型可以利用过去和未来的背景, 在第二阶段产生一个新的标题, 其基础是在第一阶段以预检索或预生成的字幕为基础, 在第二阶段产生一个新的标题, 这些模型的解码器通常由两个网络组成 ~( 即, 第一阶段的检索器或字幕, 第二阶段的字幕) , 只能按顺序执行。 在本文中, 我们引入了一个契约双向双向变换模型, 用于图像字幕说明, 既可以以隐含和明确的方式利用双向的双向变换。 具体地说, 这些模式的解码在第二阶段产生一个新的标题( L2R) 和右向左转( R2L) , 以隐含双向双向的双向背景, 最终标题从 L2R2 或 R2L 将双向双向的双向变换 , 其最终的自我变换 , 将一个更深级的图像- IM- 工具级的游戏级的游戏- 结构 , 将一个更深层的变为我们进入一个更高级的版本。
Article 226
Title@2025-07-29 (2): My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt
Title: My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt | Mein Leben in Künstlicher Intelligenz: Menschen, Anekdoten und einige Lektionen gelernt | 我在人工智能中的生活:人、流浪者、以及一些经验教训 2504.04142v2 |
Authors (1): Kees van Deemter
In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.
我讲述了我40年来在人工智能(AI)及其周围(更具体地说是自然语言处理)的研究和教育家和教育工作者的经验。 我描述了好奇心和当时的情况如何导致我在工业和学术界以及包括荷兰(阿姆斯特丹、艾因多芬和乌得勒支)、美国(斯坦福德)、英格兰(布莱顿)、苏格兰(阿伯丁)和中国(北京和哈宾)等不同国家工作。 人和阿密多斯在我的故事中扮演了重要角色;AI的历史形成了它的背景。 我着重讲述了可能令年轻同事感兴趣的事情(甚至),因为当AI最终走出阴影的时候,他们在自己的工作和生活中面临着选择。
Article 227
Title@2025-07-29 (2): Technical Report of TeleChat2, TeleChat2.5 and T1
Title: Technical Report of TeleChat2, TeleChat2.5 and T1 | Technischer Bericht von TeleChat2, TeleChat2.5 und T1 | TeleChat2、TeleChat2.5和T1技术报告 2507.18013v3 |
Authors (38): Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
We introduce the latest series of TeleChat models: \textbf{TeleChat2}, \textbf{TeleChat2.5}, and \textbf{T1}, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with \textbf{TeleChat2}, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. \textbf{TeleChat2.5} and \textbf{T1} expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The \textbf{T1} variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, \textbf{TeleChat2.5} prioritizes speed, delivering rapid inference. Both flagship models of \textbf{T1} and \textbf{TeleChat2.5} are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, \textbf{T1-115B} outperform proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release \textbf{TeleChat2}, \textbf{TeleChat2.5} and \textbf{T1}, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.
我们推出最新系列的TeleChat 模式 :\ textbf{ TeleFhat2},\ textbf{ TeleChat2.5},\ textbf{ TeleC2.5} 和\ textbf{T1}, 提供了对其前身TeleC的大幅升级。 尽管对模式架构的修改很小, 新系列通过在培训前和培训后两个阶段的强化培训战略取得了巨大的绩效。 该系列从\ textbf{ TeleC2} 开始, 以10万个高品质和多种标识进行预培训。 之后是Surviced FinalT( SSFT) 和直接Preport Ofer Ofer Appimation( 支持长链- t- t) 高级模型, 以及将 G- flotf 数据数据集与强化学习( RL) 来提高代码生成和数学推理的性能。
Article 228
Title@2025-07-29 (2): UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Title: UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases | UnsafeChain: Verbesserung der Modellsicherheit über Hard Cases | 不安全Chain:通过困难案件加强说明理由的示范安全 2507.21652v1 |
Authors (3): Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
由于大型推理模型(LRMs)的能力越来越强,思维链(CoT)推理带来了新的安全挑战。基于SFT的现有安全协调研究主要侧重于以安全、高质量的回应方式过滤提示,而忽略总是产生有害产出的硬性提示。为了填补这一空白,我们引入了不安全Chain,这是一个安全协调数据集,该数据集由来自不同来源的硬性提示和不同来源建立,其中查明了不安全的完成情况,并明确更正为安全反应。通过将模型暴露为不安全行为并指导其纠正,Anse Chain在维护一般推理能力的同时加强了安全。我们在不安全Cain上对三个LRMs进行了微调,并将其与最近的Safechain和STAR-1在六个分配外和五个分配基准上进行了比较。不安全Chain一贯地超越先前的数据集,甚至1K组匹配或超过基线性,表明基于纠正的监督的有效性和可普遍性。我们在https://github.com/mbzuai-nlp/UnsafeCHain上公布了我们的数据设置和代码。
Article 229
Title@2025-07-29 (2): Libra: Assessing and Improving Reward Model by Learning to Think
Title: Libra: Assessing and Improving Reward Model by Learning to Think | Waage: Bewertung und Verbesserung des Prämienmodells durch Lernen zu denken | 利布拉:通过学习思考来评估和改进奖励模式 2507.21645v1 |
Authors (8): Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answer to attain rewards; and 2) the requirement for constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted and the experimental results demonstrate the correlation between our Libra Bench and downstream application, and the potential of Libra-RM to further improve reasoning models with unlabeled data.
强化学习(RL)极大地提高了大型语言模型的推理能力,然而,目前的奖赏模式在具有挑战性的推理假设和占主导地位的RL培训模式方面表现不佳,依赖于基于规则的或基于参考的奖赏,这些奖赏模式有两大限制:(1) 依赖附带注释的精细参考回答来获得奖赏;(2) 要求有限制的产出格式。这些限制从根本上阻碍了REL数据的进一步扩展和示范推理业绩的持续提高。为克服这些限制,我们提出了一个综合框架,用于在复杂的推理假设中评价和改进奖赏模型的绩效。我们首先提出了一个以推理为导向的基准(Libra Bench),该基准由各种具有挑战性的数学问题和高级推理模型系统构建,以解决推理假设中现有奖赏模型基准的局限性。我们进一步引入了一种新的方法,即通过从学习到思维的方法来改进归正奖励模式。我们开发了利布拉-RM系列,这是一套具有推理能力的、在各种基准上取得最新结果的归正奖赏模型。我们进行了全面的下游试验,实验结果进一步表明我们的图书馆座座座座座座座和下游应用与LIbraRM的推理的潜力。
Article 230
Title@2025-07-29 (2): Probing then Editing Response Personality of Large Language Models
Title: Probing then Editing Response Personality of Large Language Models | Probing dann Editing Response Persönlichkeit von großen Sprachmodellen | 检验后编辑大语言模型的个性反应 2504.10227v2 |
Authors (10): Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that simulate consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in simulating personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly simulate personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.
大型语言模型(LLMS)已经展示出极好的应对能力,以模拟一致的个性特征。尽管通过基于产出的评价对个性表现进行了重大分析,但对于这些特征如何在LLM参数内部编码却知之甚少。在本文中,我们引入了一个分层的探索框架,以系统调查LLMS在模拟个性响应时的分层能力。我们对11个开放源LMS在个人性Edit基准方面进行了测试实验,发现LLMS主要模拟个性在中层和上层作出反应,而经指导的模型显示个性特征的区分略为明确。此外,通过将经过训练的超高机率仪作为每个个性类别的分层边界加以解释,我们提出了一种分层的扰动方法,以编辑LLMMS在推断中表达的个性的能力。我们的结果显示,即使及时明确指明一个特定的个性,我们的方法仍然能够成功地改变LMMS的个性。有趣的是,某些个性特征之间的转换困难很大,这与我们进行模拟的分流/分级实验的距离距离相当,这与我们进行模拟的分级实验的分级试验时的距离相当的距离是我们在进行一般的平级分析时空分析,我们进行一般的平级的平级分析时空分析。最后,我们进行一个可接受的计算的方法是,我们进行一个可接受的计算。我们进行普通的计算。我们进行普通的计算。我们进行普通的计算的方法是用来进行普通的计算。
Article 231
Title@2025-07-29 (2): Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search
Title: Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search | Stratege: Selbstverbesserung der LLM-Entscheidungsfindung über die Bi-Level-Baumsuche | 战略:通过双层树木搜索自我改善LLM决策 2408.10635v3 |
Authors (8): Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu
Traditional reinforcement learning and planning typically requires vast amounts of data and training to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with tasks that require detailed planning and decision-making in complex action spaces. We introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to search and update high-level strategies (as text), which are then refined and executed by low-level Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework to optimize the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill acquisition techniques, pre-existing LLM agents across both game environments and achieves comparable performance against human players.
传统强化学习和规划通常需要大量的数据和培训才能制定有效的政策,相比之下,大型语言模型(LLMS)具有很强的通用和零射能力,但与需要在复杂行动空间进行详细规划和决策的任务抗争。我们引入了SSTATEGIST,这是将两种方法的优势结合起来的一种新颖办法。我们的方法利用LLMS搜索和更新高层次战略(作为文本),这些战略随后由低层次的蒙特卡洛树搜索(MCTS)加以完善和执行。STATEGIST是一个一般化的框架,通过基于人口的自我游戏模拟来优化战略,而无需任何培训数据。我们展示了STATEGIST在学习竞争性、多转盘游戏的最佳战略方面的有效性,包括普里战略游戏(GOPS)和多媒介、隐性的讨论游戏,如抵抗:阿瓦隆。我们的结果显示,配备STATEGISTS的代理器超越了那些经过传统RL方法、其他基于LM的技能获取技术的技术技能技术、在游戏环境中的前LM代理者,并取得与人类玩家相似的业绩。
Article 232
Title@2025-07-29 (2): Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Title: Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Latent Adversarial Training verbessert Robustheit für persistente schädliche Verhalten in LLMs | 长效对长效有害行为培训能提高长效LMM中持久性有害行为的积极性 2407.15549v3 |
Authors (11): Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of ‘jailbreaking’ techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
大型语言模型( LLMs) 通常可以让大型语言模型( LLM ) 以不可取的方式行事,而这种方式是它们明确调整不适应的。 例如, LLM 红队文学已经产生了各种各样的“ 侵入性” 技术,从经过微调无害的模型中引出有害文字。 最近关于红队、 模版编辑和可解释性的工作表明,这一挑战源于( 对抗性) 微调如何主要用来抑制而不是消除LLMs的不良能力。 先前的工作已经引入了潜在的对抗性训练( LAT) , 以此来提高各种失败的强度。 这些以前的工作已经考虑了非目标性的潜在空间行为, 敌人潜伏性地激活了这些“ 侵入性” 技术, 以最大限度地减少可取性行为的例子。 目标性LAT 能够提供一种通用的强健性, 但不利用特定失败模式的信息。 在这里, 我们试验定向的LAT 能够扩大各种最不可取的方法。 首先, 我们使用定向的LAT 来改进监狱破损的稳性, , 超越目标性地启动一个更强的 R2 基线, 最后我们用一个更有效的R2D 基准级的排序, 我们用一个更有害的基线, 将它去一个更有效的方式去一个更精确的顺序 。
Article 233
Title@2025-07-29 (2): Multilingual JobBERT for Cross-Lingual Job Title Matching
Title: Multilingual JobBERT for Cross-Lingual Job Title Matching | Mehrsprachiger JobBERT für Cross-Lingual Job Title Matching | 跨语言工作职称匹配多语言工作BERT 2507.21609v1 |
Authors (3): Jens-Joris Decorte, Matthias De Lange, Jeroen Van Hautte
We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: https://huggingface.co/TechWolf/JobBERT-v3.
我们引入了基于学习的跨语言职称对比模式。基于最先进的单一语言职称匹配模式,我们的方法通过利用合成翻译和2 100多万个职称的均衡多语种数据集,向英文、德文、西班牙文和中文提供支持。该模式保留了其前身以效率为重点的架构,同时使各语文之间无需特定任务监督就能进行强有力的统一。对2025年才智CLEF基准的广泛评价表明,该模式超越了强大的多语言基线,实现了单一语言和跨语言环境的一致业绩。我们虽然不是主要重点,但我们也表明该模式可以有效地用于给特定职称的相关技能定级,表明其在多语言劳动力市场情报中的更广泛适用性。该模型可公开查阅:https://ggingface.co/TechWolf/JobERT-v3。
Article 234
Title@2025-07-29 (2): Pralekha: Cross-Lingual Document Alignment for Indic Languages
Title: Pralekha: Cross-Lingual Document Alignment for Indic Languages | Pralekha: Cross-Lingual Document Alignment für indische Sprachen | Pralekha:印度语交叉语言文档协调 2411.19096v2 |
Authors (5): Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre
Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Most approaches rely on metadata such as URLs, which is often unavailable in low-resource language settings, while others represent documents using pooled sentence embeddings, which fail to capture fine-grained alignment cues. Moreover, current sentence embedding models have limited context windows, hindering their ability to represent document-level information effectively. To address these challenges for Indic languages, we introduce PRALEKHA, a large-scale benchmark for evaluating document-level alignment techniques. It contains over 3 million aligned document pairs across 11 Indic languages and English, of which 1.5 million are English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based approaches, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that DAC achieves substantial improvements over pooling-based baselines, particularly in noisy scenarios. Extrinsic evaluation further demonstrates that document MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The PRALEKHA dataset and CLDA evaluation framework will be made publicly available.
用于文件级机器翻译(MT)的采矿平行文档配对依然具有挑战性,因为现有的跨语言文档协调(CLDA)技术存在局限性。大多数方法依赖诸如URL等元数据,而URL往往在低资源语言环境中无法使用,而其他方法则代表使用集合判决嵌入文件的文件,这些嵌入未能捕捉细微细的校准提示。此外,目前嵌入模型的背景窗口有限,妨碍了它们有效代表文件级信息的能力。为了应对印第列语言的这些挑战,我们引入了PRALEKHA,这是评价文件级校准技术的大规模基准。它包含11种英、英两种语言的300多万对匹配文件配对,其中150万对是英英英双配。此外,我们提议文件协调系数(DAC),这是用于细加校准文件校准的新型标准。发援会与小块比对文件的相似性比,一对成对大,我们将采用双对式评估,显示发援会在联合的基建基文件基线上取得了重大改进,特别是在高要求的DRBA基线上。
Article 235
Title@2025-07-29 (2): A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models
Title: A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models | Eine detaillierte Faktorenanalyse für den politischen Kompasstest: Navigieren von Ideologien großer Sprachmodelle | 《政治指南测试的详细要素分析:掌握大语言模式的特征》 2506.22493v2 |
Authors (7): Sadia Kamal, Lalu Prasad Yadav Prakash, S M Rafiuddin, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen, Sagnik Ray Choudhury
Political Compass Test (PCT) or similar questionnaires have been used to quantify LLM’s political leanings. Building on a recent line of work that examines the validity of PCT tests, we demonstrate that variation in standard generation parameters does not significantly impact the models’ PCT scores. However, external factors such as prompt variations and fine-tuning individually and in combination affect the same. Finally, we demonstrate that when models are fine-tuned on text datasets with higher political content than others, the PCT scores are not differentially affected. This calls for a thorough investigation into the validity of PCT and similar tests, as well as the mechanism by which political leanings are encoded in LLMs.
利用政治指南测试或类似的问卷来量化LLM的政治倾向。根据最近审查PCT测试有效性的工作方针,我们证明标准生成参数的变化不会对模型的PCT分数产生重大影响,但是,迅速变异和个别微调等外部因素和组合影响相同。最后,我们证明,当模型对政治内容高于其他内容的文本数据集进行微调时,PCT分数不会受到不同影响。这要求彻底调查PCT和类似测试的有效性,以及将政治倾斜纳入LMS的机制。
Article 236
Title@2025-07-29 (2): AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Title: AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | AIM: Adaptive Schlussfolgerung von Multi-Modal LLMs über Token Merging und Pruning | AIM:通过 Token 兼并和预留的多模式LMs的适应性推理 2412.03248v2 |
Authors (4): Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a $\textbf{7-fold}$ reduction in FLOPs) while preserving the performance of video and image LLMs. Further, at a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., $\textbf{+4.6}$ on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs. Our code is available at https://github.com/LaVi-Lab/AIM.
大型语言模型(LLMS)使得能够创建多模式的LLMs,这些模型能够对图像和视频等视觉数据表现出强烈的理解;然而,这些模型通常依赖视觉编码器的广泛视觉符号,导致大量计算需求,从而限制其在资源限制环境和长期图像任务中的适用性;在这项工作中,我们建议为多模式LMs制定一种无需培训的适应性推论方法,该方法能够满足一系列广泛的效率要求,而最低性能下降。我们的方法包括基于在LLMS之前嵌入类似数据的迭代象征性合并;以及(b)基于多模式重要性的LLM层内渐进式象征性标语。如果采用最低限度的设计,我们的方法可以适用于视频和图像LMMs。 关于多种视频和图像基准的广泛实验表明,我们的方法可以大幅降低计算负荷(例如,一个$\textb{7-xxxxxxxxxxxxxxxxxxxxxxxxxxxlmmmmmmmmmmmmmmmmmmmmmmmmmmmms),同时保留视频和图像LMMMMMMMMMMsmrus-modal_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLI/LLLLLLLLL
Article 237
Title@2025-07-29 (2): Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers
Title: Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers | Bewertung der kognitiven Realität der spanischen unregelmäßigen morphomischen Muster: Menschen vs. Transformers | 评估西班牙非正常染色体模式的认知现实:人类与变异体 2507.21556v1 |
Authors (3): Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney
This study investigates the cognitive plausibility of the Spanish irregular morphomic pattern by directly comparing transformer-based neural networks to human behavioral data from \citet{Nevins2015TheRA}. Using the same analytical framework as the original human study, we evaluate whether transformer models can replicate human-like sensitivity to a complex linguistic phenomena, the morphome, under controlled input conditions. Our experiments focus on three frequency conditions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns. While the models outperformed humans in stem and suffix accuracy, a clear divergence emerged in response preferences. Unlike humans, who consistently favored natural responses across all test items, models’ preferred irregular responses and were influenced by the proportion of irregular verbs in their training data. Additionally, models trained on the natural and low-frequency distributions, but not the high-frequency distribution, were sensitive to the phonological similarity between test items and real Spanish L-shaped verbs.
这项研究通过直接将基于变压器的神经网络与来自\citet{Nevins2015TheRA} 的人类行为数据进行对比,调查西班牙非正常光谱模式的认知可行性。使用与原始人类研究相同的分析框架,我们评估变压器模型能否在受控输入条件下复制对复杂的语言现象即变异体的类似敏感度。我们的实验侧重于三种频率条件:自然、低频和显示非正常光谱模式的动词的高频分布。虽然模型在干燥和后缀精确度方面比人类表现得要好,但在反应偏好方面却出现了明显的差异。不像人类一样,他们一贯倾向于在所有测试项目中作出自然反应,模型偏好非正常反应,并且受到其培训数据中非正常动词比例的影响。此外,关于自然和低频率分布的模型,而不是高频分布,对测试项目与真实的西班牙L型动词之间的声相相似性非常敏感。
Article 238
Title@2025-07-29 (2): C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
Title: C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning | C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung | C2-Evo:共同演进的多模式数据和自我改进理由模型 2507.16518v2 |
Authors (12): Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang
Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
多式联运大型语言模型(MLLM)的近期进展显示了令人印象深刻的推理能力,然而,进一步加强现有的MLLMS需要高质量的愿景语言数据集,并需要仔细制定复杂的任务,这些复杂的任务既昂贵又具有规模挑战性。尽管最近自我改进的自我改进模型提供了可行的解决办法,但它们仍面临两个核心挑战:(一) 多数现有方法将视觉数据或文字数据分开,导致数据复杂性的差异(例如,过于简化的图表与多余的文字描述相配);(二) 数据和模型的演变也分离,导致模型暴露于不匹配的困难程度的任务的假设情景。为了解决这些问题,我们建议C2-Evo,一个自动、封闭的自我改进的自我改进框架,共同发展培训数据和模型能力。具体地说,鉴于一个基础数据集和基础模型,C2-Evo通过跨模式数据演变循环和数据模型演变基准循环来增强这些数据。 以前的循环扩大了基础数据集,通过生成复杂的模型模型模型模型、分解的升级模型和滚动的滚动模型,同时选择结构化的次级模型和不断升级的升级的模型,然后又进行模拟的升级的升级的升级的系统。
Article 239
Title@2025-07-29 (2): Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models
Title: Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models | Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden | 人类和大语言模式产生的文本的语言和嵌入式图解 2507.13614v2 |
Authors (2): Sergio E. Zanotto, Segun Aroyehun
The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written and machine-generated texts, our study focus on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality and we use them for characterizing human-written and machine-generated texts along with different sampling strategies, repetition controls and model release date. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to an homogenization of machine-generated texts.
大型语言模型(LLMS)的快速发展大大提高了他们创造自然语言的能力,使LLMS产生的文字与人文文本越来越无法区分。虽然最近的研究主要侧重于利用LLMS将文字分类为人文文本和机器生成的文本,但我们的研究重点是利用不同语言层次,如形态学、语法和语义学等一系列语言特征对这些文本进行定性。我们选择了一套涵盖8个领域、由11个不同的LMS制作的人类书写和机器生成的文本的数据集。我们计算了不同语言特征,如依赖长度和情感性等,我们用它们来描述人类书写和机器生成的文本以及不同的抽样战略、重复控制和发布日期。我们的统计分析表明,人类书写文本往往展示更简单的合成结构和更多样化的语义内容。此外,我们计算了我们各模型和机器文本的变异性。人类和机器文本都显示了不同的语言多样性,在我们的特征上表现出更大的差异。最后,我们应用样式嵌入式和机器生成的文本以及不同的样本,以进一步测试人类-机器版本之间的变异性。
Article 240
Title@2025-07-29 (2): Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri
Title: Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri | Achten Sie auf die Sprachlücke in digitalen Geisteswissenschaften: LLM-Aided Translation of SKOS Thesauri | 注意数字人文中的语言差距:SKOS Thesauri的LLM辅助翻译 2507.19537v2 |
Authors (4): Felix Kraus, Nicolas Blumenröhr, Danah Tonne, Achim Streit
We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE is suitable to enhance the accessibility, reuse, and cross-lingual interoperability of thesauri by hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.
我们引入了开放源码、模块化和即时使用管道WOKIE, 用于SKOS Thesauri的自动翻译; 这项工作解决了数字人文学(DH)的迫切需要,语言多样性可以限制知识资源的获取、再利用和语义互操作性; WOKIE将外部翻译服务与使用大语言模型(LLMS)的有针对性的改进结合起来,平衡翻译质量、可缩放性和成本; 设计该应用程序要用日常硬件运行,并且容易扩展,应用程序不需要在机器翻译或LMS方面事先具备专业知识; 我们用不同参数、翻译服务和LLMS对若干DHsauri的15种语言进行WOKIE, 系统分析翻译质量、性能和本体匹配性改进。 我们的结果表明,WOKIE适合通过无障碍自动翻译和改进本体匹配性能,支持更具包容性和多语种的研究基础设施,提高这些语言的无障碍性能、再利用性和跨语言互操作性。
Article 241
Title@2025-07-29 (2): Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
Title: Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator | Zeichen als Zeichen: Ein retrieval-erweiterter Mehrsprachiger Zeichen-Generator | 标为 Tokens 的符号: 一个检索增强的多语种手语手语生成器 2411.17799v3 |
Authors (5): Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task-sign language generation (text-to-sign)-remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE.
手语是一种视觉语言,包含自然语言的所有语言特征,是聋人和听力困难社区的主要沟通方法。虽然许多研究成功地调整了手语翻译(手对文本)的预先培训语言模式(LMs),但倒置任务符号语言生成(文本对手语)的剩余部分基本上没有探索。在这项工作中,我们引入了多语言手语模式(Tokens ),即Tokens 符号(SOKE),它可以生成3D符号自动从文本输入中自动转换,使用预先培训的LM。为了将手语与LM统一起来,我们使用一个拆分解的代号符号,将连续信号分解成代表各身体部分的象征序列。在解码过程中,与将所有部分符号都划为单一序列并同时预测一个符号的现有方法不同,我们建议一种多头解码方法,能够同时预测多个符号。这个方法提高了推论效率,同时保持不同身体部分的有效信息融合。为了进一步简化生成过程,我们建议一种将连续信号分解为代表质量的精确度的方法。我们建议,将ShanhanG 展示了精确度信号的校正读质量。
Article 242
Title@2025-07-29 (2): MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation
Title: MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation | MAGIC: Multi-Hop- und Graphenbasierter Benchmark für Inter-Kontext-Konflikte in der retrieval-generierten Generation | MAGIC: 回收后一代人中多重和基于图表的多重和基于图表的相互冲突基准 2507.21544v1 |
Authors (3): Jungyeon Lee, Kangmin Lee, Taeuk Kim
Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
现有调查这一现象的基准有显著的局限性,包括对问题解答设置的狭隘关注、对实体替代技术的高度依赖以及有限的冲突类型。为了解决这些问题,我们提议了一个基于知识的图(KG)框架,在两个相似但又截然不同的背景之间产生不同和微妙的冲突,同时确保通过KGs明确的关系结构进行解释。关于我们基准的实验结果(MAGIC)提供了对LLMs内部关于知识冲突的探索性洞察力:开放源和专有模式与冲突探测斗争 – – 特别是在需要多点推理的情况下 – – 往往未能确定矛盾的确切来源。最后,我们提出深入分析,作为改进LLMs整合多样性、有时甚至相互矛盾的信息的基础。
Article 243
Title@2025-07-29 (2): Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language
Title: Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language | Modern Uighur Dependency Treebank (MUDT): Ein integriertes morphosyntaktisches Framework für eine ressourcenarme Sprache | 现代维吾尔依赖性树库(MUDT): 一种低资源语言综合磷合成法框架 2507.21536v1 |
Authors (4): Jiaxin Zuo, Yiquan Wang, Yuan Pan, Xiadiya Yibulayin
To address a critical resource gap in Uyghur Natural Language Processing (NLP), this study introduces a dependency annotation framework designed to overcome the limitations of existing treebanks for the low-resource, agglutinative language. This inventory includes 18 main relations and 26 subtypes, with specific labels such as cop:zero for verbless clauses and instr:case=loc/dat for nuanced instrumental functions. To empirically validate the necessity of this tailored approach, we conducted a cross-standard evaluation using a pre-trained Universal Dependencies parser. The analysis revealed a systematic 47.9% divergence in annotations, pinpointing the inadequacy of universal schemes for handling Uyghur-specific structures. Grounded in nine annotation principles that ensure typological accuracy and semantic transparency, the Modern Uyghur Dependency Treebank (MUDT) provides a more accurate and semantically transparent representation, designed to enable significant improvements in parsing and downstream NLP tasks, and offers a replicable model for other morphologically complex languages.
为解决维吾尔自然语言处理(Uyghur自然语言处理(NLP)中的关键资源缺口,本研究引入了一种依赖性说明框架,旨在克服现有树库对低资源、混凝土语言的限制,该清单包括18种主要关系和26个亚型,具体标签包括条纹:无异词条款零和细微工具功能的内写:case=loc/dat。为了实证这种量身定做方法的必要性,我们使用经过培训的普遍依赖性分类师进行了跨标准评价。分析显示说明有47.9%的系统性差异,指出处理维吾尔特定结构的普遍办法不足。根据九项说明性原则,确保字型准确性和语义透明,现代维吾尔依赖性树库(MUyghur Dependenity Treebank)提供了更准确、更透明的代表性,目的是在区分和下游线任务方面实现重大改进,并为其他变形复杂语言提供可复制的模式。
Article 244
Title@2025-07-29 (2): Automatic Classification of User Requirements from Online Feedback – A Replication Study
Title: Automatic Classification of User Requirements from Online Feedback – A Replication Study | Automatische Klassifizierung der Benutzeranforderungen aus Online-Feedback – Eine Replikationsstudie | 在线反馈用户要求自动分类 – – 复制研究 2507.21532v1 |
Authors (7): Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary, Cole Thompson, Jacob Idoko, Aakash Sorathiya, Gouri Ginde
Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), “Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning”, which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study’s replication readiness; however missing environment setup files would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study.
自然语言处理(NLP)技术在需求工程(RE)领域被广泛应用,以支持分类和模糊性探测等任务。虽然RE研究植根于经验调查,但对复制RE(NLP4RE)研究的NLP(NLP4RE)研究重视有限。NLP的快速发展领域为高效的、机器辅助工作流程创造了新的机会,这可以为前沿带来新的视角和结果。因此,我们复制并推广了先前的NLP4RE研究(基线),“利用深层学习对小型数据集环境的在线反馈的用户复制要求进行分类”,评估了用户审查对需求分类的不同深层学习模式。我们利用公开发布源代码复制了原始结果,从而有助于加强基线研究的外部有效性。我们随后通过评价外部数据集的模型性能和将结果与GPT-4零光分解仪进行比较,我们为基线研究准备的复制研究ID卡,对于评估复制准备情况非常重要。结果显示,不同模型的再复制程度不同,而Nayes显示不同模型,展示了精确性复制的复制性文件,结果也鼓励了我们的GRBL的外部升级性研究。 我们的学习基础性研究。 我们的模型显示了我们的深层的外部研究。我们的研究。我们的研究, 我们的升级的升级的模型显示了我们的深层研究。
Article 245
Title@2025-07-29 (2): HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
Title: HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation | HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation | HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v2 |
Authors (8): YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Yue Shen, Jian Wang, Peng Wei
Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often \textit{lack a granular focus on RAG task} or \textit{a deeper utilization of chain-of-thought processes}. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
重新获取增强的一代(RAG)已成为解决大型语言模型在处理实时信息和特定领域问题方面所面临的挑战的基本范例。传统的RAG系统主要依赖大语言模型本身的内流学习能力。然而,关于RAG生成模型所需的具体能力的深入研究仍然缺乏,导致文件质量和检索系统不完善的挑战。即使是微调的RAG基因化模型常常\textit{缺乏对RAG任务的微调焦点}或更深入地利用连锁思考进程。为了解决这个问题,我们建议RAG模型应当拥有三种逐步的等级能力:(1) 过滤:选择相关信息的能力;(2) 合并:将各段落的语义信息结合起来的能力;(3) RAG特定推理:利用内部知识进一步处理外部知识的能力。 因此,我们引入了我们新的RAG 指令微调方法,Sierarshi-Sqourat-Retal-RetailQQQQ , 更深入地利用不断更新的MARAAAA 测试战略, 大幅改进HAGAG-BA的模型, 测试战略, 改进HAG-BAG-S-strual-strual-strual-strual-strual-strual-straking-strual-strual-stris-strat-strat-stris
Article 246
Title@2025-07-29 (2): TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling
Title: TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling | TriangleMix: Ein verlustfreies und effizientes Aufmerksamkeitsmuster für den langen Kontext | 三角组合:长期预填无损高效关注模式 2507.21526v1 |
Authors (6): Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.
大型语言模型(LLMS)依赖于关注机制,其时间复杂性随着输入序列长度的四倍增长,在预填阶段造成了重大的计算瓶颈。现有的静态稀疏关注方法通常会降低准确性,而动态宽度方法则会由于时间稀少的指数估计而引入额外的计算间接费用。为了解决这些局限性,我们建议三角Mix(TridgeMix),这是一个没有培训的新型静态关注模式。三角Mix在浅层使用密集的注意力,在深层将三角Mix切换成三角形稀疏模式。广泛的实验表明,三角Mix将注意力管理减少3.7x至15.3x深层,并且在不牺牲模型准确性的情况下,将时间到一至二十八K(TTFT)之间的总时间长度减少12%至32%。此外,三角Mix可以与动态宽度方法无缝地结合,以进一步加速速度,例如,在128K时速加快了19%的MInference,突出其提高LM推算效率的潜力。
Article 247
Title@2025-07-29 (2): Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting
Title: Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting | Modellfreies Spekulatives Dekodieren für Transformer-basierte ASR mit Token-Map-Entwurf | 采用 Token 地图起草的基于变换器的ASR无示范投机代号 2507.21522v1 |
Authors (4): Tuan Vu Ho, Hiroaki Kokubo, Masaaki Yamamoto, Yohei Kawaguchi
End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, hence limiting deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose \emph{Token Map Drafting}, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured, low-perplexity domains without sacrificing transcription accuracy. Experimental results demonstrate decoding speed-ups of $1.27\times$ on the CI-AVSR dataset and $1.37\times$ on our internal dataset without degrading recognition accuracy. Additionally, our approach achieves a $10\%$ absolute improvement in decoding speed over the Distill-spec baseline running on CPU, highlighting its effectiveness for on-device ASR applications.
以威斯伯等变压器结构为基础的端到端自动语音识别(ASR)系统提供了高转录准确性和稳健性。 但是,它们的自动递减解码技术在计算上成本高昂,从而限制了在基于CPU和资源限制的装置上的部署。 投机解码(SD)通过使用一个较小的模型草案来提出候选人标牌来缓解这一问题,然后由主模型来验证。 但是,对于缺少像GPUs这样的硬件加速器的设备来说,这种方法是不切实际的。 为了解决这个问题,我们建议采用无模型的SD技术来消除对单独模式草案的需要。 相反,我们利用了从特定领域培训数据中衍生出来的预先配置的ngram代号地图,从而能够以最小的间接费用有效地进行投机解码。 我们的方法大大加快了ASR在结构化、低易碎度域中的推断力,同时不牺牲了校正准确性。 实验结果显示,CI-AVSR数据的解码速度上升速度为1.27美元,在AVSR数据设置和A.37\xximates deminal laimmedegraphilling the 10 内部数据升级。
Article 248
Title@2025-07-29 (2): Simulated patient systems are intelligent when powered by large language model-based AI agents
Title: Simulated patient systems are intelligent when powered by large language model-based AI agents | Simulierte Patientensysteme sind intelligent, wenn sie von großen modellbasierten AI-Agenten angetrieben werden | 由大型语言模型型人工智能代理器供电时,模拟的病人系统是智能系统 2409.18924v3 |
Authors (23): Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Xiang Li, Jingxian He, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Yongfeng Zhang, Yanqiu Xing, Themistocles L. Danielle S. Bitterman, Themistocles L. Assimes, Xin Ma, Lin Lu, Lizhou Fan
Simulated patient systems play an important role in modern medical education and research, providing safe, integrative medical training environments and supporting clinical decision-making simulations. We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents. The system incorporates the Retrieval Augmented Generation (RAG) framework, powered by six task-specific LLM-based AI agents for complex reasoning. For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data from the Medical Information Mart for Intensive Care (MIMIC)-III database. Primary outcomes showcase the system’s intelligence, including the system’s accuracy in Electronic Record (EHR)-based medical Question Answering (QA), readability, robustness, and stability. The system achieved a QA accuracy of 94.15% when all six AI agents present, surpassing benchmarks with partial or no agent integration. Its knowledgebase demonstrated high validity (F1 score=0.89). Readability scores showed median Flesch Reading Ease at 77.23 and median Flesch Kincaid Grade at 5.6, indicating accessibility to all medical professionals. Robustness and stability were confirmed with non-significant variance (ANOVA F-value=0.6126, p > 0.1; F-value=0.782, p > 0.1). A user study with medical students further demonstrated that AIPatient offers high fidelity, strong usability, and effective educational value, performing comparably or better than human-simulated patients in medical history-taking scenarios. The promising intelligence of the AIPatient system highlights its potential to support a wide range of applications, including medical education, model evaluation, and system integration.
模拟的患者系统在现代医疗教育和研究中发挥重要作用,提供安全、综合的医疗培训环境,并支持临床决策模拟。我们开发了AIPatient,这是一个智能模拟病人系统,由基于大语言模型的AI代理商驱动。该系统包含检索增强型(RAG)框架,由6个特定任务 LLM 的AI代理商驱动,进行复杂推理。模拟现实中,该系统还由AIPatient KG(知识图形)驱动,该系统由强化护理医疗信息 Mart(MIMIIC)-III数据库中已查明的真实病人数据构成。初级结果展示了该系统的智能,包括基于电子记录(EHR)的医疗问答(QAAA)的准确性、可读性、稳健性和稳定性。当所有6个AI代理商都在场时,在部分或无代理支持下超过了基准,其知识基础显示甚高效力(F1分为0.89)。在77.23年的FLES-NOS-可读性应用中位显示ES-LA的准确性,在FALS-SALS-S-SAL Syal Syal Syal Syal-Serviewerviewal 中显示整个医学系统上显示其稳定性为A-A-Syal-Syal-Syal-Syal-S-Sylegal-S-S-S-S-S-S-S-S-S-SAL-IA-S-S-S-S-S-S-Sy-S-S-S-IGISAL-Sylation-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-
Article 249
Title@2025-07-29 (2): What Does it Mean for a Neural Network to Learn a “World Model”?
Title: What Does it Mean for a Neural Network to Learn a “World Model”? | Was bedeutet es für ein neurales Netzwerk, ein “Weltmodell” zu lernen? | 神经网络学习“世界模型”意味着什么? 2507.21513v1 |
Authors (3): Kenneth Li, Fernanda Viégas, Martin Wattenberg
We propose a set of precise criteria for saying a neural net learns and uses a “world model.” The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent “state space” of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a “world model” is not a trivial consequence of the neural net’s data or task.
我们提出一套精确的标准来说明神经网的学习,并使用“世界模型 ” 。 目标是给经常非正式使用的术语一个操作意义,以便为实验性调查提供一个共同的语言。 我们特别侧重于代表世界潜在的“状态空间”的想法,将行动的效果建模留给今后的工作。 我们的定义以线性研究文献的理念为基础,并通过数据生成过程的表述正式确定计算要素的概念。 定义的一个基本补充是一系列条件,以检查这种“世界模型”不是神经网数据或任务的一个微不足道的后果。
Article 250
Title@2025-07-29 (2): Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Title: Persona Vectors: Monitoring and Controlling Character Traits in Language Models | Persona-Vektoren: Überwachung und Kontrolle von Charaktereigenschaften in Sprachmodellen | 人向量:监测和控制语言模式中的字符轨迹 2507.21509v1 |
Authors (5): Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Large language models interact with users through a simulated ‘Assistant’ persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model’s activation space-persona vectors-underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant’s personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
大型语言模型通过模拟的“ 助理” 人性来与用户互动。 虽然助理通常被训练成有用、无害和诚实, 但有时会偏离这些理想。 在本文中, 我们确定该模型激活空间- 人矢量的方向, 以若干特征为基础, 如邪恶、 交错和致幻倾向 。 我们确认这些矢量可用于监测助理在部署时间的个性波动。 然后我们应用人矢量来预测和控制培训期间发生的个性变化。 我们发现, 微调后预期的和意外的个性变化与相关个人矢量的转变密切相关。 这些转变可以通过热后干预来缓解, 或者首先通过新的预防性指导方法来避免。 此外, 人矢量可以用来标出在数据集一级和单个样本一级产生不可取的个性变化的数据。 我们提取个人矢量的方法是自动的, 并且可以应用到任何个性利益特征, 仅提供自然语言描述 。
Article 251
Title@2025-07-29 (2): The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Title: The Carbon Cost of Conversation, Sustainability in the Age of Language Models | Die CO2-Kosten des Gesprächs, Nachhaltigkeit im Zeitalter der Sprachmodelle | 对话的碳成本、语言模式时代的可持续性 2507.20018v2 |
Authors (6): Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter
Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centre cooling exacerbates water scarcity in vulnerable regions. Systemic challenges corporate greenwashing, redundant model development, and regulatory voids perpetuate harm, disproportionately burdening marginalized communities in the Global South. However, pathways exist for sustainable NLP: technical innovations (e.g., model pruning, quantum computing), policy reforms (carbon taxes, mandatory emissions reporting), and cultural shifts prioritizing necessity over novelty. By analysing industry leaders (Google, Microsoft) and laggards (Amazon), this work underscores the urgency of ethical accountability and global cooperation. Without immediate action, AIs ecological toll risks outpacing its societal benefits. The article concludes with a call to align technological progress with planetary boundaries, advocating for equitable, transparent, and regenerative AI systems that prioritize both human and environmental well-being.
GPT-3和BERT等大型语言模型(LLMs)实现了自然语言处理(NLP)的革命性,但其环境成本仍然被严重忽视。本篇文章批评LLMs的可持续性,通过GPT-4等模型和Mistral 7B等节能替代品的案例研究量化其碳足迹、水的使用和对电子废物的贡献。培训一个LLM每年可以排放相当于数百辆汽车的二氧化碳,而数据中心冷却则加剧了脆弱区域的缺水问题。系统性挑战:企业洗绿、冗余模式开发和监管真空使全球南部边缘化社区长期遭受伤害,负担过重。然而,可持续NLPs:技术创新(例如模式裁剪裁、量计算)、政策改革(碳税、强制性排放报告)和文化转变(将必要性置于新颖之上。通过分析工业领导人(Google、微软)和滞后者(Amazon),这项工作强调了道德问责和全球合作的紧迫性。如果没有立即采取行动,AIs生态风险就超过其社会效益。文章最后是:技术创新(例如,要求全球)的优先度,要求将技术进步与全球系统联系起来。
Article 252
Title@2025-07-29 (2): Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Title: Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach | Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz | 努力利用LLM女士实现可靠的证据生产:神经-双曲方法 2505.14479v4 |
Authors (3): Oren Sultan, Eitan Stern, Dafna Shahaf
Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs’ generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI’s o1 model (58%-70% improvement); both analogous problems and the verifier’s feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
大型语言模型(LLMS)与需要严格逻辑推算和符号推理的正式领域(如数学校准生成)相抗衡,我们建议一种神经-顺理成章的方法,将LLMS的基因特长与结构化组成部分相结合,以克服这一挑战。作为一个概念的证明,我们侧重于几何问题。我们的方法有两个方面:(1) 我们找出类似的问题,用其证明来指导LLM;(2) 一个正式的核查员评估产生的证明并提供反馈,帮助模型修正错误的证明。我们证明,我们的方法大大提高了O1模型(58%-70%的改进率)的证明准确性; 相似的问题和验证人的反馈都有助于这些成就。 更广泛地说,转向能够产生可辨别正确结论的LLMs可以大大提高其可靠性、准确性和一致性,解锁复杂的任务和需要信任的关键性真实世界应用。
Article 253
Title@2025-07-29 (2): VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Title: VN-MTEB: Vietnamese Massive Text Embedding Benchmark | VN-MTEB: Vietnamesisch Massiver Text Einbettung Benchmark | VN-MTEB:越南大规模文本嵌入基准 2507.21500v1 |
Authors (5): Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang
Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks. Datasets are available at HuggingFace: https://huggingface.co/collections/GreenNode/vn-mteb-68871433f0f7573b8e1a6686
越南在互联网交通和在线毒性方面名列前茅。因此,在应用中实施建议和内容控制义务的嵌入模型至关重要。然而,由于在数量和任务多样性方面缺乏大规模测试数据集,科学家很难在将AI模型部署到现实世界、大型项目之前对AI模型进行有效评价。为解决这一重要问题,我们引入了越南基准,即VN-MTEB,用于嵌入模型,我们通过翻译大量英文样本,利用我们新的自动化框架,从Massive Text嵌入基准中找到大量英文样本。我们利用大型语言模型和尖端嵌入模型的优势,进行翻译和过滤流程,以保留高质量的样本,保证语言和语义的自然流动,同时保留名称实体识别(NER)和代码夹。我们的综合基准包括41个数据集,这些数据集来自专门为越南文本嵌入而设计的6项任务。我们的分析发现,使用扶轮性定位嵌入模型的更大、更复杂的模型超越了在嵌入Gregggggageding 73681/Grealfface中使用绝对定位的模型。数据设置:httpsetsetsats:http://Greglegleglesh3333/Face-ffax187186186/Fading。
Article 254
Title@2025-07-29 (2): Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models | Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen | 为采用大语言模式的高级指示提供激励理由 2506.01413v5 |
Authors (9): Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Codes and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
现有大型语言模型(LLMS)面临遵循复杂指令的挑战,特别是当存在多种制约因素,并在平行、链条和分支结构中组织多种制约时。一个直观的解决方案,即思维链(CoT),有望普遍提高LLMs的能力。然而,我们发现,香草CO(LLMS)由于其肤浅的推理模式,简单地将指示抛光,对业绩产生了负面影响。它未能消除在确定不同类型和层面的等级关系方面存在的制约的构成。为此,我们建议RAIF(RAIF)采用系统方法,通过激励测试时间计算缩放的推理,促进LMM(CLM)处理复杂的指令。首先,我们在现有的分类中将复杂的指令分解,并提出可再生数据获取的方法。第二,我们利用强化学习(RLLL),用可核查的规则中心奖赏信号,专门为随后的教学提供推理。我们通过样本比对COT执行进行简单对比,从浅、非典型的推理的推理学性质。我们还利用了专家行为演法行为规范,在测试11 将可比较的RLMMMLMS(LM) 进行可比较的演化,从快速分析,从而确认可判的精确性地将精确性分析。
Article 255
Title@2025-07-29 (2): Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Title: Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning | Low-Confidence Gold: Verfeinerung von Low-Confidence-Proben für effizientes Instruktionstuning | 低信任金:改进低信任金样本,以进行高效教学计费 2502.18978v4 |
Authors (4): Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong
The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework’s efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.
培训数据集的质量和效率从根本上限制了大语言模型教学微调的有效性,这项工作引入了低保密金(LCG),这是一个新的过滤框架,采用以机器人为基础的集群和信任制为指南的选择,以确定有价值的教学配对,采用半监督办法,使用经过代表性样本培训的轻量级分类师,在保存数据多样性的同时,保存高质量的子集;实验性评价显示,对LCG过滤的6K样本子集进行微调的模型比现有方法取得优异的性能,在MT-Bench方面大有改进,在综合评价指标方面不断取得收益;框架在保持示范性业绩的同时,在保持示范性业绩的同时,为有效的指导调整确定了有希望的方向。
Article 256
Title@2025-07-29 (2): Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering
Title: Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering | Sem-DPO: Semantische Inkonsistenz bei der Preference-Optimierung für Prompt Engineering mindern | Sem-DPO: 减轻在优先优化即时工程方面的语义不一致现象 2507.20133v2 |
Authors (8): Anas Mohamed, Azal Ahmad Khan, Xinran Wang, Ahmad Faraz Khan, Shuwen Ge, Saman Bahzad Khan, Ayaan Ahmad, Ali Anwar
Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user’s intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO adjusts the DPO loss using a weight based on how different the winning prompt is from the original, reducing the impact of training examples that are semantically misaligned. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.
直接偏好优化(DPO)为自动快速工程提供了比RL更轻、更宽松的替代政策,但是其象征性的正规化使得语义不统一,因为赢得更高偏好分数的提示仍然可以偏离用户的预期含义。我们引入了Sem-DPO,这是DPO的变种,它保留了语义一致性,但保持了其简单和效率。Sem-DPO根据胜出速度与原来的不同程度,调整了DPO损失。直接偏好优化(DPO)提供了比RLL更轻、更不切合语义的替代政策(DPO) , 但它象征性的正规化使语义性变化不受限制, 表明SEM-DPO仍然在原始文本中一个可辨别不透视的周边学习快感。 关于三个标准文本到图像快速优化基准和两种语言偏好模式, Sem-DPO 实现了8-12 % 更高的CLIP 相似性和5-9% 更高的培训范例, 高分级的SpickS
Article 257
Title@2025-07-29 (2): The pitfalls of next-token prediction
Title: The pitfalls of next-token prediction | Die Fallstricke der Next-Token-Vorhersage | 下吨预测的陷阱 2403.06963v3 |
Authors (2): Gregor Bachmann, Vaishnavh Nagarajan
Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction – autoregressive inference and teacher-forced training – must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner – remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using teacherless training, a simple modification using dummy tokens that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures
仅仅下到脚的预测人能忠实地模拟人类智能吗?我们将这种新出现的关切具体化,纠正围绕它出现的流行误解,并提倡简单多到目的。作为一个起点,我们主张必须明确对待经常被组合的下到的预测的两个阶段 – – 自动递进推论和师资强制培训。错误在自动递进推论期间可能加剧的流行批评,关键地假设教师拒绝教师已经学会了准确的下到脚的预测人。这一假设回避了我们暴露的一个更深层次的问题:在某些任务类别中,教师拒绝教师可能只是无法首先学习准确的下到下到的预测人。我们描述了教师推举失败的一般机制,并设计了一个最起码的规划任务,使变异和曼巴结构在这种方式上都有可能失败 – – 尽管任务非常简单易学。最后,我们提供了初步证据,证明这一失败可以通过_教师无到脚的训练来解决。一个简单的修改,用下一个假象来预示多重的标志。我们希望在未来的探索中找到我们未来的准则。
Article 258
Title@2025-07-29 (2): Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs
Title: Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs | Verbesserung der Aufgabenvielfalt bei der Label-Effizient überwachten Feinsteuerung von LLMs | 改进LLMML在标签高效监督监督下改进任务多样性 2507.21482v1 |
Authors (4): Abhinav Arabelly, Jagrut Nemade, Robert D Nowak, Jifan Zhang
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation – a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on the prompt-diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4\% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80\%.
大型语言模型(LLMS)在不同领域表现出了非凡的能力,但为专门应用开发高效模型往往需要大量的人力说明 – – 这一过程耗时费时、劳力密集和昂贵。在本文中,我们通过利用任务多样性作为有效数据选择的基本原则,解决了监督微调的标签效率高的学习问题。这与基于迅速多样性的现有方法明显不同。我们的方法基于两项关键观察:1) 不同提示的任务标签往往随时可得;2) 预先培训的模型在各项任务之间具有不同的信任度。我们将这些事实结合起来,以设计一个简单而有效的抽样战略:我们采用反信任加权战略在各项任务之间选择实例。这产生了与经过更复杂取样程序培训的类似或更好的模型,但执行起来要容易得多,计算强度要小得多。值得注意的是,我们的实验结果表明,这种方法比关于完整数据集的培训(MMLU分数增加4);2) 各种说明预算和两个指示对数据集进行了显著的不同程度的信任度。我们用逆向80或高于现有最佳方法水平的算法来降低或降低现有成本。
Article 259
Title@2025-07-29 (2): Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench
Title: Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench | Welche LLMs bekommen den Spaß? Mit HumorBench nicht-STEM-vernünftige Fähigkeiten beweisen | 哪个LLMs得到的笑话? 2507.21476v1 |
Authors (8): Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S. L. Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, Lalit Jain
We present HumorBench, a benchmark designed to evaluate large language models’ (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplays, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated based on their explanations towards the humor and abilities in identifying the joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling by increasing thinking token budgets yields mixed results across different models in humor reasoning.
我们介绍了Humorbench, 这是一项旨在评价大型语言模型(LLMs)在漫画字幕中解释和解释精密幽默能力的基准,即Humorbench(LLMs),这是用来评价大型语言模型(LLMs)在漫画标题中解释和解释精密幽默的能力的基准。由于推理模型越来越饱和数学和科学领域的现有基准,因此,在STEM领域之外对示范情报进行新颖和具有挑战性的评估至关重要。理性模型从根本上涉及基于文字的幽默理解,要求确定卡通/戏剧和外部文化参考、文字剧和其他机制的概念之间的联系。Humorbench(Humer Caption Contest and Cartoonstock.com)包括大约300对独特的卡通卡通插配对。 Humorbench(LM) com, 由专家附加注释的评价标注了基本的笑话要素。LLMSMs是根据其对确定笑话元素的幽默和能力的解释进行评估的。为了很好地完成这项任务,模型必须形成和测试关于概念之间关联的假设,可能从最初的解释到最可信的解释。我们对STEEM推理学的推理向幽默理解的广泛基准提供了三个专门训练的模型。
Article 260
Title@2025-07-29 (2): BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Title: BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data | BIG5-CHAT: LLM-Persönlichkeiten durch Schulung auf menschenverändernden Daten gestalten | BIG5-CHAT:通过提供人际数据培训塑造专业人才 2410.16491v3 |
Authors (6): Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, Maarten Sap
In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in language. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.
在这项工作中,我们应对将现实的人格特征纳入LLMs的挑战。以前的做法主要侧重于以迅速为基础的方法描述与期望的个性特征有关的行为,这些特征受到现实主义和有效性问题的影响。为了解决这些局限性,我们引入了BIG5-CHAT,这是一个大型的数据集,包含10万个对话,旨在将模型定位为人类如何用语言表达其个性。利用这一数据集,我们探索以监督性微调和直接偏好为基于培训的方法,使LMS更自然地与人的个性模式接轨。我们采用的方法超越了诸如BFI和IPIP-NEO等个性评估的速效方法,其特质相关性与人类数据更接近。此外,我们的实验显示,经过培训的模型在推理任务上表现得更好,与关于这些特征如何影响人类认知性表现的心理发现相一致。我们了解,这是第一次全面研究,以显示基于培训的方法如何通过学习真实人类行为塑造LM人格特征。
Article 261
Title@2025-07-29 (2): Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
Title: Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning | Weiche Einspritzung von Task-Embeddings Outperforms Prompt-Based In-Context Learning | 任务嵌入器的软输入超出迅速基于信息学习的绩效 2507.20906v2 |
Authors (2): Jungwon Park, Wonjong Rhee
In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.2%-14.3% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones – underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.
文本中学习(ICL) 使大语言模型(LLMS) 能够通过对快速输入输出示例进行调整来完成任务, 无需在模型参数中作任何更新。 虽然广泛采用, 但仍不清楚用多个示例来提示任务信息是否是最有效和高效的方式。 在这项工作中, 我们建议对任务嵌入进行柔性输入。 任务嵌入仅仅在使用微小的 ILLL 提示和反复在推断中使用一次。 软性混合任务通过软性混装任务来进行, 并使用优化前混合参数来启动关注头部启动, 被称为软性快速化头部选择参数。 这种方法不仅允许在不进行即时演示的情况下执行所期望的任务, 并且大大优于现有的 ICLL 方法, 同时减少记忆用量, 并在时间推移时再计成本。 广泛评价了57项任务中的4B 至70B 。 平均为57项任务, 我们的方法在10张的 ICLLO 上快速化模型, 以10%-14.3% 移动头部定位定位, 在12 LLMs 上, 分析一个类似的任务转移任务的方法。 分析我们的任务。
Article 262
Title@2025-07-29 (2): Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour
Title: Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour | Auf dem Weg zu lokal einsetzbaren großformatigen großformatigen Sprachmodellen für Modewahlverhalten | 以当地可部署的优质因果大语言模式进行模式选择行为 2507.21432v1 |
Authors (2): Tareq Alsaleh, Bilal Farooq
This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task. We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions. Beyond predictive accuracy, we evaluate models generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory. LiTransMC, fine-tuned using parameter efficient and loss masking strategy, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset. This dual improvement, i.e., high instant-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability. Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment.
这项研究调查了在旅行模式选择预测中采用可在当地部署的开放性因果大型语言模型(LLMS)的情况,并介绍了LiTransMC,这是为这项任务开发的第一个经精细调整的因果性LMM。我们系统地将11个LMS(1-12B参数)的基准基准设定为三个公开的优惠数据集,测试396个配置,并生成79 000多个合成通勤者预测。除了预测准确性外,我们评估模型还利用BERTopic进行主题模型和新颖的解释力指数进行推理,首次结构化分析LITransMC如何根据行为理论来阐述扩大决定因素。LiTransMC,利用参数高效和损失掩码战略进行微调,实现了0.6845分的加权F1分和0.007025的Jensen-Shannon Divergence, 超过了未调整的本地模型和更大的专利系统,包括GPT-4o, 高级人物的推断和嵌入式装式装,同时超越了典型模式选择方法,例如独立选择模型和机器学习分解等数据集。这种双重改进,即时支持级的精确级精确级精确精确精确精确精确精确和接近的精确分析,同时解释,将这种可操作性分析、可操作化的精确化的精确和接近的逻辑分析分析,并解释和接近性精确性分析分析分析分析分析分析分析分析分析分析分析分析,并解释,这些可进行基础分析,通过基础的计算分析,并解释,通过可操作性演算算法性分析,通过基础分析,通过可操作性推算法性演算算算算制制制制制制制式的计算。
Article 263
Title@2025-07-29 (2): LLAMAPIE: Proactive In-Ear Conversation Assistants
Title: LLAMAPIE: Proactive In-Ear Conversation Assistants | LLAMAPIE: Proaktive In-Ear-Gesprächsassistenten | LLAMAPIE: 主动的在轨在轨对话助理 2505.04066v2 |
Authors (5): Tuochao Chen, Nicholas Batchelder, Alisa Liu, Noah Smith, Shyamnath Gollakota
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPie to enhance live conversations.
我们引入了第一个实时主动助理Llamapie(Llamapie),这是第一个旨在通过可听设备提供的简明指导加强人类对话的实时主动助理。与需要明确用户使用的传统的语言模式不同,该助理在背景中运作,预见用户的需要,而不中断交谈。我们应对若干挑战,包括确定何时作出反应,起草简明的应对措施,加强对话,利用用户知识提供背景认知援助,实时、实时、设备处理。为了实现这一目标,我们建立了一个半合成对话数据集,并提出一个双模样的管道:一个小模型,决定何时作出反应,一个更大的模型,产生反应。我们评估了我们关于现实世界数据集的方法,展示了它在提供无干扰的帮助方面的有效性。与我们的助理进行的用户研究,在苹果硅M2硬件上实施,显示积极助理在没有援助和反应模型的基线上都非常偏好,强调LlamaPie加强现场对话的潜力。
Article 264
Title@2025-07-29 (2): Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling
Title: Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling | Bergbau-Intrinsische Belohnungen aus LLM-Hidden States für effiziente Best-of-N-Probenahme | LLM隐藏国为高效率最佳采样而从LLM公司获得的采矿内部奖赏 2505.12225v2 |
Authors (4): Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu
Enhancing Large Language Model (LLM)’s performance with best-of-N sampling is effective and has attracted significant attention. However, it is computationally prohibitive due to massive, data-hungry text-based reward models. By changing the data source from text to hidden states, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel, lightweight technique that leverages the rich information embedded in LLM hidden states to address these issues, which operates on token-level and consists of only linear layers. Extensive experiments show that SWIFT outperforms baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training, demonstrating significant efficiency improvement. SWIFT’s robust scalability, applicability to some closed-source models via logits, and ability to be combined with traditional reward models to yield further performance gains underscore its practical value.
提高大语言模型(LLM)以最佳N抽样的性能是有效的,并吸引了人们的极大关注。然而,由于大量的数据饥饿文本奖赏模式,它在计算上令人望而却步。通过将数据源从文本转换为隐蔽状态,我们引入了SWIFT(Spreaty weighted Intrinsic conference Technique ) , 这是一种新型的轻量级技术,利用LLM隐藏状态中丰富的信息来解决这些问题,该技术在象征性层面运作,仅包括线性层。 广泛的实验表明SWIFT在基线参数低于0.005%的情况下超过了基线,只需要少数几个样本来进行培训,显示出显著的效率提高。 SWIFT的强大可扩展性、通过登录对一些封闭源模型的可适用性、与传统奖赏模式相结合以产生进一步业绩收益的能力强调了其实际价值。
Article 265
Title@2025-07-29 (2): MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations
Title: MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations | MemTool: Optimierung der Kurzzeit-Speicherverwaltung für dynamisches Werkzeug beim Aufrufen von LLM Agent Multi-Turn-Konversationen | MemTool:优化短期内存管理,以便利用动态工具在LLM代理多转对话中打电话 2507.21428v1 |
Authors (5): Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke
Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90-94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0-60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
大型语言模型(LLM)代理商在动态搜索和纳入相关工具或用于个别查询的示范背景协议服务器方面表现出了强大的自主能力;然而,固定环境窗口限制了需要反复独立使用工具的多方向互动的实效;我们引入了MemTool,这是一个短期存储框架,使LMTors能够在多方向对话中动态管理工具或MCP服务器环境;MemTool提供三种代理结构:(1)自主代理模式,给予充分工具管理自主权;(2)工作流程模式,给予完全工具管理自主权的确定性控制;(3)混合模式,结合自主和确定性控制;在规模MCP基准上对13+LLMS的每个MTool模式进行评价,我们进行了100多次连续用户互动试验,测量工具清除比率(短期存储效率)和任务完成准确性;在自主代理模式中,推理LMMs实现了高工具清除效率(在3个窗口中为90-94%),而中等规模模型显示效率(0-60%);工作流程和混合模式,在13+LMLMS的基准上对工具清除进行了持续管理,而自主和混合模式则以任务完成方式为基础。
Article 266
Title@2025-07-29 (2): ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
Title: ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs | ReGATE: Schneller und besser lernen mit weniger Token in MLLMs | ReGATE:与较少的男、女、女、女、男、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女、女 2507.21420v1 |
Authors (3): Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference$-$Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student’s own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
培训多式联运大语言模型的计算成本随着所涉物证数量的增加而迅速增加。现有的效率方法主要针对推断,依靠象征性减少或合并,在培训期间提供有限的好处。在本文中,我们提议“ReGATE(参考美元-美元指导适应 Token Elision)”,这是加速MLLM培训的一种适应性象征性模拟方法。具体地说,ReGATE采用师资-学生框架,培训MLLMM作为学生,冻结的参考大语言模型(LLLM)作为教师。教师计算每吨参考损失,结合学生自身困难分数的指数移动平均数(EMA)。这种基于适应性困难的评分使关键物证的选择性处理,同时绕过远道信息较少的标志,大大减少计算间接费用。实验表明,ReGATE在应用VALLAMA2时,与MVBench最高标准培训的准确性达到2美元相比,速度更快。教师计算每吨参考物证损失,与学生自己困难分数的指数平均移动平均数(EMA)相结合。这种基于适应性的评分数的评分法评分后,很快将超过第41号的模型。额外计数将超过基准。
Article 267
Title@2025-07-28 (1): Teaching Language Models To Gather Information Proactively
Title: Teaching Language Models To Gather Information Proactively | Sprachmodelle lehren, um Informationen proaktiv zu sammeln | 积极主动地收集资料的教学语言模式 2507.21389v1 |
Authors (7): Tenghao Huang, Sihao Chen, Muhao Chen, Jonathan May, Longqi Yang, Mengting Wan, Pei Zhou
Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information – such as hidden domain expertise or fine-grained requirements – that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
大型语言模型(LLMS)日益被期望作为协作伙伴发挥作用,参与前后对话,以解决复杂、模糊的问题。然而,目前的LLMS往往在现实世界环境中摇摇欲坠,在面对不完全或未充分指明的提示时,默认被动反应或狭隘的澄清,没有主动收集对高质量解决方案至关重要的缺失信息。在这项工作中,我们引入一个新的任务模式:积极主动的信息收集,LLMS必须查明所提供环境中的差距,通过有针对性的问题从战略上获取隐含的用户知识。为了系统学习和培训这一能力,我们设计了一个可扩展的框架,产生部分具体、真实的任务,掩盖关键信息并模拟真实的模糊性。在这个设置中,我们的核心创新是一种强化的微调战略,奖励那些真正产生新的、隐性用户信息的问题 – – 例如隐藏的域专门知识或细微的分类要求 – – 而在其他方面,我们受过训练的 Quen-2.5-7B模式必须找出差距,通过有针对性的问题来大大超越O3-min 18 % 的自动评价指标。更重要的是,人类评估表明,澄清问题和最后概要分别由模型产生的42 % 和最后提要由人类的蓝图分别由人类思考结果显示的澄清结果分别由人类的模型和最后的模型产生的结果。
Article 268
Title@2025-07-28 (1): Ai2 Scholar QA: Organized Literature Synthesis with Attribution
Title: Ai2 Scholar QA: Organized Literature Synthesis with Attribution | Ai2 Scholar QA: Organisierte Literatursynthese mit Attribution | Ai2学者QA:有组织文学综述与归属 2504.10861v2 |
Authors (18): Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman
Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.
检索增强型的一代在解答文学科学问题方面越来越有效,但许多最先进的系统是昂贵和封闭源码的。我们引入了Ai2学者QA,这是一个免费的在线科学问题解答应用程序。为了便利研究,我们公布整个管道:作为定制的开放源码Python软件包和互动式网络应用程序,以及可以通过公共API和可下载数据集查阅的纸质索引。我们详细描述我们的系统,并介绍分析其关键设计决定的实验。在对最近科学QA基准的评估中,我们发现Ai2学者QA优于相互竞争的系统。
Article 269
Title@2025-07-28 (1): Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge
Title: Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge | Beyond the Reported Cutoff: Wo große Sprachmodelle auf finanzielles Wissen zurückfallen | 超越报告的截止点:大语言模式对财务知识的缺陷 2504.00042v2 |
Authors (5): Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava
Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model’s cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs’ knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. The code, prompts, and model outputs are available on GitHub.
大型语言模型(LLMS)经常被用作解答问题的知识来源,虽然人们知道LLMS可能无法获得在模型截止日之后产生的实时数据或最新数据,但不清楚其知识如何跨越历史信息。在本研究中,我们利用美国公开交易公司的金融数据评估LLMs知识的广度,评估了197个以上的问题,比较了对事实数据的示范答复。我们进一步探讨了公司特点,如规模、零售投资、机构关注和金融档案的可读性,对LLMS中知识的准确性的影响。我们的结果显示,LLMs不太了解过去的财务业绩,但它们表现出对大公司和最新信息的更深刻认识。有趣的是,我们的分析还表明,LLMS更有可能给大公司带来幻灭,特别是近些年的数据。GitHub提供了代码、提示和模型产出。
Article 270
Title@2025-07-28 (1): Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Title: Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models | Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle | 3:以完全开放的大型音频语言模式推进音频情报 2507.08128v2 |
Authors (11): Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
AF3介绍:(一) AF-Whisper,一个使用所有三种语言、声音和音乐模式联合学习的新颖战略培训的统一的录音编码器;(二) 灵活、随需求思考,允许该模式在回答之前进行思维型推理链式推理;(三) 多式、多式聊天;(四) 长期听觉理解和推理(包括演讲),最多10分钟;和(五) 语音对语音互动。为了能够发挥这些能力,我们建议采用若干大型培训数据集,利用新战略,包括AudioSkill-XL、LongAudio-XL、AF-Tink和AF-Chat, 并用新颖的五阶段课程制培训战略对AF3进行培训。仅以开放源的音频数据为培训,AF3在20+(长级)以上的音频理解和推理模型上取得了新的SOTA结果。
Article 271
Title@2025-07-28 (1): Turbocharging Web Automation: The Impact of Compressed History States
Title: Turbocharging Web Automation: The Impact of Compressed History States | Turbocharging Web Automation: Die Auswirkungen von Komprimierten Geschichte Staaten | 涡轮连载网络自动化:压缩历史国家的影响 2507.21369v1 |
Authors (4): Xiyue Zhu, Peng Tang, Haofu Liao, Srikar Appalaraju
Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.
语言模型导致了网络自动化的飞跃。 当前的网络自动化方法将当前的网络状态、历史动作和语言教学作为预测下一步行动的投入,忽略历史的重要性。 但是,网页状态的高度杂乱性质可能导致输入序列长和信息稀少,阻碍历史状态的有效利用。 在本文中,我们建议采用新的网络历史压缩器方法,用历史状态对网络自动化进行涡轮。 我们的方法使用历史压缩机模块,将每个历史状态中最与任务相关的信息压缩成一个固定长度的短期代表,减轻高度扭曲的历史状态带来的挑战。 在Mind2Web和WebLINX数据集上进行了实验,以评价我们的方法的有效性。结果显示,我们的方法比没有历史投入的基线方法获得了1.2-5.4%的绝对准确性改进。
Article 272
Title@2025-07-28 (1): StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation
Title: StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation | StructText: Ein synthetischer Table-to-Text-Ansatz für Benchmark-Erzeugung mit multidimensionaler Bewertung | 条形图文本:以多层次评价方式编制基准的基准的合成表到文本方法 2507.21340v1 |
Authors (4): Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz
Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or focused documents specific to a given organization. Building such benchmarks by manual annotations is labour-intensive and limits the size and scalability of the benchmarks. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth, and follows a two-stage ``plan-then-execute’’ pipeline to synthetically generate corresponding natural-language text. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring numeric and temporal accuracy. We evaluated the proposed method on 71,539 examples across 49 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models presume numerical and temporal information with high fidelity yet this information becomes embedded in narratives that resist automated extraction. We release a framework, including datasets, evaluation tools, and baseline extraction systems, to support continued research.
从文本中提取结构化信息,例如能够增加表格数据的关键值对等数据,在许多企业使用案例中非常有用。虽然大型语言模型(LLMS)使许多自动管道能够将自然语言转换成结构化格式,但仍缺乏评估其提取质量的基准,特别是在特定领域或特定组织特有的重点文件方面。通过人工说明建立此类基准是劳动密集型的,限制了基准的规模和可缩放性。在这项工作中,我们介绍了一个端对端框架,即自动生成使用现有表格数据从文本中自动生成高纤维化基准。它用现有表格数据作为结构化的地面真相,并遵循了两个阶段的“计划前执行”管道,以合成方式生成相应的自然语言文本。为确保文本来源和结构化来源之间的一致性,我们引入了一个多维的评价战略,将(a)基于LLM系统对事实质量、幻觉和一致性的判断以及(b)衡量数字和时间精确度的客观提取指标。我们用71、539的列表数据数据数据作为结构化的地面结构化数据,我们评估了这一方法,通过49个深度的精确度分析模型,我们用直观、直观的模型来评估,我们用直观、直观性模型来评估,然后用直观的模型来评估。
Article 273
Title@2025-07-28 (1): A Deep Learning Automatic Speech Recognition Model for Shona Language
Title: A Deep Learning Automatic Speech Recognition Model for Shona Language | Ein Deep Learning automatische Spracherkennung Modell für Shona Sprache | Shona语言深学习自动语音识别模式 2507.21331v1 |
Authors (2): Leslie Wellington Sirora, Mainford Mutandavari
This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
这项研究介绍了为Shona开发一个深层次的基于学习的自动语音识别系统,这是一种低资源语言,其特点是具有独特的音调和语法复杂性,研究的目的是应对培训数据有限、缺乏贴标签数据以及Shona演讲中出现的复杂的肾脏细微差别所带来的挑战,目的是与传统统计模型相比,在认识准确性方面实现显著改进;研究首先探讨了利用深学习为Shona开发准确的ASR系统的可行性;其次,研究探讨了为Shona演讲设计和实施深层次学习结构所涉及的具体挑战,以及减轻这些挑战的拟议战略;最后,将深层次学习模式的绩效与现有统计模型的准确性进行比较;开发的ASR系统利用了一个混合结构,其中包括声学建模革命性神经网络和语言模拟长期短期记忆网络;为了克服数据匮乏,采用了数据增强技术和转让学习技术;还纳入了关注机制,以适应Shona演讲的通俗性质;由此形成的ASR系统取得了令人印象深刻的结果,在29%的字写错误率下将深层次的ASR错误率最终提升到12 %的A类Shona语言的学习率;这些技术的升级到在12 %的深度的精确性研究中显示,这些技术的升级到提升了全球技术的精确度。
Article 274
Title@2025-07-28 (1): SQuat: Subspace-orthogonal KV Cache Quantization
Title: SQuat: Subspace-orthogonal KV Cache Quantization | SQuat: Subraum-orthogonale KV-Cache-Quantisierung | Suat: 子空间- orthogonal KV 缓存缓存量化 2503.24358v2 |
Authors (4): Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
The key-value (KV) cache accelerates LLMs decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism’s outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17 to 2.82, improves throughput by 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
键值缓存缓存会加速通过从先前生成的质证存储 KV Exctors 来解码 LLMs 。 它会减少以更多内存使用为代价的冗余计算。 为了减轻这一间接成本, 现有的方法将 KV Excors 压缩成低位表示式; 但是, 当生成更多的符号时, 可能会导致不受欢迎的输出时, 量化错误会累积起来 。 在本文中, 我们引入 SQuat ( Subspace- orthogonal KV 缓存量) 。 它首先通过查询 数组构建一个子空间, 以获取最关键的任务相关信息 。 在关键 数组化过程中, 它强制将( 量化的) 和 原始键之间的差别压缩成低位表达式表达式; 但是, 当生成更多符号时, Quat 可能会累积出量化错误对关注机制输出的影响, 可能导致不理想的结果 。 在我们开发的理论框架中, 我们通过数字实验, 显示我们的方法会将峰存减少2. 17 到 2.82, 通过245 到 3. 60 并实现更有利的QQev 。
Article 275
Title@2025-07-28 (1): Do Large Language Models Understand Morality Across Cultures?
Title: Do Large Language Models Understand Morality Across Cultures? | Verstehen große Sprachmodelle Moral über Kulturen hinweg? | 大语言模式是否理解各种文化的道德? 2507.21319v1 |
Authors (4): Hadi Mohammadi, Yasmeen F. S. S. Meijer, Efthymia Papadopoulou, Ayoub Bagheri
Recent advancements in large language models (LLMs) have established them as powerful tools across numerous domains. However, persistent concerns about embedded biases, such as gender, racial, and cultural biases arising from their training data, raise significant questions about the ethical use and societal consequences of these technologies. This study investigates the extent to which LLMs capture cross-cultural differences and similarities in moral perspectives. Specifically, we examine whether LLM outputs align with patterns observed in international survey data on moral attitudes. To this end, we employ three complementary methods: (1) comparing variances in moral scores produced by models versus those reported in surveys, (2) conducting cluster alignment analyses to assess correspondence between country groupings derived from LLM outputs and survey data, and (3) directly probing models with comparative prompts using systematically chosen token pairs. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and exhibit low alignment with empirical survey patterns. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs. We conclude by discussing the implications for the responsible development and global deployment of LLMs, emphasizing fairness and ethical alignment.
最近大语言模型(LLMs)的进展是众多领域的一个有力工具,然而,对其培训数据产生的性别、种族和文化偏见等内在偏见的持续关切,对这些技术的道德使用和社会后果提出了重大问题。本研究报告调查LLMs在多大程度上抓住了跨文化差异和道德观点的相似之处。具体地说,我们研究LLM产出是否与国际道德态度调查数据中观察到的模式相一致。为此,我们采用三种补充方法:(1)比较模型产生的道德分数与调查中报告的差异;(2)进行分组调整分析,以评估从LLM产出和调查数据中得出的国家分组之间的对应关系;(3)利用系统选择的象征性配对直接探索具有比较提示的模型。我们的结果显示,目前的LMS往往不能完全复制跨文化道德差异的全方位,倾向于压缩差异,并显示与经验调查模式不相符。这些结果突出表明迫切需要采取更强有力的办法,减少LMs的偏见,提高文化代表性。我们最后通过讨论LMs负责任发展和全球部署的影响,强调公平和道德一致性。
Article 276
Title@2025-07-28 (1): Can human clinical rationales improve the performance and explainability of clinical text classification models?
Title: Can human clinical rationales improve the performance and explainability of clinical text classification models? | Können menschliche klinische Grundlagen die Leistungsfähigkeit und Erklärbarkeit klinischer Textklassifikationsmodelle verbessern? | 人类临床原理能否改善临床文本分类模型的性能和解释性? 2507.21302v1 |
Authors (4): Christoph Metzner, Shang Gao, Drahomira Herrmannova, Heidi A. Hanson
AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don’t consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.
人工智能驱动的临床文本分类对于人口一级健康信息的可解释的自动检索至关重要。 这项工作调查了基于人类的临床理由是否可以作为额外的监督来提高基于变压器的模型的性能和解释性,从而自动编码临床文件。 我们分析了99,125个基于人类的临床理由,为初级癌症现场诊断提供了合理的解释,用这些理由作为额外的培训样本,同时用128,649份电子病理学报告来评价提取初级癌症现场的基于变压器模型。 我们还调查了充足性作为衡量预选理由的理由质量的一种方法。 我们的结果表明,作为额外培训数据的附加性能可以改善高资源情景中的模型性能,但在资源有限时则会产生不一致的行为。 我们用充足性作为自动衡量理由的衡量标准,为初级癌症现场诊断提供了合理性的解释。 关键是,经过培训的模型一直比其他报告所培训的模型要好,这表示临床理由不能不断改善模型的性能,而仅仅使用更多的报告来弥补。 因此,如果目标正在优化准确性,那么,作为补充性的说明性的努力应该侧重于比模拟性报告更精确性,而不是确定表面性的理由。 然而,我们用模拟性数据来解释性解释性数据作为结论,如果我们可以用来解释性地解释,那么,那么,那么,那么,我们用更精确性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释性地解释。 。
Article 277
Title@2025-07-28 (1): FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung | FlaignEvalMMM:综合多式联运模式评价灵活框架 2506.09081v3 |
Authors (8): Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang
We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.
我们提出FlagEvalMM(FlagEvalMM),这是一个开放源码评价框架,旨在全面评估多种视觉语言理解和生成任务,如视觉问答、文字到图像/视频生成和图像-文本检索等多种多式联运模式;我们通过独立评价服务将模型推论与评价脱钩,从而能够灵活分配资源和无缝地整合新任务和新模式;此外,FlagEvalMM(FlagEvalM)利用先进的推论加速工具(如VLLM、SGLang)和不同步的数据负荷,以大大提高评价效率;广泛的实验显示FlagEvalM(FlagEvalM)对模型的长处和局限性提供了准确而有效的洞察力,使其成为推进多式联运研究的宝贵工具;框架可在https://github.com/flageeval-baai/FlageEvalMM(https://githu.com/flageval-baai/FlagEvalMMM)上公开查阅。
Article 278
Title@2025-07-28 (1): Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI
Title: Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI | Narrative Context Protocol: Ein Open Source Storytelling Framework für generative KI | 叙述性背景议定书:开源的开源描述框架 2503.04844v5 |
Authors (1): Hank Gerba
Here we introduce Narrative Context Protocol (NCP), an open-source narrative standard designed to enable narrative interoperability, AI-driven authoring tools, real-time emergent narratives, and more. By encoding a story’s structure in a “Storyform,” which is a structured register of its narrative features, NCP enables narrative portability across systems as well as intent-based constraints for generative storytelling systems. We demonstrate the capabilities of NCP through a year-long experiment, during which an author used NCP and a custom authoring platform to create a playable, text-based experience based on her pre-existing novella. This experience is driven by generative AI, with unconstrained natural language input. NCP functions as a set of “guardrails” that allows the generative system to accommodate player agency while also ensuring that narrative context and coherence are maintained.
在此,我们引入了“叙述背景协议 ” ( NCP ) , 这是一种开放源代码叙述性标准, 目的是让叙述性互操作性、 AI驱动的作者工具、 实时突发描述等等成为可能。 通过将一个故事结构化的“故事形式 ” ( Tory Form) , 这是一种结构化的描述性特征登记册, NCP 使得跨系统的叙述性可移动性以及基因化叙事系统的用意限制得以实现。 我们通过一个长达一年的实验展示了NCP的能力, 在此期间, 作者利用了NCP 和一个定制的作者平台, 以她先前存在的小说为基础, 创造了一个可播放的、 基于文本的经验。 这种经验是由基因化的 AI 驱动的, 并有不加限制的自然语言输入。 NCP 功能是一套“ 保护性装置 ” , 使基因化系统能够容纳玩家机构, 同时确保描述性背景和一致性得到维护。
Article 279
Title@2025-07-28 (1): Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
Title: Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues | Schulung von LLM-basierten Tutoren zur Verbesserung der Lernergebnisse von Studierenden in Dialogen | 培训基于LLLM LLM的辅导员,以改善学生在对话中的学习成果 2503.06424v2 |
Authors (5): Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan
Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
培养人工智能(AI)有可能通过大型语言模式扩大个人化辅导(LLMs) 。最近的AI教义通过培训或促使LLMs遵循有效的教学原则,适应辅导任务,尽管他们没有经过培训,在整个对话过程中最大限度地提高学生的学习水平,因此,他们可以以次优方式与学生接触。我们采用培训LLMs的方法解决这一限制,以产生尽可能提高学生正确性的可能性的辅导讲义,同时仍然鼓励学习良好教学做法的模式。具体地说,我们制作了一套候选人辅导讲义,并用(1) 以LLM为基础的学生模型来预测学生作出正确反应的机会,(2) 由GPT-4o评估的教学图解。我们随后利用由此产生的数据来培训开源LM,Llama 3.1 8B,使用直接的优惠优化。我们展示了我们模型产生的讲义导,在保持GPT-4o教学质量的同时,使学生作出正确反应的机会大得多。我们还进行了定性分析和人文评价,以证明我们的模型产生了高质量的讲义。
Article 280
Title@2025-07-28 (1): LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
Title: LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems | LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen | LeMix:关于多功能保U系统的LLM培训和推理的LLM培训统一日程安排 2507.21276v1 |
Authors (4): Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu
Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.
现代使用大型语言模型(LLMS)经常涉及为适应不断演变的数据和用户反馈而对大型语言模型(LLMS)的现代部署的推论和持续再培训,以适应不断演变的数据和用户反馈。常见做法是将这些工作量分解到孤立的服务器上,造成效率极低(例如,GPU闲置)和对分布式环境中新数据的延迟调整。我们的经验分析表明,这些效率低下是由于在服务期间需求激增和工作量在管道平行培训中出现差异造成的。为了应对这些挑战,我们建议LeMix(一个同时分配和管理LLMMX服务和培训工作量的系统)共同定位和管理。LeMix(LeMix)整合了离线剖析、执行预测机制以及运行时间安排,以便根据工作量特点和系统条件动态调整资源分配。通过理解特定任务的行为和共同执行干扰,在共享节点之间提高利用率和工作质量,同时不损害响应能力。我们的评估表明,LMix(LeMix)的吞吐量增加至3.53x,将误损失降低到0.61x,并将预测损失降低2.12x的响应时间超过传统的SLOE实现传统单独设置和再利用我们的知识,从而探索资源环境。
Article 281
Title@2025-07-28 (1): Levels of Analysis for Large Language Models
Title: Levels of Analysis for Large Language Models | Analyseebenen für große Sprachmodelle | 大语言模式分析水平 2503.13401v2 |
Authors (13): Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on the levels of analysis that David Marr proposed for studying information processing systems. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
现代人工智能系统,如大型语言模型,越来越强大,但也越来越难以理解。我们认识到这个问题与理解人类思想的历史困难相似,因此认为认知科学开发的方法有助于理解大型语言模型。我们根据David Marr为研究信息处理系统而提出的分析水平,提出了一个应用这些方法的框架。我们通过重新研究与各级有关的既定认知科学技术,并展示其了解大型语言模型的行为和内部组织的潜力,我们的目标是提供一个工具,使这些新类型的思想变得有意义。
Article 282
Title@2025-07-28 (1): CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting
Title: CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting | CompoST: Ein Benchmark für die Analyse der Fähigkeit von LLMs, Fragen in einer QALD-Einstellung kompositorisch zu interpretieren | CompoST:在质量和限期设计中分析高管公司在组成上解释问题的能力的基准 2507.21257v1 |
Authors (3): David Maria Schmidt, Raoul Schubert, Philipp Cimiano
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they “understand” the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
语言解释是一个构成过程,从各个部分的含义中推断出更复杂的语言结构的含义。大型语言模型具有非凡的语言口译能力,并且成功地用于通过将语言翻译能力映射给SPARQL查询来解释问题。一个未决问题是这个解释过程是如何系统化的。关于这个问题,我们在本文件中提出一个基准,用于调查LLMM解释问题的能力在多大程度上实际上具有构成性。我们为此根据DBpedia的图表模式,根据Lemon lexica的口头化,生成了三套不同难度的数据集。我们的数据集是以一种非常受控制的方式创建的,以测试LLMs解释结构复杂问题的能力,因为它们已经看到原子构造块。这使我们能够评估LMS能够在多大程度上解释它们“能理解”原子部分的复杂问题。我们利用各种快速和微量优化技术以及微调来进行不同尺寸模型的实验。我们的结果显示,从0.45美元以上的LMs的计算结果,从0.26美元降至0.09美元,因此将数据从不断偏差的S.00_0.9美元,然后将所有数据压为最低的模型。
Article 283
Title@2025-07-28 (1): Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach
Title: Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach | Bangla BERT für hyperparteiische Nachrichtenerkennung: Ein halbüberwachter und erklärbarer KI-Ansatz | 超党派新闻探测孟加拉BERT:半监督和可解释的AI方法 2507.21242v1 |
Authors (6): Mohammad Mehadi Hasan, Fatema Binte Hassan, Md Al Jubair, Zobayer Ahmed, Sazzatul Yeakin, Md Masum Billah
In the current digital landscape, misinformation circulates rapidly, shaping public perception and causing societal divisions. It is difficult to identify hyperpartisan news in Bangla since there aren’t many sophisticated natural language processing methods available for this low-resource language. Without effective detection methods, biased content can spread unchecked, posing serious risks to informed discourse. To address this gap, our research fine-tunes Bangla BERT. This is a state-of-the-art transformer-based model, designed to enhance classification accuracy for hyperpartisan news. We evaluate its performance against traditional machine learning models and implement semi-supervised learning to enhance predictions further. Not only that, we use LIME to provide transparent explanations of the model’s decision-making process, which helps to build trust in its outcomes. With a remarkable accuracy score of 95.65%, Bangla BERT outperforms conventional approaches, according to our trial data. The findings of this study demonstrate the usefulness of transformer models even in environments with limited resources, which opens the door to further improvements in this area.
在当前的数字景观中,错误信息传播迅速,影响公众认识并造成社会分裂。 在孟加拉语中很难找到超党派新闻,因为对这种低资源语言来说,没有许多复杂的自然语言处理方法。 没有有效的检测方法,偏见的内容就会传播,对知情的言论构成严重风险。为了解决这一差距,我们的研究将孟加拉语BERT 进行微调,这是一个基于最新技术的变压器模型,目的是提高超党派新闻的分类准确性。我们对照传统机器学习模型评估其表现,并采用半监督的学习方法来进一步加强预测。我们不仅使用LIME来提供模型决策过程的透明解释,这有助于建立对其结果的信任。根据我们的实验数据,Bangla BERT的精确度高达95.65%,它超越了常规方法。这项研究的结果表明,即使在资源有限的环境中,变压器模型也非常有用,这为这一领域进一步改进打开了大门。
Article 284
Title@2025-07-28 (1): Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability
Title: Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability | Öffentliche Wahrnehmung der Kriminalität in Bangladesch verstehen: Ein transformerbasierter Ansatz mit Erklärbarkeit | 了解孟加拉国公众对犯罪的认识:基于变革和可解释的方法 2507.21234v1 |
Authors (6): Fatema Binte Hassan, Md Al Jubair, Mohammad Mehadi Hasan, Tahmid Hossain, S M Mehebubur Rahman Khan Shuvo, Mohammad Shamsul Arefin
In recent years, social media platforms have become prominent spaces for individuals to express their opinions on ongoing events, including criminal incidents. As a result, public sentiment can shift dynamically over time. This study investigates the evolving public perception of crime-related news by classifying user-generated comments into three categories: positive, negative, and neutral. A newly curated dataset comprising 28,528 Bangla-language social media comments was developed for this purpose. We propose a transformer-based model utilizing the XLM-RoBERTa Base architecture, which achieves a classification accuracy of 97%, outperforming existing state-of-the-art methods in Bangla sentiment analysis. To enhance model interpretability, explainable AI technique is employed to identify the most influential features driving sentiment classification. The results underscore the effectiveness of transformer-based models in processing low-resource languages such as Bengali and demonstrate their potential to extract actionable insights that can support public policy formulation and crime prevention strategies.
近年来,社交媒体平台已成为个人表达对当前事件(包括犯罪事件)意见的显著空间,因此公众情绪会随着时间的推移而发生动态变化。本研究通过将用户生成的评论分为积极、消极和中性这三类,调查公众对犯罪相关新闻不断变化的看法。为此开发了由28 528种孟加拉语社交媒体评论组成的新整理数据集。我们提议利用XLM-ROBERTA Base架构建立基于变压器的模型,该模型的分类精确度达到97%,优于孟加拉语情感分析的现有最新方法。为了提高模型可解释性,采用了可解释的AI技术来识别驱动情绪分类的最有影响力的特征。结果强调了基于变压器的模型在处理孟加拉语等低资源语言方面的有效性,并展示了这些变压器模式在获取可操作的洞察力方面的潜力,从而支持公共政策的制定和预防犯罪战略。
Article 285
Title@2025-07-28 (1): Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
Title: Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation | Multi-Agent-as-Judge: LLM-Agent-basierte automatisierte Evaluierung mit multidimensionaler menschlicher Bewertung ausrichten | 多边代理法官:将LLM-基于代理的自动评价与多层次的人力评价统一起来 2507.21028v1 |
Authors (9): Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging “LLM-as-a-judge” paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to Generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts’ ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
几乎所有的人类工作都是协作的;因此,对现实世界NLP应用的评估往往需要与人类不同视角相适应的多个层面。由于真正的人力资源往往稀缺且成本高昂,新兴的“LLM-as-a-judge”范式为利用LLM代理商进行令人信服的模拟人体评估提供了很有希望的办法。然而,到目前为止,现有的LLM-as-a-judge-s-instal 方法面临两个限制:对代理商的描述往往被任意设计,而框架又无法普遍用于其他任务。为了应对这些挑战,我们建议MAJ-EVAL,一个多机构-as-judge评价框架,可以自动建立多个具有相关文本文件不同层面的评价人(例如研究文件)、与人进行即时法LM代理商与人进行集体辩论,并与多机构进行集体辩论,以产生多维反馈。我们在教育和医疗领域的评价实验表明,MAJ-EVAL能够产生与人类专家评级相比与常规自动评价指标和现有LM-as-a-jud-a-a-judge方法更为一致的评价结果。
Article 286
Title@2025-07-28 (1): Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Title: Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation | Verbesserung der LLM-Vernunft mit iterativem DPO: Eine umfassende empirische Untersuchung | 与具有迭接作用的DPO:全面经验调查加强LLM 2503.12854v3 |
Authors (11): Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
最近大语言模型培训后方法的进展突出表明,强化学习是强化推理的一个关键组成部分,但是,基于学习模式的计算成本巨大,导致人们对其他模式的兴趣日益浓厚,例如直接优惠优化(DPO)等替代模式的兴趣日益浓厚。在本研究中,我们调查了DPO通过反复的基于优惠的学习促进LLMS自我改进的有效性。我们证明,单轮粗略过滤法的DPO大大提高了数学推理性能,特别是强健的基础模型。此外,我们为生成者和奖励模式设计了一个迭代强化框架,通过多轮DPO的在线互动,使得它们能够相互改进。最后,有了简单的可核实的奖励,我们的DPO-VP模型在计算间接费用上实现了RL水平的业绩,大大降低了计算费。这些结论强调DPO是可扩展的、成本效益高的替代RL,为在资源紧张的情况下加强LM推理提供了切实可行的解决办法。
Article 287
Title@2025-07-28 (1): Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Title: Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions | Bewertung des Versprechens und der Fälle von LLMs bei Hiring-Entscheidungen | 评估LLM女士在雇用决定中的许诺和机会 2507.02087v2 |
Authors (4): Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia
The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs - including models from OpenAI, Anthropic, Google, Meta, and Deepseek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs, (0.906 vs 0.773 for the intersectionals, respectively). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there shouldn’t be a dichotomy between choosing accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
大型语言模型(LLMS)在招聘时的使用有望简化候选人筛选程序,但也引起了人们对准确性和算法偏差的严重关切。 在这项工作中,我们将一些最先进的基础模型(包括OpenAI、Anthroopicic、Google、Meta和DeepSeek的模型)作为基准,并将它们与我们专有的域别特定招聘模式(Match Scord)相比,用于招聘候选人匹配。我们评估了每个模型的预测准确性(ROC ACUC、 Precision-Recall AUC、F1-Score)和公平性(在宣布的性别、种族和交叉分组之间缺乏足够保障的情况下,截断率和算分析的比重)以及公平性(在宣布的性别、种族、种族和交叉分组之间缺乏足够保障的情况下,我们对最先进的基本基本基本基本基本基本基本基本基本基本基本基本基本基本标准 — — 在招聘过程中,我们的标准(Ox906)和(BLMS)之间可以有效地进行准确性评估。
Article 288
Title@2025-07-28 (1): Memorization in Fine-Tuned Large Language Models
Title: Memorization in Fine-Tuned Large Language Models | Auswendiglernen in fein getönten großen Sprachmodellen | 微微调大语言模型的记忆 2507.21009v1 |
Authors (4): Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé
This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
这项研究调查了微调大型语言模型(LLMS)中影响记忆化的机制和因素,重点是医疗领域,因为其隐私敏感性。我们研究微调过程的不同方面如何影响模型对培训数据记忆化的倾向,使用PHEE 药理监督事件数据集。我们的研究采用两个主要方法:(1) 价值和产出矩阵对记忆化的贡献比Query和Key矩阵大得多;(2) 微调模型与增加记忆化相关联的难度降低;(3) 较高的LORA等级导致增加记忆化,但随着较高等级的回报减少。这些结果为低级别适应(LORA)微调的排名(LORA)提供了深刻的见解。这些结果为改进模型绩效和隐私风险之间的贸易影响提供了深刻的见解。
Article 289
Title@2025-07-28 (1): LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
Title: LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning | LoRA-PAR: Ein flexibler Dual-System-LoRA-Partitionsansatz für effizientes LLM-Feintuning | LOLAR-PAR:高效 LLM 微调的灵活双系统滚动分割法 2507.20999v1 |
Authors (4): Yining Huang, Bin Li, Keke Tang, Meilian Chen
Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought-System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic)-we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring, then adopt a two-stage fine-tuning strategy of training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition and refine System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation next. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
DeepSeek-R1 和 OpenAI-O1 等大型基因模型从思维链推理(CoT)的推理中获益匪浅,但推推推其性能通常需要大量数据、大模型尺寸和全参数微调。尽管参数效率微调(PEFT)有助于降低成本,但大多数现有方法主要针对领域适应或分层,而不是明确根据不同响应需求调整数据和参数。受“思维、快速和慢”的启发,它具有两种不同的思维-系统(快速、直观、往往自动)和系统2(更低、更具审议性和分析性)使用两种不同模式,而系统2(更低、更低、更具审议性和分析性)和系统2(较低)的精度使用更集中的参数,我们得出一个类比,即LLMM参数的不同“子区域”可能同样专门用于需要快速、直观反应或多步逻辑分配的数据,而不是需要逻辑推理的数据和参数。因此,我们建议LAR-PAR-PAR-PAR系统系统将数据和参数按照系统1或系统更低的精细的精细的精细的比值比,用更集中的参数对每个任务进行更集中的参数的参数调整,我们通过双级、更精细的S-级、更精细的S-级、更精细的S-级、更精细的SL-级、更精细的S-级、更精细的S-级、更精细的S-级、更精细的比、更精细的S-级、更精细的校化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级化-级-级-级化-级-级-级-级化-级化-级化-级化-级化-级化-级化-级化-级-级-级化-级化-级调制-级-级-级-级-级调制-级-级-级-级化-级化-级-级-级-级-级-级-级-级-级-级
Article 290
Title@2025-07-28 (1): GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
Title: GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding | GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding | GUI-G$$2美元:GUI地基的高斯奖赏模型 2507.15846v3 |
Authors (12): Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
图形用户界面( GUI) 绘制自然语言指示, 以精确的界面位置进行自主互动。 当前强化学习方法使用将元素作为目标目标处理的二进制奖赏, 创建忽略空间互动连续性的微弱信号。 我们受自然以目标元素为核心的高斯分布的人类点击行为驱动, 我们引入了GUI Gausian 定位奖项( GUI- G$2$), 一个原则奖赏框架, 将图形界面元素作为连续的高斯分布在界面中。 GUI- G$2$ 包含两个协同机制: 高斯点奖赏模型, 通过元素固醇的快速衰减版版化分布, 创建零星点的精确本地化模型, 覆盖点评估空间一致性, 通过测量预测高点分布和目标区域之间的重叠。 为了处理不同元素尺度, 我们开发了一个适应性差异机制, 校准基于元素维度的分布。 这个框架将GUIGI从稀少的二级分类到密集的连续优化优化优化。 校正的分布产生丰富的梯度信号信号信号信号, 向最优化的互动定位定位定位定位定位定位 $PROSQS- breal- browst- browst- grealmamamamas
Article 291
Title@2025-07-28 (1): Scaling Physical Reasoning with the PHYSICS Dataset
Title: Scaling Physical Reasoning with the PHYSICS Dataset | Skalierung der physikalischen Vernunft mit dem PHYSICS-Datensatz | 利用PHYSICS数据集调整物理理由 2506.00022v3 |
Authors (12): Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model’s physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
大型语言模型(LLMS)在数学和编码竞赛等高级推理任务方面取得了显著的进展,与此同时,物理学尽管在推理上是密集的,对现实世界的理解也至关重要,但得到的学术和工业关注有限,本文件介绍了PHYSICS, 这是一个包含16 568个主题和困难层次的高质量物理学问题的数据集,它包含16 568个主题和困难层次的高质量物理学问题,为这一问题提供了便利。具体地说,PHYSICS通过精心设计的质量控制管道,从100多本教科书的练习中得到整理。它涵盖五个主要物理领域:机械学、电磁学、热力学、光学和现代物理学。它也涉及从高中到研究生物理课程等广泛的困难程度。为了利用数据来改进和评价模型的物理推理能力,我们将数据集分成培训和测试组,并为培训数据提供强大的推理模型产生的推理路径。此外,我们发现现有的评价框架在物理领域的单位、简化和精确度等方面显示出偏向。为了平衡和准确性,我们采用规则+模型的现行物理标准评估框架,我们目前对物理物理模型的正确评估框架进行评估。
Article 292
Title@2025-07-28 (1): Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands
Title: Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands | Cog-TiPRO: Iterative Prompt-Verfeinerung mit LLMs zur Erkennung kognitiver Deklination über Longitudinal Voice Assistant-Befehle | COg-TiPRO:与LLMs一起与LLMs进行自动迅速改进,以便通过纵向语音助理指挥部检测认知衰减 2505.17137v2 |
Authors (5): Kristin Qi, Youxiang Zhu, Caroline Summerour, John A. Batsis, Xiaohui Liang
Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.
早期发现认知衰落对于能够减缓神经退化性疾病蔓延的干预至关重要。传统的诊断方法依赖劳动密集型临床评估,而这种评估对频繁监测是不切实际的。我们的试点研究将语音助理系统(VAS)作为非侵入性工具,通过对语音指令中的语音模式进行纵向分析来调查认知衰落。在18个月期间,我们收集了35个老年人的语音指令,15名参与者每天在家中提供VAS互动。为了应对分析这些短、无结构、吵闹命令的挑战,我们建议Cog-TiPRO,这是一个将(1)LLM驱动的语言特征提取迭代快速完善、(2)基于HuBERT的声学特征提取和(3)基于变压器的时间模型相结合的框架。我们使用 Transinform,在检测MCI方面实现了73.80%的精度和72.67%的F1芯,比其基线高出27.13%。我们通过LM方法,确定了在认知衰落的个人日常指令使用模式的独特语言特征。
Article 293
Title@2025-07-28 (1): A Survey of Deep Learning for Geometry Problem Solving
Title: A Survey of Deep Learning for Geometry Problem Solving | Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen | 解决几何问题深层学习调查 2507.11936v4 |
Authors (3): Jianzhe Ma, Wenxuan Wang, Qin Jin
Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
解决几何问题是数学推理的一个关键领域,它广泛涉及许多重要领域,例如教育、人工智能数学能力评估和多式联运能力评估。近年来,深层次学习技术的迅速发展,特别是多式联运大型语言模型的兴起,引发了广泛的研究繁荣。本文调查了深层次学习在解决几何问题方面的应用,包括:(一) 全面概述几何问题解决中的相关任务;(二) 彻底审查相关的深层次学习方法;(三) 详细分析评价指标和方法;(四) 批判性地讨论目前的挑战和今后可探讨的方向。我们的目标是为解决几何问题的深层次学习提供全面和实用的参考,以促进该领域的进一步发展。我们不断更新关于GitHub的文件清单:https://github.com/majianz/dl4gps。
Article 294
Title@2025-07-28 (1): Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning
Title: Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning | Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen | 通过差异学习发现多语种轻视认知缺陷的单形多语种描述 2505.17067v3 |
Authors (5): Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang
Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.
从图片描述中检测出微弱的认知障碍,特别是在多语种和多种图片环境中,是关键但具有挑战性的。先前的工作主要侧重于描述单一图片(如“Cookie Theft”)的讲英语者。TAUKDIAL 2024挑战通过引入多语种语言者和多图片扩大了这一范围,这在分析依赖图片的内容方面提出了新的挑战。为了应对这些挑战,我们提议了一个包含三个组成部分的框架:(1) 通过监督对比学习加强有区别的代表性学习,(2) 涉及图像模式,而不仅仅是语言和文本模式,(3) 应用专家产品(PoE)战略来减少虚假的关联和过度匹配。我们的框架提高了MCI的检测性能,实现了未加权平均回调+7.1%(UAR)(从68.1%增加到75.2%),与文本单式基线相比,F1分(从80.6%增加到83.5%)增加了2.9%。值得注意的是,对比学习部分的文本模式比语音和多语种识别效果更大。这些结果突出表明了我们的框架在多语种和多图像中的有效性。
Article 295
Title@2025-07-28 (1): Your AI, Not Your View: The Bias of LLMs in Investment Analysis
Title: Your AI, Not Your View: The Bias of LLMs in Investment Analysis | Ihre KI, nicht Ihre Ansicht: Die Bias von LLMs in der Investitionsanalyse | 您的AI, 而不是您的观点: 投资分析中LLM 的偏见 2507.20957v1 |
Authors (8): Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee
In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model’s embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an experimental framework to investigate such conflicts, offering the first quantitative analysis of confirmation bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract models’ latent preferences and measure their persistence. Focusing on sector, size, and momentum, our analysis reveals distinct, model-specific tendencies. In particular, we observe a consistent preference for large-cap stocks and contrarian strategies across most models. These preferences often harden into confirmation bias, with models clinging to initial judgments despite counter-evidence.
在金融方面,大语言模型(LLMs)由于预先培训的参数知识与实时市场数据之间的差异而经常面临知识冲突。当LLMs被部署在现实世界的投资服务中时,这些冲突特别成问题,因为模型的内在偏好与金融机构的偏好不匹配可能导致不可靠的建议。然而,研究很少研究投资观点中LLMs实际上持有什么样的观点。我们提议了一个实验框架来调查这种冲突,在基于LLM的投资分析中首次对确认偏差进行定量分析。我们利用平衡和不平衡的假设假设情景,提取模型的潜在偏好并衡量其持久性。我们的分析以部门、规模和势头为重点,揭示了不同的模式趋势。特别是,我们观察到了对大头股票和反面战略的一贯偏好。这些偏好往往硬化为确认偏差,模型不顾反证证据坚持初步判断。
Article 296
Title@2025-07-28 (1): Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models
Title: Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models | Mind the Gap: Konformative Dekodierung zur Verbesserung der Output-Vielfalt von instruction-tuned großen Sprachmodellen | 注意差距:改进教学型大语言模式产出多样性的合规化配方 2507.20956v1 |
Authors (4): Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the ``diversity gap’’ for a writing prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.
教学调整大型语言模型(LLMS)减少了其产出的多样性,这影响到许多任务,特别是创造性任务。本文件调查“多样性差距”对于写作快速叙述生成任务的影响。这一差距以当前各种开放重量和开放源码LMS的多样性指标来衡量。结果显示由于教学调整,多样性显著下降。我们探索了OLMO和OLMO 2模型每个微调阶段的多样性损失,以进一步了解产出多样性是如何受到影响的。结果显示DPO对多样性的影响最大。根据这些发现,我们提出了一个新的解码战略,符合要求的解码,用以指导使用更多样化的基础模型的教学模式重新引入产出多样性。我们显示,兼容的解码通常会增加多样性,甚至保持或改进质量。
Article 297
Title@2025-07-28 (1): Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
Title: Dissecting Persona-Driven Reasoning in Language Models via Activation Patching | Persona-Driven Reasoning in Sprachmodellen per Aktivierungs-Patching auflösen | 通过激活补丁在语言模型中通过激活补丁解剖人-人-驱动原因 2507.20936v1 |
Authors (2): Ansh Poonia, Maeghal Jain
Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.
大语言模型(LLMS)在采用多种人方面表现出非凡的多功能性。 在这次研究中, 我们研究一个人的指派如何影响模型对客观任务的推理。 使用激活补丁, 我们迈出第一步, 了解模型的关键组成部分如何将特定个人的信息编码。 我们的研究结果显示, 早期的多语言跨视谱层不仅关注输入的合成结构, 也处理其语义内容。 这些层将个人符号转换为更丰富的表达方式, 然后由中等多极关注层( MHA) 用来塑造模型的输出。 此外, 我们确定那些不相称地关注种族和肤色身份的焦点 。
Article 298
Title@2025-07-28 (1): LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking
Title: LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking | LLM2TEA: Agentischer AI-Designer für Entdeckung mit generativem evolutionären Multitasking | LLM2TEA: 利用产生进化多任务探索的代理AI 设计器 2406.14917v3 |
Authors (5): Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong
This paper presents LLM2TEA, a Large Language Model (LLM) driven MultiTask Evolutionary Algorithm, representing the first agentic AI designer of its kind operating with generative evolutionary multitasking (GEM). LLM2TEA enables the crossbreeding of solutions from multiple domains, fostering novel solutions that transcend disciplinary boundaries. Of particular interest is the ability to discover designs that are both novel and conforming to real-world physical specifications. LLM2TEA comprises an LLM to generate genotype samples from text prompts describing target objects, a text-to-3D generative model to produce corresponding phenotypes, a classifier to interpret its semantic representations, and a computational simulator to assess its physical properties. Novel LLM-based multitask evolutionary operators are introduced to guide the search towards high-performing, practically viable designs. Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, showing 97% to 174% improvements in the diversity of novel designs over the current text-to-3D baseline. Moreover, over 73% of the generated designs outperform the top 1% of designs produced by the text-to-3D baseline in terms of physical performance. The designs produced by LLM2TEA are not only aesthetically creative but also functional in real-world contexts. Several of these designs have been successfully 3D printed, demonstrating the ability of our approach to transform AI-generated outputs into tangible, physical designs. These designs underscore the potential of LLM2TEA as a powerful tool for complex design optimization and discovery, capable of producing novel and physically viable designs.
本文展示了LLM2TEA(LLM2TEA),这是一个大型语言模型(LLM)驱动的多语种进化演化算法,代表了首个使用基因进化多任务(GEM)的AI代理设计师。LLM2TEA(LLM2TEA)能够交叉利用多个域的解决方案,促进超越学科界限的新解决方案。特别令人感兴趣的是能够发现既新颖又符合现实世界物理规格的设计。LLM2TEA(LM)包含一个LM(LM),用来从描述目标对象的文本提示中生成基因型样本的基因样本,一个文本到3D(D)的基因化模型,用来生成相应的字符型类型,一个用于解释其语义表达的物理结构。LM(LM)多功能演化操作员的计算模拟器,用来指导寻找高性能、实际可行的设计。LM2TEA(LM)的实验性能优化验证了LM2TEA(LMTEA)的效能,显示在目前文本至3D基线的新型设计上97%至174。此外智能设计中的73%(D)的精精精化能力,通过印刷设计,也显示了SD(D)的精制成了SD(D)的精制成)的精制的精制的精制成的精制的精制成的精制版图)。
Article 299
Title@2025-07-28 (1): FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models
Title: FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models | FHSTP@EXIST 2025 Benchmark: Sexismuserkennung mit transparenten Sprachkonzepten Engpassmodelle | FHSTP@EXIST 2025 基准:用透明言论概念瓶颈模型探测性别主义 2507.20924v1 |
Authors (6): Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer
Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge is initiated at CLEF 2025. Among this year’s international benchmarks, we concentrate on solving the first task aiming to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to encode input texts into a human-interpretable representation of adjectives, then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators’ demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.
为了解决这一问题,在2025年CLEF中启动了第五个社会网络性别识别(EXIST)挑战。在今年的国际基准中,我们集中解决第一个旨在识别和分类社交媒体文本文章中的性别主义的任务。在本文中,我们描述了我们的三个子任务的解决办法和报告结果:Tweets的Subtask 1.1 - Suptask 1.2 - Tweet的性别识别;Subtask 1.2 - Tweets 的源源识别;和 Subtask 1.3 - Tweet的性别分类。我们实施了三个模型来解决构成三个单个运行的每一个子任务: Speople Notleeck 模型(SCBMM)、 Speople Noteck 模型(SCBMM) 和一个经过精细调的 XLM-ROBM 变异模型。SBM 用于人与人之间交替的瓶概念概念概念概念。SBMFlocketrial-dealal dealations 和Slational-lational-deal-deal-deal-demodeal dealal lavelmental ladeal deal deal ladal exal ladeal dal listrations. Scal 和Smal-deal-s 和Slational-SBildal-s ial-s 和Slational-s ibal-s lautal-dealtistrational-s 和Slational-deal-s 和Slational-在英国的Slational-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-deal-在英国数据级别上,在英国数据级数据上,在SB) 和SB) 和SB) 和SBI-SB-SB-SBal 上可以使用两个级数据上提供提供提供。S-S-SBD-S-S-SBD-SBD-SBD-SBD-SBD-SB 和SBD-SB
Article 300
Title@2025-07-28 (1): MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
Title: MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation | MediQAl: Eine französische medizinische Frage zur Beantwortung von Datensätzen für Wissens- und Begründungsbewertung | MediQAl:用于知识和合理评估的法国医学问题解答数据集 2507.20917v1 |
Authors (1): Adrien Bazoge
This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.
nan
Article 301
Title@2025-07-28 (1): Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Title: Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models | Benchmarking Open-Ended Audio Dialogue Understanding für große Audio-Language-Modelle | 确定大型音频语言模型不限成员名额音频对话理解基准 2412.05167v2 |
Authors (5): Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu
Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in the open-ended audio dialogue understanding remains absent currently. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability for LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we firstly propose the evaluation of ambiguity handling in audio dialogues that expresses different intentions beyond the same literal meaning of sentences, e.g., “Really!?” with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.
nan
Article 302
Title@2025-07-28 (1): Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?
Title: Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery? | Sollte Top-Down-Clustering Grenzen in unüberwachten Word Discovery beeinflussen? | 在无人监督的“发现字”中, 上下层群集是否应该影响边界? 2507.19204v2 |
Authors (3): Simon Malan, Benjamin van Niekerk, Herman Kamper
We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.
nan
Article 303
Title@2025-07-28 (1): $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
Title: $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement | $A^2R^2$: Verbesserung der Img2LaTeX-Umwandlung durch visuelles Reasoning mit aufmerksamkeitsgeführter Verfeinerung | $A2R2美元:通过关注引导的精炼,通过视觉理性推进Img2LaTeX转换 2507.20890v1 |
Authors (6): Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
nan
Article 304
Title@2025-07-28 (1): Enhancing Project-Specific Code Completion by Inferring Internal API Information
Title: Enhancing Project-Specific Code Completion by Inferring Internal API Information | Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen | 通过推断内部API信息加强具体项目法规的完成 2507.20888v1 |
Authors (6): Le Deng, Xiaoxue Ren, Chao Ni, Ming Liang, David Lo, Zhongxin Liu
Project-specific code completion is a critical task that leverages context from a project to generate accurate code. State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports. Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions. We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%.
nan
Article 305
Title@2025-07-28 (1): Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings
Title: Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings | Nutzung von Open-Source-Großsprachenmodellen für die Extraktion klinischer Informationen in ressourcenbeschränkten Einstellungen | 利用开放源码大语言模型,在受资源限制的环境下进行临床信息采掘 2507.20859v1 |
Authors (5): Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering
Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed \texttt{llm_extractinator}, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14 billion parameter models, Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B, achieved competitive results, while the bigger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need of native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.
nan
Article 306
Title@2025-07-28 (1): A survey of diversity quantification in natural language processing: The why, what, where and how
Title: A survey of diversity quantification in natural language processing: The why, what, where and how | Eine Übersicht der Diversitätsquantifizierung in der natürlichen Sprachverarbeitung: Das Warum, Was, Wo und Wie | 自然语言处理中多样性量化调查:原因、内容、地点和方式 2507.20858v1 |
Authors (5): Louis Estève, Marie-Catherine de Marneffe, Nurit Melnik, Agata Savary, Olha Kanishcheva
The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations like promoting and inclusion, approximating human linguistic behavior, and increasing systems’ performance. Diversity has however often been addressed in an ad hoc manner in NLP, and with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years, with “diversity” or “diverse” in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, what on, where, and how diversity is measured in NLP. Diversity measures are cast upon a unified framework from ecology and economy (Stirling, 2007) with 3 dimensions of diversity: variety, balance and disparity. We discuss the trends which emerge due to this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and a better comparability between various approaches.
nan
Article 307
Title@2025-07-28 (1): Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities
Title: Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities | Sprachenmodellierung für die Zukunft der Finanzen: Eine Umfrage zu Metrics, Aufgaben und Datenmöglichkeiten | 未来融资语言建模:计量、任务和数据机会调查 2504.07274v2 |
Authors (4): Nikita Tatarinov, Siddhant Sukhani, Agam Shah, Sudheer Chava
Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, and our study identifies the following opportunities: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with financial metrics; (iii) leveraging multilingual and crisis-period datasets; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions for research and practice, supported by dataset and tool recommendations, with implications for both the academia and industry communities.
nan
Article 308
Title@2025-07-28 (1): Latent Inter-User Difference Modeling for LLM Personalization
Title: Latent Inter-User Difference Modeling for LLM Personalization | Latent Inter-User Difference Modeling für LLM Personalisierung | LLM个性化不同模型 2507.20849v1 |
Authors (6): Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
Large language models (LLMs) are increasingly integrated into users’ daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user’s own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user’s embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.
nan
Article 309
Title@2025-07-28 (1): Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models
Title: Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models | Kritik des unreinen Grundes: Enthüllen des Argumentationsverhaltens medizinischer Großsprachenmodelle | 简便理由的批评:统一医学大语言模式的推理行为 2412.15748v2 |
Authors (2): Shamus Sim, Tyrone Chen
Background: Despite the current ubiquity of Large Language Models (LLMs) across the medical domain, there is a surprising lack of studies which address their reasoning behaviour. We emphasise the importance of understanding reasoning behaviour as opposed to high-level prediction accuracies, since it is equivalent to explainable AI (XAI) in this context. In particular, achieving XAI in medical LLMs used in the clinical domain will have a significant impact across the healthcare sector. Results: Therefore, in this work, we adapt the existing concept of reasoning behaviour and articulate its interpretation within the specific context of medical LLMs. We survey and categorise current state-of-the-art approaches for modeling and evaluating reasoning reasoning in medical LLMs. Additionally, we propose theoretical frameworks which can empower medical professionals or machine learning engineers to gain insight into the low-level reasoning operations of these previously obscure models. We also outline key open challenges facing the development of Large Reasoning Models. Conclusion: The subsequent increased transparency and trust in medical machine learning models by clinicians as well as patients will accelerate the integration, application as well as further development of medical AI for the healthcare system as a whole.
nan
Article 310
Title@2025-07-28 (1): FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
Title: FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings | FocalPO: Verbesserung der Preference-Optimierung durch Fokussierung auf korrekte Preference-Rankings | 重点:通过注重正确的优先排序,加强优惠优化 2501.06645v3 |
Authors (5): Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp
Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work~\citep{chen2024preference} empirically finds that DPO training \textit{rarely improves these misranked preference pairs}, despite its gradient emphasizing on these cases. We introduce FocalPO, a DPO variant that instead \textit{down-weighs} misranked preference pairs and prioritizes enhancing the model’s understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiment demonstrates that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter fixed. Additionally, we empirically reveals how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
nan
Article 311
Title@2025-07-28 (1): Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models
Title: Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models | Automatisieren der thematischen Überprüfung der Prävention von zukünftigen Todesfällen Berichte: Nachahmung der ONS-Kinder-Selbstmord-Studie mit großen Sprachmodellen | 对预防今后死亡报告进行自动化专题审查:利用大语言模式复制ONS儿童自杀研究 2507.20786v1 |
Authors (5): Sam Osian, Arpan Dutta, Sahil Bhandari, Iain E. Buchan, Dan W. Joyce
Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports ($\leq$ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source “text-to-table” language-model pipeline (PFD Toolkit) could reproduce the ONS’s identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit’s large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen’s $\kappa$ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
nan
Article 312
Title@2025-07-28 (1): On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Title: On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey | Über die Rolle von vorgebildeten Sprachmodellen in allgemeinen Text-Embeddings: Eine Umfrage | 关于 “ 预先培训的语言模式在一般用途文本嵌入中所起的作用:调查 “ 2507.20783v1 |
Authors (6): Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
nan
Article 313
Title@2025-07-28 (1): TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks
Title: TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks | TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks | TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架 2507.18190v2 |
Authors (7): Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao
Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.
nan
Article 314
Title@2025-07-28 (1): The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints
Title: The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints | Die Auswirkungen von LoRA-Adaptern auf LLMs für die klinische Textklassifikation unter Computational und Data Constraints | LoRA适应器对在计算和数据限制下临床文本分类的LLMs的影响 2407.19299v3 |
Authors (6): Thanh-Dung Le, Ti Ti Nguyen, Vu Nguyen Ha, Symeon Chatzinotas, Philippe Jouvet, Rita Noumeir
Fine-tuning Large Language Models (LLMs) for clinical Natural Language Processing (NLP) poses significant challenges due to domain gap, limited data, and stringent hardware constraints. In this study, we evaluate four adapter techniques-Adapter, Lightweight, TinyAttention, and Gated Residual Network (GRN) - equivalent to Low-Rank Adaptation (LoRA), for clinical note classification under real-world, resource-constrained conditions. All experiments were conducted on a single NVIDIA Quadro P620 GPU (2 GB VRAM, 512 CUDA cores, 1.386 TFLOPS FP32), limiting batch sizes to <8 sequences and maximum sequence length to 256 tokens. Our clinical corpus comprises only 580 000 tokens, several orders of magnitude smaller than standard LLM pre-training datasets. We fine-tuned three biomedical pre-trained LLMs (CamemBERT-bio, AliBERT, DrBERT) and two lightweight Transformer models trained from scratch. Results show that 1) adapter structures provide no consistent gains when fine-tuning biomedical LLMs under these constraints, and 2) simpler Transformers, with minimal parameter counts and training times under six hours, outperform adapter-augmented LLMs, which required over 1000 GPU-hours. Among adapters, GRN achieved the best metrics (accuracy, precision, recall, F1 = 0.88). These findings demonstrate that, in low-resource clinical settings with limited data and compute, lightweight Transformers trained from scratch offer a more practical and efficient solution than large LLMs, while GRN remains a viable adapter choice when minimal adaptation is needed.
nan
Article 315
Title@2025-07-28 (1): Multilingual Self-Taught Faithfulness Evaluators
Title: Multilingual Self-Taught Faithfulness Evaluators | Mehrsprachige Selbstlernende Bewertung von Treue | 多语言自学自学信仰评价员 2507.20752v1 |
Authors (6): Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico
The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
nan
Article 316
Title@2025-07-28 (1): Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
Title: Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study | Untersuchung struktureller Pruning- und Recovery-Techniken zur Komprimierung multimodaler Großsprachenmodelle: Eine empirische Studie | 压缩多式大语言模式结构保护和恢复调查技术:经验研究 2507.20749v1 |
Authors (5): Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms–layerwise and widthwise pruning–applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.
nan
Article 317
Title@2025-07-28 (1): Everything is a Video: Unifying Modalities through Next-Frame Prediction
Title: Everything is a Video: Unifying Modalities through Next-Frame Prediction | Alles ist ein Video: Vereinheitlichen von Modalitäten durch Next-Frame-Vorhersage | 一切都是一部视频:通过下框架预测实现统一的方式 2411.10503v2 |
Authors (7): G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed
Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model’s ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
nan
Article 318
Title@2025-07-28 (1): Group Sequence Policy Optimization
Title: Group Sequence Policy Optimization | Optimierung der Gruppensequenzpolitik | 组序列政策优化 2507.18071v2 |
Authors (12): Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
nan
Article 319
Title@2025-07-28 (1): Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
Title: Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models | Text2VLM: Anpassung von Text-Only-Datensätzen an die Auswertung von Alignment-Trainings in visuellen Sprachmodellen | Text2VLM: 调整纯文本数据集以评价视觉语言模型的对齐培训 2507.20704v1 |
Authors (4): Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under evaluated. To address this gap, we propose \textbf{Text2VLM}, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring the alignment of extracted salient concepts; text summarization and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.
nan
Article 320
Title@2025-07-28 (1): Computational Analysis of Character Development in Holocaust Testimonies
Title: Computational Analysis of Character Development in Holocaust Testimonies | Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen | 大屠杀证词特征发展计算分析 2412.17063v4 |
Authors (4): Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend
This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
nan
Article 321
Title@2025-07-28 (1): When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Title: When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification | Wenn Scale auf Vielfalt trifft: Bewertung von Sprachmodellen auf feinkörnige Mehrsprachigkeitsprüfung | 规模达到多样性时:评价精细多语言索赔核实的语言模式 2507.20700v1 |
Authors (4): Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith
The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8% improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
nan
Article 322
Title@2025-07-28 (1): Geometric-Mean Policy Optimization
Title: Geometric-Mean Policy Optimization | Geometrisch-Mean-Policy-Optimierung | 几何海洋政策优化 2507.20673v1 |
Authors (12): Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
nan
Article 323
Title@2025-07-28 (1): Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs
Title: Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs | Benchmarking Graph Neural Networks für die Dokumentenlayout-Analyse in öffentlichen Angelegenheiten | 用于公共事务文件布局分析的图表神经网络 2505.14699v2 |
Authors (6): Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Ruben Tolosana, Oscar Delgado-Mohatar, Alvaro Ortigosa
The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
nan
Article 324
Title@2025-07-28 (1): Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study
Title: Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study | Nachweis von unerwünschten Arzneimittelereignissen in niederländischen klinischen Textdokumenten mit Transformer-Modellen: Benchmark-Studie | 利用变换模型发现荷兰临床免费文本文件中的不良毒品事件:基准研究 2507.19396v2 |
Authors (8): Rachel M. Murphy, Nishant Mishra, Nicolette F. de Keizer, Dave A. Dongelmans, Kitty J. Jager, Ameen Abu-Hanna, Joanna E. Klopotowska, Iacer Calixto
In this study, we establish a benchmark for adverse drug event (ADE) detection in Dutch clinical free-text documents using several transformer models, clinical scenarios, and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa(.)nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free-text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using the gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated for detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the dataset imbalance in ADEs. Although differences for the ADE RC task between the models were small, MedRoBERTa(.)nl was the best performing model with a macro-averaged F1 score of 0.63 using the gold standard and 0.62 using predicted entities. The MedRoBERTa(.)nl models also performed the best in our external validation and achieved a recall of between 0.67 to 0.74 using predicted entities, meaning between 67 to 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free-text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.
nan
Article 325
Title@2025-07-28 (1): Ontology-Enhanced Knowledge Graph Completion using Large Language Models
Title: Ontology-Enhanced Knowledge Graph Completion using Large Language Models | Ontologie-erweiterte Wissensgraphenvervollständigung mit großen Sprachmodellen | 利用大语言模式完成本部强化知识图 2507.20643v1 |
Authors (5): Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen
Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that needs to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.
nan
Article 326
Title@2025-07-28 (1): Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling | Erklärbare Synthetische Bilderkennung durch Diffusionszeitpunkt Zusammenbauen | 通过传播时间步骤组合进行可解释的合成图像探测 2503.06201v2 |
Authors (10): Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, in the forms of such as Fourier power spectrum high-frequency discrepancies and inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.
nan
Article 327
Title@2025-07-28 (1): Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior
Title: Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior | Vor der Empörung: Herausforderungen und Fortschritte bei der Vorhersage von Online-Antisozialverhalten | 暴政前:预测在线反社会行为的挑战和进展 2507.20614v1 |
Authors (1): Anaïs Ollagnier
Antisocial behavior (ASB) on social media-including hate speech, harassment, and trolling-poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors-such as hate speech propagation, conversation derailment, or user recidivism-before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques-from classical machine learning to pre-trained language models-and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.
nan
Article 328
Title@2025-07-28 (1): AutoLibra: Agent Metric Induction from Open-Ended Feedback
Title: AutoLibra: Agent Metric Induction from Open-Ended Feedback | AutoLibra: Agent Metric Induktion aus offenem Feedback | AutoLibra: 不限名额反馈的计量介绍代理 2505.02820v2 |
Authors (6): Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback e.g. “If you find that the button is disabled, don’t click it again”, or “This agent has too much autonomy to decide what to do on its own” into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
nan
Article 329
Title@2025-07-28 (1): ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
Title: ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning | ZSE-Cap: Ein Zero-Shot-Ensemble für Bildwiederherstellung und Prompt-Führung | ZSE-Cap: 用于图像检索和即时指导说明的零热组合 2507.20564v1 |
Authors (2): Duc-Tai Dinh, Duc Anh Khoa Dinh
We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition’s data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
nan
Article 330
Title@2025-07-28 (1): Enhancing Hallucination Detection via Future Context
Title: Enhancing Hallucination Detection via Future Context | Halluzinationserkennung durch zukünftigen Kontext verbessern | 通过未来环境加强幻觉探测 2507.20546v1 |
Authors (6): Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo
Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.
nan
Article 331
Title@2025-07-28 (1): From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought
Title: From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought | Von Antworten zu Rationalen: Selbstjustierung multimodaler Vernunft mit answer-oriented Chain-of-Thought | 从答案到理由:自调整的多式联运理由与以回答为主的探索链 2507.02984v2 |
Authors (5): Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model’s generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: \textbf{S}elf-Aligning \textbf{M}ultimodal Reasoning with \textbf{A}nswer-O\textbf{r}iented Chain-of-\textbf{T}hought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with incorrect alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model’s reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code is available at https://github.com/WentaoTan/SMART.
nan
Article 332
Title@2025-07-28 (1): Kimi K2: Open Agentic Intelligence
Title: Kimi K2: Open Agentic Intelligence | Kimi K2: Offene Agentische Intelligenz | Kimi K2:开放特工情报 2507.20534v1 |
Authors (169): Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, Xinxing Zu
We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
nan
Article 333
Title@2025-07-28 (1): SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Title: SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law | SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz | 安全工作-R1:根据AI-45$ circ}$ 法发展安全和情报 2507.18576v2 |
Authors (118): Shanghai AI Lab, :, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha’ moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
nan
Article 334
Title@2025-07-28 (1): Otter: A Multi-Modal Model with In-Context Instruction Tuning
Title: Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter: Ein Multi-Modal-Modell mit In-Context-Anleitung Tuning | Ottter:具有内文指导图纸的多模式模型 2305.03726v2 |
Authors (8): Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu
Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the \textbf{Otter} model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the \textbf{MIMIC-IT} (\textbf{M}ult\textbf{I}-\textbf{M}odal \textbf{I}n-\textbf{C}ontext \textbf{I}nstruction \textbf{T}uning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.
nan
Article 335
Title@2025-07-28 (1): Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations
Title: Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations | Dialoge von Dissent: Thematische und rhetorische Dimensionen von Hass und Gegenhass in Social Media-Gesprächen | 不同意见对话:社会媒体对话中的仇恨和反仇恨言论的主题和风湿方面 2507.20528v1 |
Authors (4): Effi Levi, Gal Ron, Odelia Oshri, Shaul R. Shenhav
We introduce a novel multi-labeled scheme for joint annotation of hate and counter-hate speech in social media conversations, categorizing hate and counter-hate messages into thematic and rhetorical dimensions. The thematic categories outline different discursive aspects of each type of speech, while the rhetorical dimension captures how hate and counter messages are communicated, drawing on Aristotle’s Logos, Ethos and Pathos. We annotate a sample of 92 conversations, consisting of 720 tweets, and conduct statistical analyses, incorporating public metrics, to explore patterns of interaction between the thematic and rhetorical dimensions within and between hate and counter-hate speech. Our findings provide insights into the spread of hate messages on social media, the strategies used to counter them, and their potential impact on online behavior.
nan
Article 336
Title@2025-07-28 (1): Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Title: Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards | Versehentliche Sicherheitslücke: Faktoren bei Feinsteuerung, die das Modell schützen | 意外脆弱性:改变模式保障保障措施的微调因素 2505.16789v2 |
Authors (4): Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
nan
Article 337
Title@2025-07-28 (1): Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Title: Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition | Sicherheitsherausforderungen bei der Bereitstellung von KI-Agenten: Einblicke aus einem groß angelegten öffentlichen Wettbewerb | AI 代理部署在安全方面面临的挑战:大规模公共竞争的展望 2507.20526v1 |
Authors (17): Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson
Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.
nan
Article 338
Title@2025-07-28 (1): AQUA: A Large Language Model for Aquaculture & Fisheries
Title: AQUA: A Large Language Model for Aquaculture & Fisheries | AQUA: Ein großes Sprachmodell für Aquakultur und Fischerei | AQUA:水产养殖和渔业大语言模式 2507.20520v1 |
Authors (7): Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an Agentic Framework for generating and refining high-quality synthetic data using a combination of expert knowledge, largescale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.
nan
Article 339
Title@2025-07-28 (1): Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Title: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training | Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Pre-Training | 推进藏藏人大语言模式,提供 “ 扩展数据 “ 和 “ 持续培训前 “ 。 2507.09205v4 |
Authors (17): Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
nan
Article 340
Title@2025-07-28 (1): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle | REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v7 |
Authors (4): Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT or GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on simpler prompts and vulnerability to reward hacking and may be biased. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using the global advantage normalization which is unbiased to improve the training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
nan
Article 341
Title@2025-07-28 (1): Customize Multi-modal RAI Guardrails with Precedent-based predictions
Title: Customize Multi-modal RAI Guardrails with Precedent-based predictions | Multimodale RAI-Guardrails mit vorausschauenden Vorhersagen anpassen | 定制具有先例预测的多式RAI护卫车 2507.20503v1 |
Authors (6): Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang
A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to the multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition model’s judgment on “precedents”, which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
nan
Article 342
Title@2025-07-28 (1): Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT
Title: Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT | Pruning for Performance: Effiziente Idiom- und Metaphor-Klassifikation in Low-Resource Konkani mit mBERT | 利用mBERT, 低资源 Konkani 中高效的低资源 Konkani 和同义词分类 2506.02005v2 |
Authors (7): Timothy Do, Pranav Saran, Harshita Poojary, Pranav Prabhu, Sean O’Brien, Vasu Sharma, Kevin Zhu
In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model’s efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.
nan
Article 343
Title@2025-07-28 (1): Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems
Title: Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems | Sprechen in Worten, Denken in Logik: Ein Dual-Process-Framework in QA-Systemen | 用文字说,用逻辑思考:质量保证系统中的双重处理框架 2507.20491v1 |
Authors (8): Tuan Bui, Trong Le, Phat Thai, Sang Nguyen, Minh Hua, Ngan Pham, Thang Bui, Tho Quan
Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.
nan
Article 344
Title@2025-07-28 (1): Juru: Legal Brazilian Large Language Model from Reputable Sources
Title: Juru: Legal Brazilian Large Language Model from Reputable Sources | Juru: Rechtliches brasilianisches Large Language Model aus seriösen Quellen | Juru:来自有名来源的巴西大语言法律模型 2403.18140v2 |
Authors (4): Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira
The high compute cost associated with pretraining large language models limits their research. Two strategies have emerged to address this issue: domain specialization and pretraining with high-quality data. To explore these strategies, we specialized the Mistral-7B model with 1.9 billion unique tokens from reputable Brazilian legal sources and conducted few-shot evaluations on legal and general knowledge test suites. Our model, Juru, demonstrates the benefits of domain specialization by achieving improved performance on legal benchmarks, even with a reduced amount of pretraining data. However, this domain specialization through continued pretraining comes at the cost of increased forgetting in unrelated domains, as evidenced by performance degradation on general knowledge test suites in both Portuguese and English. This study contributes to the growing body of scientific evidence showing that pretraining data selection may enhance the performance of large language models, enabling the exploration of these models at a lower cost. Juru is publicly available at https://huggingface.co/roseval/Juru-7B .
nan
Article 345
Title@2025-07-28 (1): Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Title: Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents | Benutzer vor ihnen selbst schützen: Schutz der kontextuellen Privatsphäre in Interaktionen mit Gesprächspartnern | 保护用户免受自我伤害:在与交流代理人的互动中保护环境隐私 2502.18509v2 |
Authors (7): Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy
Conversational agents are increasingly woven into individuals’ personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents-such as large language models (LLMs)-their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (sender) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We opensource the code at https://github.com/IBM/contextual-privacy-LLM.
nan
Article 346
Title@2025-07-28 (1): Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM
Title: Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM | Ähnliches Beispiel verbessern Retrieval-Ranking-Performance durch Revisiting RankSVM | 通过重审RanksSVM改进类似案例检索排名 2502.11131v2 |
Authors (2): Yuqi Liu, Yan Zheng
Given the rapid development of Legal AI, a lot of attention has been paid to one of the most important legal AI tasks–similar case retrieval, especially with language models to use. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank instead of language models. Specifically, we conduct experiments using a pairwise method–RankSVM as the classifier to substitute a fully connected layer, combined with commonly used language models on similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We finally come to the conclusion that RankSVM could generally help improve the retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with original classifiers by optimizing the precise ranking. It could also help mitigate overfitting owing to class imbalance. Our code is available in https://github.com/liuyuqi123study/RankSVM_for_SLR
nan
Article 347
Title@2025-07-28 (1): In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Title: In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents | In Prospect und Retrospect: Reflektierendes Speichermanagement für langfristige Personalisierte Dialogagenten | 展望和回顾:长期个人化对话代理人的反思记忆管理 2503.08026v2 |
Authors (15): Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities-utterances, turns, and sessions-into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
nan
Article 348
Title@2025-07-27 (7): Critiques of World Models
Title: Critiques of World Models | Kritik an Weltmodellen | 世界模式的证明 2507.05169v3 |
Authors (4): Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
nan
Article 349
Title@2025-07-27 (7): CodeNER: Code Prompting for Named Entity Recognition
Title: CodeNER: Code Prompting for Named Entity Recognition | CodeNER: Codeaufforderung für die benannte Entitätserkennung | 识别名称实体的代码提示代码 2507.20423v1 |
Authors (5): Sungwoo Han, Hyeyeon Kim, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
nan
Article 350
Title@2025-07-27 (7): Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?
Title: Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks? | Umfrage zu NLU-Benchmarks Diagnose Linguistische Phänomene: Warum nicht Diagnose-Benchmarks standardisieren? | NLU基准诊断语言神话调查:为什么不使诊断基准标准化? 2507.20419v1 |
Authors (3): Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi
Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that attracts researchers in the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they covered. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state-of-the-art, we noted that there is no naming convention for macro and micro categories or even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of the evaluation diagnostics benchmarks: “Why do not we have an evaluation standard for the NLU evaluation diagnostics benchmarks?” similar to ISO standard in industry. We conducted a deep analysis and comparisons of the covered linguistic phenomena in order to support experts in building a global hierarchy for linguistic phenomena in future. We think that having evaluation metrics for diagnostics evaluation could be valuable to gain more insights when comparing the results of the studied models on different diagnostics benchmarks.
nan
Article 351
Title@2025-07-27 (7): CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning
Title: CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning | CONCAP: Über das Englische hinaussehen mit Konzepten Retrieval-Augmented Captioning | CONCACM: 以概念检索增强说明方式在英语以外看问题 2507.20411v1 |
Authors (3): George Ibrahim, Rita Ramos, Yova Kementchedjhieva
Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.
nan
Article 352
Title@2025-07-27 (7): Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
Title: Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training | Clarify lernen: Multiturn-Gespräche mit aktionsbasiertem Kontrast-Selbst-Training | 学习澄清:与基于行动的差异性自我培训进行多方向对话 2406.00222v2 |
Authors (4): Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arık
Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation – when they are faced with ambiguity, they often overhedge or implicitly guess users’ true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs’ ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT’s efficacy under in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs’ ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
nan
Article 353
Title@2025-07-27 (7): Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification | Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung | 与基于可控 LLM 的可控 LLM 分类的 Sparse 自动编码器的自调节 2502.14133v3 |
Authors (4): Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier’s generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
nan
Article 354
Title@2025-07-27 (7): Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
Title: Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations | Kognitive Denkkette: Strukturierte multimodale Begründung über soziale Situationen | 认知思考链:社会状况的结构性多模式原因 2507.20409v1 |
Authors (5): Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap
Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge-all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8\% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
nan
Article 355
Title@2025-07-27 (7): Length Representations in Large Language Models
Title: Length Representations in Large Language Models | Längendarstellungen in großen Sprachmodellen | 大语言模式中的长长代表 2507.20398v1 |
Authors (5): Sangjun Moon, Dasom Choi, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Large language models (LLMs) have shown remarkable capabilities across various tasks, that are learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control have been unexplored yet. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model’s internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.
nan
Article 356
Title@2025-07-27 (7): Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation
Title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation | Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation | 以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v2 |
Authors (6): Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong
Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, cross evaluations show that our system generalizes well across diverse health misinformation topics and datasets. And human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.
nan
Article 357
Title@2025-07-27 (7): Memorization: A Close Look at Books
Title: Memorization: A Close Look at Books | Auswendiglernen: Ein genauer Blick auf Bücher | 记忆化:对书籍的近视 2504.12549v2 |
Authors (5): Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the “prefix-prompting” extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice’s Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
nan
Article 358
Title@2025-07-27 (7): Scaling Analysis of Interleaved Speech-Text Language Models
Title: Scaling Analysis of Interleaved Speech-Text Language Models | Skalierungsanalyse interleaved Speech-Text Language Models | 剖分间语音-文字语言模式扩大分析 2504.02398v2 |
Authors (4): Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - “Do interleaved SLMs scale more efficiently than textless-SLMs?” In this paper we answer a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics significantly differ from textless-SLMs, suggesting one should allocate notably more of the compute budget to increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled up model achieves comparable semantic speech performance to leading models, while using less compute and data. We open source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims/ .
nan
Article 359
Title@2025-07-27 (7): RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
Title: RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing | RMTBench: Benchmarking von LLMs durch Multi-Turn-Benutzer-Centric-Rollenspiel | RMTBench:通过多发用户中心发挥作用,确定LLMs基准 2507.20352v1 |
Authors (13): Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a \textbf{character-centric} approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive \textbf{user-centric} bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.
nan
Article 360
Title@2025-07-27 (7): DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
Title: DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns | DYNARTmo: Ein dynamisches Artikulationsmodell zur Visualisierung von Sprachbewegungsmustern | DYNARTmo:语音移动模式视觉化动态脉动模型 2507.20343v1 |
Authors (1): Bernd J. Kröger
We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
nan
Article 361
Title@2025-07-27 (7): FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Title: FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | FMSD-TTS: Wenige Aufnahmen Multi-Speeaker Multi-Dialekt Text-zu-Speech-Synthese für Ü-Tsang, Amdo und Kham Speech Dataset Generation | FMSD-TTS:为于赞、阿姆多和康言语数据集制作而制作的微小多声多声多功能多语音文本到语音合成合成 2505.14351v2 |
Authors (10): Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects-"U-Tsang, Amdo, and Kham-limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
nan
Article 362
Title@2025-07-27 (7): ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios
Title: ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios | ELMES: Ein automatisierter Rahmen für die Bewertung großer Sprachmodelle in Bildungsszenarien | ELMES:评估教育情景中大语言模式自动框架 2507.22947v1 |
Authors (12): Shou’ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao
The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at \emph{https://github.com/sii-research/elmes.git}.
nan
Article 363
Title@2025-07-27 (7): What is Wrong with Perplexity for Long-context Language Modeling?
Title: What is Wrong with Perplexity for Long-context Language Modeling? | Was ist falsch an Verwirrung für Langkontext-Sprachenmodellierung? | 长文本语言建模的复杂性有什么问题? 2410.23771v5 |
Authors (8): Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang
Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose \textbf{LongPPL}, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce \textbf{LongCE} (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at https://github.com/PKU-ML/LongPPL.
nan
Article 364
Title@2025-07-27 (7): Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
Title: Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation | Förderung dialektischer arabischer zu moderner arabischer Standard-Maschinenübersetzung | 向现代标准阿拉伯文机器翻译推广阿拉伯语 2507.20301v1 |
Authors (3): Abdullah Alabdullah, Lifeng Han, Chenghua Lin
Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.
nan
Article 365
Title@2025-07-27 (7): Real-time Factuality Assessment from Adversarial Feedback
Title: Real-time Factuality Assessment from Adversarial Feedback | Echtzeit-Faktualitätsbeurteilung aus dem Adversarial Feedback | 从反反向反馈反馈中实时进行实况评估 2410.14651v3 |
Authors (3): Sanxing Chen, Yukun Huang, Bhuwan Dhingra
We show that existing evaluations for assessing the factuality of news from conventional sources, such as claims on fact-checking websites, result in high accuracies over time for LLM-based detectors-even after their knowledge cutoffs. This suggests that recent popular false information from such sources can be easily identified due to its likely presence in pre-training/retrieval corpora or the emergence of salient, yet shallow, patterns in these datasets. Instead, we argue that a proper factuality evaluation dataset should test a model’s ability to reason about current events by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. Our iterative rewrite decreases the binary classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based GPT-4o detector. Our experiments reveal the important role of RAG in both evaluating and generating challenging news examples, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG-based evaluation helps discover more deceitful patterns.
nan
Article 366
Title@2025-07-27 (7): SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration
Title: SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration | SciToolAgent: Ein wissensbasierter wissenschaftlicher Agent für Multi-Tool-Integration | SciToolAgent: 多工具整合知识图表驱动科学代理 2507.20280v1 |
Authors (6): Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, Huajun Chen
Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph-based retrieval-augmented generation. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal-organic framework screening further demonstrate SciToolAgent’s capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non-experts.
nan
Article 367
Title@2025-07-27 (7): What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Title: What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations | Welche Sprache(n) denkt Aya-23? Wie Mehrsprachigkeit die Repräsentationen der internen Sprache beeinflusst | Aya-23 思考什么语言?多语言如何影响内部语言代表性? 2507.20279v1 |
Authors (4): Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel
Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23’s languagespecific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.
nan
Article 368
Title@2025-07-27 (7): Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning
Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning | Agent-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning | Agentar Fin-Fin-R1:通过域域专门知识、培训效率和高级理由加强金融情报 2507.16802v4 |
Authors (13): Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang
Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, tow-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.
nan
Article 369
Title@2025-07-27 (7): MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning
Title: MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning | MoL-RL: Destillieren von mehrstufigem Umweltfeedback in LLMs zur feedbackunabhängigen Begründung | MoL-RL:将多层环境反馈保留到LLMs,用于提供反馈-独立理由 2507.20278v1 |
Authors (5): Kang Yang, Jingxue Chen, Qingkun Tang, Tianxiang Zhang, Qianchun Lu
Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs’ reasoning capabilities in diverse domains.
nan
Article 370
Title@2025-07-27 (7): ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech
Title: ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech | ChildGuard: Ein spezieller Datensatz zur Bekämpfung von kindgewordener Hassrede | 儿童指南:打击针对儿童的仇恨言论专门数据集 2506.21613v2 |
Authors (6): Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem
Hate speech targeting children on social media is a serious and growing problem, yet current NLP systems struggle to detect it effectively. This gap exists mainly because existing datasets focus on adults, lack age specific labels, miss nuanced linguistic cues, and are often too small for robust modeling. To address this, we introduce ChildGuard, the first large scale English dataset dedicated to hate speech aimed at children. It contains 351,877 annotated examples from X (formerly Twitter), Reddit, and YouTube, labeled by three age groups: younger children (under 11), pre teens (11–12), and teens (13–17). The dataset is split into two subsets for fine grained analysis: a contextual subset (157K) focusing on discourse level features, and a lexical subset (194K) emphasizing word-level sentiment and vocabulary. Benchmarking state of the art hate speech models on ChildGuard reveals notable drops in performance, highlighting the challenges of detecting child directed hate speech.
nan
Article 371
Title@2025-07-27 (7): EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms
Title: EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms | EMBRACE: Inclusive Opinion Representation gestalten, indem implizite Gespräche mit sozialen Normen ausgerichtet werden | EMBRACE:通过与社会规范的关联性交流,形成包容性的见解代表制 2507.20264v1 |
Authors (2): Abeer Aldayel, Areej Alokaili
Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.
nan
Article 372
Title@2025-07-27 (7): Post-Completion Learning for Language Models
Title: Post-Completion Learning for Language Models | Post-Completion-Lernen für Sprachmodelle | 语文模式完成后学习 2507.20252v1 |
Authors (7): Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Can Huang
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (
nan
Article 373
Title@2025-07-27 (7): Modeling Professionalism in Expert Questioning through Linguistic Differentiation
Title: Modeling Professionalism in Expert Questioning through Linguistic Differentiation | Modellierung von Professionalität in der Expertenbefragung durch sprachliche Differenzierung | 通过语言差异问题专家提问的示范专业精神 2507.20249v1 |
Authors (2): Giulia D’Agostino, Chung-Chi Chen
Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
nan
Article 374
Title@2025-07-27 (7): Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers
Title: Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers | Contrast-CAT: Kontrastierende Aktivierungen für verbesserte Interpretierbarkeit in Transformer-basierten Textklassifikatoren | 反对-CAT:在基于变换器的文本分类中增强解释力的对比活动 2507.21186v1 |
Authors (3): Sungmin Han, Jeonghyun Lee, Sangkyun Lee
Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simpler tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of x1.30 in AOPC and x2.25 in LOdds over the most competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
nan
Article 375
Title@2025-07-27 (7): Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Title: Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models | Reframe Your Life Story: Interaktiver Erzähltherapeut und innovative Moment-Assessment mit großen Sprachmodellen | 重构你的生活故事:与大语言模式互动叙述治疗师和创新时间评估 2507.20241v1 |
Authors (9): Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.
nan
Article 376
Title@2025-07-27 (7): DoubleDipper: Improving Long-Context LLMs via Context Recycling
Title: DoubleDipper: Improving Long-Context LLMs via Context Recycling | DoubleDipper: Verbesserung der Langkontext-LLMs über Kontext-Recycling | 双重顶点:通过上下文再循环改进长文本LLMs 2406.13632v4 |
Authors (11): Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu
Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets with long context. Surprisingly, despite introducing only single-hop ICL examples, LLMs successfully generalize to multi-hop long-context QA using our approach.
nan
Article 377
Title@2025-07-27 (7): Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines
Title: Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines | Lernende-LLM-Chatbot-Interaktionen verstehen und die Auswirkungen von Sofortrichtlinien verstehen | 了解学习者-LLLM 聊天室互动和推动准则的影响 2504.07840v3 |
Authors (16): Cansu Koyuturk, Emily Theophilou, Sabrina Patania, Gregor Donabauer, Andrea Martinenghi, Chiara Antico, Alessia Telari, Alessia Testa, Sathya Bursic, Franca Garzotto, Davinia Hernandez-Leo, Udo Kruschwitz, Davide Taibi, Simona Amenta, Martin Ruskov, Dimitri Ognibene
Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.
nan
Article 378
Title@2025-07-27 (7): Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation
Title: Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation | Co-NAML-LSTUR: Ein kombiniertes Modell mit attentivem Multi-View-Lernen und Langzeit- und Kurzzeit-Benutzervertretungen für News-Empfehlungen | NAML-LTUR:与多视学习和新闻建议长期及短期用户代表相结合的综合模式 2507.20210v1 |
Authors (3): Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta
News recommendation systems play a vital role in mitigating information overload by delivering personalized news content. A central challenge is to effectively model both multi-view news representations and the dynamic nature of user interests, which often span both short- and long-term preferences. Existing methods typically rely on single-view features of news articles (e.g., titles or categories) or fail to comprehensively capture user preferences across time scales. In this work, we propose Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news modeling and LSTUR for capturing both long- and short-term user representations. Our model also incorporates BERT-based word embeddings to enhance semantic feature extraction. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Experimental results show that Co-NAML-LSTUR achieves substantial improvements over most state-of-the-art baselines on MIND-small and MIND-large, respectively. These results demonstrate the effectiveness of combining multi-view news representations with dual-scale user modeling. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR.
nan
Article 379
Title@2025-07-27 (7): IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
Title: IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs | IQ-Test für LLMs: Ein Bewertungsrahmen für die Entdeckung von Kernkompetenzen in LLMs | LLMLM的IQ测试:LLM中核心技能覆盖的评估框架 2507.20208v1 |
Authors (5): Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty
Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model’s overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models’ wholistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
nan
Article 380
Title@2025-07-27 (7): Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data
Title: Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data | Günstiges Lernen: Maximierung der Leistungsfähigkeit von Sprachmodellen für die Sozialdatenwissenschaft mit minimalen Daten | 廉价学习:利用最低数据使社会数据科学语言模型的绩效最大化 2401.12295v2 |
Authors (7): Leonardo Castro-Gonzalez, Yi-Ling Chung, Hannak Rose Kirk, John Francis, Angus R. Williams, Pica Johansson, Jonathan Bright
The field of machine learning has recently made significant progress in reducing the requirements for labelled training data when building new models. These cheaper' learning techniques hold significant potential for the social sciences, where development of large labelled training datasets is often a significant practical impediment to the use of machine learning for analytical tasks. In this article we review three
cheap’ techniques that have developed in recent years: weak supervision, transfer learning and prompt engineering. For the latter, we also review the particular case of zero-shot prompting of large language models. For each technique we provide a guide of how it works and demonstrate its application across six different realistic social science applications (two different tasks paired with three different dataset makeups). We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost. Our results are accompanied by a code repository to make it easy for others to duplicate our work and use it in their own research. Overall, our article is intended to stimulate further uptake of these techniques in the social sciences.
nan
Article 381
Title@2025-07-27 (7): Diversity-Enhanced Reasoning for Subjective Questions
Title: Diversity-Enhanced Reasoning for Subjective Questions | Diversity-Enhanced Reasoning für subjektive Fragen | 主观问题的多样性强化理由 2507.20187v1 |
Authors (4): Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
Large reasoning models (LRM) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiment on six benchmarks demonstrates MultiRole-R1’s effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
nan
Article 382
Title@2025-07-27 (7): SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
Title: SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding | SessionIntentBench: Ein Multi-Task Inter-Session Intention-Shift Modelling Benchmark für E-Commerce Kundenverhalten Verständnis | A. 会议内容:电子商务客户行为理解的多任务、多任务、跨会期、出于利益转移的 电子商业客户行为理解示范基准 2507.20185v1 |
Authors (16): Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song
Session history is a common way of recording user interacting behaviors throughout a browsing activity with multiple products. For example, if an user clicks a product webpage and then leaves, it might because there are certain features that don’t satisfy the user, which serve as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because insufficient information exploitation and only apparent information like descriptions and titles are used. There is also a lack of data and corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs’ capability on understanding inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotations to collect ground-truth label for a subset of collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis show injecting intention enhances LLMs’ performances.
nan
Article 383
Title@2025-07-27 (7): SGPO: Self-Generated Preference Optimization based on Self-Improver
Title: SGPO: Self-Generated Preference Optimization based on Self-Improver | SGPO: Selbsterzeugte Preference-Optimierung auf Basis von Self-Improver | SGPO:基于自我改造的自发优惠优化 2507.20181v1 |
Authors (4): Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim
Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
nan
Article 384
Title@2025-07-27 (7): Checklist Engineering Empowers Multilingual LLM Judges
Title: Checklist Engineering Empowers Multilingual LLM Judges | Checkliste Engineering Empowers Mehrsprachige LLM-Richter | 多语种LLM法官 2507.06774v2 |
Authors (2): Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators-a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
nan
Article 385
Title@2025-07-27 (7): Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective
Title: Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective | Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive | 日本大语言模型中从背景角度分析的交叉比阿语 2506.12327v2 |
Authors (9): Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu
An increasing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality – the constitutive and contextualized perspective on bias aroused by social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate the intersectional bias in LLMs on the question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased output varies according to its contexts even with the equal combination of social attributes.
nan
Article 386
Title@2025-07-27 (7): Goal Alignment in LLM-Based User Simulators for Conversational AI
Title: Goal Alignment in LLM-Based User Simulators for Conversational AI | Zielausrichtung in LLM-basierten Benutzersimulatoren für KI | 在基于LLM的LLM用户模拟器中实现目标对齐 2507.20152v1 |
Authors (6): Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-Tür
User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations–a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and {\tau}-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.
nan
Article 387
Title@2025-07-27 (7): The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Title: The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models | The Policy Cliff: Eine theoretische Analyse von Belohnungs-Policy-Karten in großen Sprachmodellen | 政策悬崖:大语言模式奖励政策图的理论分析 2507.20150v1 |
Authors (1): Xingcheng Xu
Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an “effective reward” aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
nan
Article 388
Title@2025-07-27 (7): Multi-Agent Interactive Question Generation Framework for Long Document Understanding
Title: Multi-Agent Interactive Question Generation Framework for Long Document Understanding | Multi-Agent Interactive Question Generierung Framework for Long Document Understanding | 长期文件理解问题多机构互动问题生成框架 2507.20145v1 |
Authors (9): Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno
Document Understanding (DU) in long-contextual scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (\textbf{AraEngLongBench}) are quite challenging to major open- and close-source LVLMs. The code and data proposed in this work can be found in https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.
nan
Article 389
Title@2025-07-27 (7): Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
Title: Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG | Multi-Stage Verifikations-Centric Framework zur Eindämmung der Halluzination in Multi-Modal RAG | 多模式RAG中减轻幻觉多阶段核查-中心框架 2507.20136v1 |
Authors (5): Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathways generation and a post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition’s scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
nan
Article 390
Title@2025-07-27 (7): EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models
Title: EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models | EvoSLD: Automatisierte Neural Scaling Law Discovery mit großen Sprachmodellen | EvoSLD: 用大语言模型发现自动神经放大法 2507.21184v1 |
Authors (4): Haowei Lin, Xiangyu Wang, Jianzhu Ma, Yitao Liang
Scaling laws are fundamental mathematical relationships that predict how neural network performance evolves with changes in variables such as model size, dataset size, and computational resources. Traditionally, discovering these laws requires extensive human expertise and manual experimentation. We introduce EvoSLD, an automated framework for Scaling Law Discovery (SLD) that leverages evolutionary algorithms guided by Large Language Models (LLMs) to co-evolve symbolic expressions and their optimization routines. Formulated to handle scaling variables, control variables, and response metrics across diverse experimental settings, EvoSLD searches for parsimonious, universal functional forms that minimize fitting errors on grouped data subsets. Evaluated on five real-world scenarios from recent literature, EvoSLD rediscovers exact human-derived laws in two cases and surpasses them in others, achieving up to orders-of-magnitude reductions in normalized mean squared error on held-out test sets. Compared to baselines like symbolic regression and ablated variants, EvoSLD demonstrates superior accuracy, interpretability, and efficiency, highlighting its potential to accelerate AI research. Code is available at https://github.com/linhaowei1/SLD.
nan
Article 391
Title@2025-07-27 (7): When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Title: When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars | Wann funktioniert Metadata Conditioning (NOT) für Sprachmodell-Vorschulungen? Eine Studie mit kontextfreien Grammatiken | 元数据条件(NOT)何时能为语言示范培训前培训提供语言示范?无背景语法研究 2504.17562v2 |
Authors (10): Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki
The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task’s prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model’s performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.
nan
Article 392
Title@2025-07-27 (7): MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Title: MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge | MaPPO: Maximale Posteriori-Preference-Optimierung mit vorherigem Wissen | MaPPPO: 与先前知识最优化的后世偏好 2507.21183v1 |
Authors (10): Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
As the era of large language models (LLMs) on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
nan
Article 393
Title@2025-07-27 (7): TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Title: TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling | TIB-STC: Ein großflächiger strukturierter tibetischer Benchmark für ressourcenarme Sprachmodellierung | TIB-STC: 低资源语言建模的西藏大型结构化基准 2503.18288v4 |
Authors (14): Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain benchmark specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates the benchmark’s effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine
nan
Article 394
Title@2025-07-27 (7): Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Title: Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice | Seed LiveInterpret 2.0: End-to-End Simultanübersetzung mit Ihrer Stimme | 种子实况解释2.0:用声音翻译终端到终端同声语音语音 2507.17527v3 |
Authors (28): Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Yujiao Du, Ting Han, Yuxiang Hu, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Jun Zhang, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, which is around a near 70% reduction that drastically enhances practical usability.
nan
Article 395
Title@2025-07-27 (7): Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio
Title: Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio | Messung von Informationsverzerrung bei hierarchischer Ultralangem Novel Reconstruction:The Optimal Expansion Ratio | 测量高层次超长新世纪重建中的信息扭曲:最佳扩展比率 2505.12572v2 |
Authors (2): Hanwen Shen, Ting Ying
A two stage novel generation framework (outline -> section outline -> manuscript) is widely used in long novel generation,(e.g., \textsc{DOME}, \textsc{Plan\&Write}, \textsc{Long Writer}), but study of such framework in ultra long novel(>1M words) reconstruction is little. Building on recent text compression methods (\textsc{LLMZip}, \textsc{LLM2Vec}), we conduct an information-theoretic analysis to quantify semantic distortion under different compression-expansion ratios. We examine how outline length affects information preservation. Experiments on ultra-long novels show that the optimal compression-expansion ratio significantly reduces semantic distortion compared to other non-optimal compression-expansion ratio.
nan
Article 396
Title@2025-07-27 (7): Language Models Resist Alignment: Evidence From Data Compression
Title: Language Models Resist Alignment: Evidence From Data Compression | Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression | 语言模型阻力对齐:数据压缩中的证据 2406.06144v5 |
Authors (10): Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang
Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning yield have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we further reveal that elasticity positively correlates with the increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weight and code are available at pku-lm-resist-alignment.github.io.
nan
Article 397
Title@2025-07-27 (7): AI-Driven Generation of Old English: A Framework for Low-Resource Languages
Title: AI-Driven Generation of Old English: A Framework for Low-Resource Languages | AI-Driven Generation of Old English: Ein Rahmen für Low-Resource-Sprachen | AI-Driven 一代老英语:低资源语言框架 2507.20111v1 |
Authors (4): Rodrigo Gabriel Salazar Alva, Matías Nuñez, Cristian López, Javier Martín Arista
Preserving ancient languages is essential for understanding humanity’s cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.
nan
Article 398
Title@2025-07-27 (7): Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations | Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen | 超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v2 |
Authors (1): A. Bochkov
Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
nan
Article 399
Title@2025-07-27 (7): EcoTransformer: Attention without Multiplication
Title: EcoTransformer: Attention without Multiplication | EcoTransformer: Achtung ohne Multiplikation | 生态转换:注意不乘数 2507.20096v1 |
Authors (2): Xin Gao, Xingming Xu
The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
nan
Article 400
Title@2025-07-27 (7): ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
Title: ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models | ProsodyLM: Enthüllen der neu entstehenden Prosody-Verarbeitungsfähigkeiten in Sprachmodellen | ProsodyLM: 解决语言模式中新出现的处理能力问题 2507.20091v1 |
Authors (7): Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang
Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information – we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
nan
Article 401
Title@2025-07-27 (7): Reinforcement learning fine-tuning of language model for instruction following and math reasoning
Title: Reinforcement learning fine-tuning of language model for instruction following and math reasoning | Verstärktes Lernen der Feinabstimmung des Sprachmodells für Unterricht und Mathe-Reinigung | 强化学习,微调用于教学的语文模式和数学推理 2506.21560v2 |
Authors (2): Yifu Han, Geo Zhang
This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoing tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
nan
Article 402
Title@2025-07-26 (6): The Devil is in the EOS: Sequence Training for Detailed Image Captioning
Title: The Devil is in the EOS: Sequence Training for Detailed Image Captioning | Der Teufel ist im EOS: Sequenztraining für detaillierte Bildunterschriften | 魔鬼在EOS:详细图像说明的序列训练 2507.20077v1 |
Authors (2): Abdelrahman Mohamed, Yova Kementchedjhieva
Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model’s tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
nan
Article 403
Title@2025-07-26 (6): PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training
Title: PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training | PITA: Präferenz-geführte Inferenz-Zeit-Ausrichtung für LLM nach dem Training | PITA:LLM培训后培训的优先指导推论-时间协调 2507.20067v1 |
Authors (4): Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai
Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback–a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM’s token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
nan
Article 404
Title@2025-07-26 (6): RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation
Title: RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation | RAG in the Wild: Über die (In)Wirksamkeit von LLMs mit Mixture-of-Knowledge Retrieval Augmentation | 野生ROG:关于利用混合知识回收增加的LLMs(内)效力 2507.20059v1 |
Authors (6): Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at https://github.com/ritaranx/RAG_in_the_Wild.
nan
Article 405
Title@2025-07-26 (6): A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications
Title: A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications | Ein Tensor-basierter Compiler und eine Laufzeit für die Spezifikationen des Neuron-Level DNN Certifier | 一个基于 Tensor 的编纂器和中子级别 DNN 验证符规格的运行时间 2507.20055v1 |
Authors (6): Avaljot Singh, Yamin Chandini Sarita, Aditya Mishra, Ishaan Goyal, Gagandeep Singh, Charith Mendis
The uninterpretability of DNNs has led to the adoption of abstract interpretation-based certification as a practical means to establish trust in real-world systems that rely on DNNs. However, the current landscape supports only a limited set of certifiers, and developing new ones or modifying existing ones for different applications remains difficult. This is because the mathematical design of certifiers is expressed at the neuron level, while their implementations are optimized and executed at the tensor level. This mismatch creates a semantic gap between design and implementation, making manual bridging both complex and expertise-intensive – requiring deep knowledge in formal methods, high-performance computing, etc. We propose a compiler framework that automatically translates neuron-level specifications of DNN certifiers into tensor-based, layer-level implementations. This is enabled by two key innovations: a novel stack-based intermediate representation (IR) and a shape analysis that infers the implicit tensor operations needed to simulate the neuron-level semantics. During lifting, the shape analysis creates tensors in the minimal shape required to perform the corresponding operations. The IR also enables domain-specific optimizations as rewrites. At runtime, the resulting tensor computations exhibit sparsity tied to the DNN architecture. This sparsity does not align well with existing formats. To address this, we introduce g-BCSR, a double-compression format that represents tensors as collections of blocks of varying sizes, each possibly internally sparse. Using our compiler and g-BCSR, we make it easy to develop new certifiers and analyze their utility across diverse DNNs. Despite its flexibility, the compiler achieves performance comparable to hand-optimized implementations.
nan
Article 406
Title@2025-07-26 (6): $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning
Title: $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning | $K^4$: Online Log Anomalienerkennung durch unüberwachtes Lernen | 4K元:在线记录异常探测不受监督的典型学习 2507.20051v1 |
Authors (6): Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary
Existing Log Anomaly Detection (LogAD) methods are often slow, dependent on error-prone parsing, and use unrealistic evaluation protocols. We introduce $K^4$, an unsupervised and parser-independent framework for high-performance online detection. $K^4$ transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, $K^4$ sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 $\mu$s.
nan
Article 407
Title@2025-07-26 (6): AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants
Title: AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants | KI als beratender Partner fördert interkulturelles Empathie für Amerikaner, scheitert aber für lateinamerikanische Teilnehmer | 作为审议伙伴的大赦国际促进美国人的文化间同情,但拉丁美洲参与者却失败 2504.13887v2 |
Authors (5): Isabel Villanueva, Tara Bobinac, Binwei Yao, Junjie Hu, Kaiping Chen
Despite increasing AI chatbot deployment in public discourse, empirical evidence on their capacity to foster intercultural empathy remains limited. Through a randomized experiment, we assessed how different AI deliberation approaches–cross-cultural deliberation (presenting other-culture perspectives), own-culture deliberation (representing participants’ own culture), and non-deliberative control–affect intercultural empathy across American and Latin American participants. Cross-cultural deliberation increased intercultural empathy among American participants through positive emotional engagement, but produced no such effects for Latin American participants, who perceived AI responses as culturally inauthentic despite explicit prompting to represent their cultural perspectives. Our analysis of participant-driven feedback, where users directly flagged and explained culturally inappropriate AI responses, revealed systematic gaps in AI’s representation of Latin American contexts that persist despite sophisticated prompt engineering. These findings demonstrate that current approaches to AI cultural alignment–including linguistic adaptation and explicit cultural prompting–cannot fully address deeper representational asymmetries in AI systems. Our work advances both deliberation theory and AI alignment research by revealing how the same AI system can simultaneously promote intercultural understanding for one cultural group while failing for another, with critical implications for designing equitable AI systems for cross-cultural democratic discourse.
nan
Article 408
Title@2025-07-26 (6): Infogen: Generating Complex Statistical Infographics from Documents
Title: Infogen: Generating Complex Statistical Infographics from Documents | Infogen: Erzeugen komplexer statistischer Infografiken aus Dokumenten | 信息源:从文件生成复杂的统计图表 2507.20046v1 |
Authors (5): Akash Ghosh, Aparna Garimella, Pritika Ramu, Sambaran Bandyopadhyay, Sriparna Saha
Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.
nan
Article 409
Title@2025-07-26 (6): Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs
Title: Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs | Kolumbianische Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Empfehlungen von LLMs | Colombia Worress y juéces canadienses:LLM公司在占领建议中的性别和乡村差别 2505.02456v2 |
Authors (5): Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
nan
Article 410
Title@2025-07-26 (6): FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression
Title: FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression | FAEDKV: Unendliche Window Fourier-Transformation für unvoreingenommene KV-Cache-Kompression | FAEDKV: 用于无偏见的 KV 缓存压缩的无限窗口 Fleier 变换 2507.20030v1 |
Authors (6): Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations – either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context – and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on LongBench benchmark demonstrate FAEDKV’s superiority over existing methods by up to 22\%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression based approaches.
nan
Article 411
Title@2025-07-26 (6): Selective Prompt Anchoring for Code Generation
Title: Selective Prompt Anchoring for Code Generation | Selektive Prompt-Ankerung für die Code-Generierung | 代代代代代代代代代代代代代代代代代 代代代代代代代代代代代代代 代代代代代代代代代代代代 2408.09121v6 |
Authors (2): Yuan Tian, Tianyi Zhang
Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.
nan
Article 412
Title@2025-07-26 (6): Preference learning made easy: Everything should be understood through win rate
Title: Preference learning made easy: Everything should be understood through win rate | Vorliebe Lernen leicht gemacht: Alles sollte durch Win-Rate verstanden werden | 首选学习容易:人人都应通过双赢率来理解一切 2502.10505v2 |
Authors (2): Lily H. Zhang, Rajesh Ranganath
Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due optimization difficulties and that optimization success predicts performance better than choices which affect the objective’s solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
nan
Article 413
Title@2025-07-26 (6): Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach
Title: Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach | Anomalieerkennung in der menschlichen Sprache durch Meta-Learning: Ein wenig heißer Ansatz | 通过元学习在人文语言中异常探测: “ 几热 “ 方法 2507.20019v1 |
Authors (4): Saurav Singla, Aarav Singla, Advik Gupta, Parnika Gupta
We propose a meta learning framework for detecting anomalies in human language across diverse domains with limited labeled data. Anomalies in language ranging from spam and fake news to hate speech pose a major challenge due to their sparsity and variability. We treat anomaly detection as a few shot binary classification problem and leverage meta-learning to train models that generalize across tasks. Using datasets from domains such as SMS spam, COVID-19 fake news, and hate speech, we evaluate model generalization on unseen tasks with minimal labeled anomalies. Our method combines episodic training with prototypical networks and domain resampling to adapt quickly to new anomaly detection tasks. Empirical results show that our method outperforms strong baselines in F1 and AUC scores. We also release the code and benchmarks to facilitate further research in few-shot text anomaly detection.
nan
Article 414
Title@2025-07-26 (6): A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
Title: A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio | Eine Praxis des Post-Trainings auf Llama-3 70B mit optimaler Auswahl des zusätzlichen Sprachmischverhältnisses | Llama-3-70B培训后做法,最佳选择其他语言混合比率 2409.06624v2 |
Authors (6): Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji
Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicates the optimal experimental setup. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of LLM on a real-life chat system which obtains satisfying performance.
nan
Article 415
Title@2025-07-26 (6): MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning
Title: MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning | MeTHanol: Modularisiertes Denken von Sprachmodellen mit Intermediate Layer Thinking, Decodierung und Bootstrapping Reasoning | METHanol:含有中间层思考、解毒和诱导理由的模块化思维语言模型 2409.12059v5 |
Authors (10): Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Yue Zhao, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji
Current research efforts are focused on enhancing the thinking and reasoning capability of large language model (LLM) by prompting, data-driven emergence and inference-time computation. In this study, we consider stimulating language model’s thinking and cognitive abilities from a modular perspective, which mimics the human brain architecture. We select a specific intermediate attention layer with newly implemented language heads. We conduct dual-layer fine-tuning by annotated (query, thought, answer) samples and show that the intermediate layer can also learn to decode fluent and reasonable language tokens. A two-pass inference mechanism is designed to generate thoughts then formal responses. The entire framework is called modularized thinking language model (MeTHanol) which can enhance LLM’s cognitive behaviors as indicated by Theory of Mind (ToM) and Vignette-based experiments. Case studies also show that MeTHanol can plan and self-reflect and generate human-like thoughts and answers, even on unseen and open-domain tasks. MeTHanol can also adapt to a personalized prompt and behave as the specified character. Our study holds promise for significant cognitive gains from a modular perspective. Our code, model and data are available at https://bachozean.github.io/methanol-page
nan
Article 416
Title@2025-07-26 (6): VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering
Title: VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering | VLQA: Der erste umfassende, große und hochqualitative vietnamesische Datensatz für die Beantwortung rechtlicher Fragen | VLQA:用于法律问题解答的第一综合、大、高质量越南数据集 2507.19995v1 |
Authors (6): Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.
nan
Article 417
Title@2025-07-26 (6): Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model
Title: Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model | Verbesserung der Leistungsfähigkeit sequentieller Empfehlungssysteme mit einem erweiterten Großsprachenmodell | 利用扩展大语言模式改进序列建议系统的绩效 2507.19990v1 |
Authors (2): Sinnyum Choi, Woong Kim
Recently, competition in the field of artificial intelligence (AI) has intensified among major technological companies, resulting in the continuous release of new large-language models (LLMs) that exhibit improved language understanding and context-based reasoning capabilities. It is expected that these advances will enable more efficient personalized recommendations in LLM-based recommendation systems through improved quality of training data and architectural design. However, many studies have not considered these recent developments. In this study, it was proposed to improve LLM-based recommendation systems by replacing Llama2 with Llama3 in the LlamaRec framework. To ensure a fair comparison, random seed values were set and identical input data was provided during preprocessing and training. The experimental results show average performance improvements of 38.65\%, 8.69\%, and 8.19\% for the ML-100K, Beauty, and Games datasets, respectively, thus confirming the practicality of this method. Notably, the significant improvements achieved by model replacement indicate that the recommendation quality can be improved cost-effectively without the need to make structural changes to the system. Based on these results, it is our contention that the proposed approach is a viable solution for improving the performance of current recommendation systems.
nan
Article 418
Title@2025-07-26 (6): Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Title: Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge | Robustes Daten-Wasserzeichen in Sprachmodellen durch Einspritzen fiktiver Kenntnisse | 在语言模型中,通过输入有说服力的知识在语言模型中进行强力数据水上标记 2503.04036v3 |
Authors (4): Xinyue Cui, Johnny Tian-Zheng Wei, Swabha Swayamdipta, Robin Jia
Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization during pretraining, while overlooking challenges that arise in other stages of the LLM lifecycle, such as the risk of watermark filtering during data preprocessing and verification difficulties due to API-only access. To address these challenges, we propose a novel data watermarking approach that injects plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM through seamlessly integrating in its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain effective after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.
nan
Article 419
Title@2025-07-26 (6): Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization
Title: Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization | Leveraging Fine-Tuned Large Language Models for Interpretable Pankreatic Cystic Lesion Feature Extraction and Risk Categorization | 利用微量使用大语言模型来利用可解释性恐慌性锥性电磁性悬浮物地物采掘和风险分类 2507.19973v1 |
Authors (17): Ebrahim Rasromani, Stella K. Kang, Yanqi Xu, Beisong Liu, Garvit Luhadia, Wan Fung Chui, Felicia L. Pasadyn, Yu Chih Hung, Julie Y. An, Edwin Mathieu, Zehui Gu, Carlos Fernandez-Granda, Ammar A. Javed, Greg D. Sacks, Tamas Gonda, Chenchan Huang, Yiqiu Shen
Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss’ Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss’ Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss’ Kappa = 0.893) or GPT-CoT (Fleiss’ Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
nan
Article 420
Title@2025-07-26 (6): Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
Title: Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text | Text2Vis: Ein anspruchsvolles und vielfältiges Benchmark zur Generierung multimodaler Visualisierungen aus Text | Text2Vis: 从文本中生成多式视觉化的质疑性和多样化基准 2507.19969v1 |
Authors (4): Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o`s pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.
nan
Article 421
Title@2025-07-26 (6): KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models
Title: KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models | KLAAD: Verfeinerung von Aufmerksamkeitsmechanismen zur Reduzierung gesellschaftlicher Bias in generativen Sprachmodellen | CPRAD: 完善关注机制,在产生语言模式中减少社会偏见 2507.19962v1 |
Authors (3): Seorin Kim, Dongyoung Lee, Jaejin Lee
Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
nan
Article 422
Title@2025-07-26 (6): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs | 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v4 |
Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
nan
Article 423
Title@2025-07-26 (6): Large Language Models Are Human-Like Internally
Title: Large Language Models Are Human-Like Internally | Große Sprachmodelle sind menschlich-innerlich | 大语言模型是人与人之间的内部大语言模型 2502.01615v2 |
Authors (5): Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin
Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior (Oh and Schuler, 2023b; Shain et al., 2024; Kuribayashi et al., 2024), leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.
nan
Article 424
Title@2025-07-26 (6): Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA
Title: Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA | Aufmerksamkeitsköpfe vor dem Zusammenführen ausrichten: Ein effektiver Weg, MHA in GQA umzuwandeln | 合并主题前对齐关注头部对齐:将MAHA转换为GQA的有效途径 2412.20677v2 |
Authors (4): Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin
Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence’s length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model’s post-training performance. Subsequently, we employ $\mathit{L_0}$ regularization to prune redundant parameters. The model after pruning can be adapted to the standard GQA framework. Experimental results show that our strategy can compress up to 87.5\% KV heads of LLaMA2-7B model and 75\% KV heads of Sheared-LLaMA-1.3B with acceptable performance degradation. Our code is released at https://github.com/fpcsong/mha2gqa.
nan
Article 425
Title@2025-07-26 (6): Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation | Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung | 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v2 |
Authors (6): Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this inf luence, finding that while statistically present, demographic factors account for a minor fraction ( 8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than potentially persona simulation.
nan
Article 426
Title@2025-07-26 (6): Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
Title: Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report | Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse | 《国际边界风险管理框架实际操作:风险分析技术报告》 2507.16534v2 |
Authors (38): Shanghai AI Lab, :, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R\&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-$45^\circ$ Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R\&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
nan
Article 427
Title@2025-07-26 (6): The Impact of Fine-tuning Large Language Models on Automated Program Repair
Title: The Impact of Fine-tuning Large Language Models on Automated Program Repair | Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur | 微调大语言模型对自动方案维修的影响 2507.19909v1 |
Authors (4): Roman Macháček, Anastasiia Grishina, Max Hort, Leon Moonen
Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. However, training such models requires a significant amount of resources. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation is done on three popular APR benchmarks (i.e., QuixBugs, Defects4J and HumanEval-Java) and considers six different LLMs with varying parameter sizes (resp. CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning techniques decrease the benchmarking performance of various models due to different data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict models in the amount of trainable parameters and achieve better results. Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE.
nan
Article 428
Title@2025-07-26 (6): CaliDrop: KV Cache Compression with Calibration
Title: CaliDrop: KV Cache Compression with Calibration | CaliDrop: KV Cache-Kompression mit Kalibrierung | CaliDrop: KV 缓存压缩加校准 2507.19906v1 |
Authors (9): Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose \textbf{CaliDrop}, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
nan
Article 429
Title@2025-07-26 (6): A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs
Title: A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs | Ein Gold Standard Datensatz und Evaluation Framework für Depression Erkennung und Erklärung in Social Media mit LLMs | 利用LLMM公司在社会媒体中发现和解释抑郁症的黄金标准数据集和评价框架 2507.19899v1 |
Authors (2): Prajval Bolegave, Pushpak Bhattacharya
Early detection of depression from online social media posts holds promise for providing timely mental health interventions. In this work, we present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories. Unlike prior datasets that primarily offer coarse post-level labels \cite{cohan-etal-2018-smhd}, our dataset enables fine-grained evaluation of both model predictions and generated explanations. We develop an evaluation framework that leverages this clinically grounded dataset to assess the faithfulness and quality of natural language explanations generated by large language models (LLMs). Through carefully designed prompting strategies, including zero-shot and few-shot approaches with domain-adapted examples, we evaluate state-of-the-art proprietary LLMs including GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our comprehensive empirical analysis reveals significant differences in how these models perform on clinical explanation tasks, with zero-shot and few-shot prompting. Our findings underscore the value of human expertise in guiding LLM behavior and offer a step toward safer, more transparent AI systems for psychological well-being.
nan
Article 430
Title@2025-07-26 (6): Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs
Title: Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs | Automatisieren der mathematischen Proof-Generierung mit Large Language Model Agents und Wissensgraphen | 使用大语言模型代理和知识图 2503.11657v2 |
Authors (5): Vincent Li, Tim Knappe, Yule Fu, Kevin Han, Kevin Zhu
Large language models have demonstrated remarkable capabilities in natural language processing tasks requiring multi-step logical reasoning capabilities, such as automated theorem proving. However, challenges persist within theorem proving, such as the identification of key mathematical concepts, understanding their interrelationships, and formalizing proofs correctly within natural language. We present KG-prover, a novel framework that leverages knowledge graphs mined from reputable mathematical texts to augment general-purpose LLMs to construct and formalize mathematical proofs. We also study the effects of scaling graph-based, test-time compute using KG-Prover, demonstrating significant performance improvements over baselines across multiple datasets. General-purpose LLMs improve up to 21\% on miniF2F-test when combined with KG-Prover, with consistent improvements ranging from 2-11\% on the ProofNet, miniF2F-test, and MUSTARD datasets without additional scaling. Furthermore, KG-Prover with o4-mini achieves over 50% miniF2F-test. This work provides a promising approach for augmenting natural language proof reasoning with knowledge graphs without the need for additional finetuning.
nan
Article 431
Title@2025-07-26 (6): Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam
Title: Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam | Zero-shot Leistung von Generative KI in brasilianischer portugiesischer medizinischer Prüfung | 巴西葡萄牙医学考试中创用AI的零弹性能 2507.19885v1 |
Authors (10): Cesar Augusto Madid Truyts, Amanda Gomes Rabelo, Gabriel Mesquita de Souza, Daniel Scaldaferri Lages, Adriano Jose Pereira, Uri Adrian Prync Flato, Eduardo Pontes dos Reis, Joaquim Edson Vieira, Paulo Sergio Panse Silveira, Edson Amaro Junior
Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian spoken portuguese from the medical residency entrance exam of the Hospital das Cl'inicas da Faculdade de Medicina da Universidade de S~ao Paulo (HCFMUSP) - the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.
nan
Article 432
Title@2025-07-26 (6): Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Title: Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning | Causal Sufficiency und Necessity verbessert Kette-of-Thought-Reasoning | C. 因果关系和必要性 改进审议链 理由 2506.09853v2 |
Authors (8): Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
nan
Article 433
Title@2025-07-26 (6): FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Title: FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models | FactReasoner: Ein probabilistischer Ansatz zur Langform-Faktivitätsbewertung für große Sprachmodelle | 事实研究者:对大语言模式长期实际评估的概率办法 2502.18573v2 |
Authors (8): Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale
Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
nan
Article 434
Title@2025-07-26 (6): The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment
Title: The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment | Der polnische Vokabular-Größentest: Ein neuartiger adaptiver Test für die rezeptive Vokabular-Bewertung | 波兰词汇大小测试:接受词汇评估的新适应性测试 2507.19869v1 |
Authors (3): Danil Fokin, Monika Płużyczka, Grigory Golovin
We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker’s proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1.475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at myvocab.info/pl.
nan
Article 435
Title@2025-07-26 (6): DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments
Title: DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments | DRIVE: Disfluency-Rich Synthetic Dialog Data Generierung Framework für intelligente Fahrzeugumgebungen | DIVE: 智能车辆环境数据生成框架 2507.19867v1 |
Authors (6): Anshul Chavda, M Jagadeesh, Chintalapalli Raja Kullayappa, B Jayaprakash, Medchalimi Sruthi, Pushpak Bhattacharyya
In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET’s human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.
nan
Article 436
Title@2025-07-26 (6): Agentic Reinforced Policy Optimization
Title: Agentic Reinforced Policy Optimization | Agentische verstärkte politische Optimierung | 强化政策优化 2507.19849v1 |
Authors (14): Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models’ intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
nan
Article 437
Title@2025-07-26 (6): Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs
Title: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs | Gemeinsames Verständnis von Fehlausrichtung im zielorientierten Dialog: Eine Fallstudie mit Ubuntu Chat Logs | 理解目标导向对话框中的共同点不匹配:与Ubuntu聊天日志的案例研究 2503.12370v2 |
Authors (6): Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik
While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. We find that although LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances requiring pragmatic or domain-specific reasoning.
nan
Article 438
Title@2025-07-26 (6): AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition
Title: AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition | AutoSign: Direkte Pose-zu-Text-Übersetzung für die kontinuierliche Erkennung von Zeichensprachen | 自动签名: 用于持续手语识别的直导 Pose-to- Text 翻译 2507.19840v1 |
Authors (4): Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, Assane Gueye
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which pose a challenge as it involves a combination of all these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages, overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. The use of this decoder-only approach allows the model to directly map between the features and the glosses without the need for CTC loss while also directly learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, achieving an improvement of up to 6.1\% in WER score compared to the best existing method.
nan
Article 439
Title@2025-07-26 (6): HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Title: HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs | HCAtention: Extreme KV Cache Compression via Heterogenes Aufmerksamkeitsrechnen für LLMs | HCAttention:通过不同式注意计算法对LLMs进行极端KV缓存压缩 2507.19823v1 |
Authors (5): Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao
Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
nan
Article 440
Title@2025-07-26 (6): A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy
Title: A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy | Ein strukturierter Bangla-Datensatz von Krankheits-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit | 改善诊断准确性疾病 – – 症状协会结构化孟加拉数据集 2506.13610v3 |
Authors (4): Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan
Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only the verified medical sources were included in the dataset, while those from non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. Each symptom cell contains a binary value, indicating whether a symptom is associated with a disease. Thereby, this structured representation makes the dataset very useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance
nan
Article 441
Title@2025-07-26 (6): LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models
Title: LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models | LLM-Barber: Block-Aware Rebuilder für Sparsity Maske in One-Shot für große Sprachmodelle | LLM-Barber:大语言模型单点单层面罩块件重建器 2408.10631v2 |
Authors (9): Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu
Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, leading to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using secondorder information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
nan
Article 442
Title@2025-07-26 (6): Flora: Effortless Context Construction to Arbitrary Length and Scale
Title: Flora: Effortless Context Construction to Arbitrary Length and Scale | Flora: Müheloser Kontext Aufbau zu willkürlicher Länge und Skala | Flora: 以任意长度和规模建造环境以达到任意长度和规模 2507.19786v1 |
Authors (8): Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu
Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at \href{https://github.com/txchen-USTC/Flora}{https://github.com/txchen-USTC/Flora}.
nan
Article 443
Title@2025-07-26 (6): UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities
Title: UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities | UloRL:Ein Ultra-Long-Output-Verstärkungs-Lernansatz zur Förderung großer Sprachmodelle | UloRL: 推进大语言模式解释能力超长输出强化学习方法 2507.19766v1 |
Authors (5): Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li
Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models’ reasoning abilities. Specifically, we divide ultra long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved 2.06x increase in training speed, while RL training with 128k-token outputs improves the model’s performance on AIME2025 from 70.9\% to 85.1\% and on BeyondAIME from 50.7\% to 61.9\%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
nan
Article 444
Title@2025-07-26 (6): Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs
Title: Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs | Sind Sie dort Gott? Leichte narrative Anmerkung der christlichen Fiction mit LMs | 轻量量级的基督教小说和LMs 2507.19756v1 |
Authors (5): Rebecca M. M. Hicke, Brian Haggard, Mia Ferrante, Rayhan Khanna, David Mimno
In addition to its more widely studied political activities, the American Evangelical movement has a well-developed but less externally visible cultural and literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators we first developed definitions and a codebook for “acts of God.” We then adapted those instructions designed for human annotators for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between the Left Behind books and Christian Fiction more broadly and between books by male and female authors.
nan
Article 445
Title@2025-07-26 (6): JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models
Title: JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models | JT-Math: Ein Multi-Stage-Framework für fortgeschrittene mathematische Vernunft in großen Sprachmodellen | JT- Math:大语言模型高级数学理由多阶段框架 2507.19748v1 |
Authors (9): Yifan Hao, Fangning Chao, Yaqian Hao, Zhaojun Cui, Huan Bai, Haiyu Zhang, Yankai Liu, Chao Deng, Junlan Feng
Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI’s O1-mini and GPT-4o , and demonstrating superior performance on competition-level mathematics.
nan
Article 446
Title@2025-07-26 (6): Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
Title: Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation | Assembly Your Crew: Automatisches Multi-Agenten-Kommunikationstopologie-Design über autoregressive Graphen-Generierung | 通过自动递减图形生成将您的组群组合成:自动多剂多剂通信地形设计 2507.18224v2 |
Authors (5): Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
nan
Article 447
Title@2025-07-25 (5): Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs
Title: Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs | Ta-G-T: Subjektivitätserfassung in Tabelle zur Textgenerierung über RDF Graphen | TaG-T:通过 RDF 图表生成文本的表格中主观性捕获 2507.19710v1 |
Authors (3): Ronak Upasham, Tathagata Dey, Pushpak Bhattacharyya
In Table-to-Text (T2T) generation, existing approaches predominantly focus on providing objective descriptions of tabular data. However, generating text that incorporates subjectivity, where subjectivity refers to interpretations beyond raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that leverages intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach enhances factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline employs smaller, fine-tuned T5 models while achieving comparable performance to GPT-3.5 and outperforming Mistral-7B and Llama-2 in several metrics. We evaluate our approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to enhance both factual correctness and subjectivity.
nan
Article 448
Title@2025-07-25 (5): Scalable MatMul-free Language Modeling
Title: Scalable MatMul-free Language Modeling | Skalierbare MatMul-freie Sprachmodellierung | 可缩放 MatMul 无语言建模 2406.02528v7 |
Authors (10): Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, Jason K. Eshraghian
Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested on models up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.
nan
Article 449
Title@2025-07-25 (5): Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
Title: Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks | Towards Inclusive NLP: Bewertung komprimierter Mehrsprachiger Transformer über unterschiedliche Sprach-Benchmarks | 实现包容性的《国家语言规划:评估跨越不同语文基准的压压压多语种变换器》 2507.19699v1 |
Authors (3): Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani
Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarking the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings shows significant performance differences driven by linguistic diversity and resource availability on SOTA LLMS as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.
nan
Article 450
Title@2025-07-25 (5): Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks
Title: Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks | Salsa als nonverbale Sprache – Der CoMPAS3D Datensatz und Benchmarks | Salsa 作为一种非语言的成形语言 – – CoMPAS3D数据集和基准 2507.19684v1 |
Authors (6): Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner’s proficiency, using haptic signaling as a primary form of communication. While today’s AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text-it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
nan
Article 451
Title@2025-07-25 (5): Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research
Title: Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research | Navigation auf die Risiken der Verwendung großer Sprachmodelle für die Textannotation in der sozialwissenschaftlichen Forschung | 利用大语言模式在社会科学研究中使用文字说明的风险 2503.22040v2 |
Authors (2): Hao Lin, Yongjun Zhang
Large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks associated with using LLMs for text classification tasks, using social movement studies as an example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework offers researchers tools to develop the potential best-performing prompt, and to systematically examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we evaluate and discuss its epistemic risks associated with validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks and offer recommendations for more effectively communicating epistemic risks in research.
nan
Article 452
Title@2025-07-25 (5): Benchmarking Linguistic Diversity of Large Language Models
Title: Benchmarking Linguistic Diversity of Large Language Models | Benchmarking Linguistische Vielfalt großer Sprachmodelle | 衡量大语言模式语言多样性的基准 2412.10271v2 |
Authors (3): Yanzhu Guo, Guokan Shang, Chloé Clavel
The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs.
nan
Article 453
Title@2025-07-25 (5): Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Title: Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs | Haben große Sprachmodelle einen englischen Akzent? Bewertung und Verbesserung der Natürlichkeit von mehrsprachigen LLMs | 大语言模式是否有英语中心? 2410.15956v3 |
Authors (6): Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, Henry Xiao
Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
nan
Article 454
Title@2025-07-25 (5): RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
Title: RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams | RoD-TAL: Ein Benchmark für die Beantwortung von Fragen in rumänischen Führerscheinprüfungen | RoD-TAL:在罗马尼亚驾驶执照考试中回答问题的基准 2507.19666v1 |
Authors (6): Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.
nan
Article 455
Title@2025-07-25 (5): Code-Switching and Syntax: A Large-Scale Experiment
Title: Code-Switching and Syntax: A Large-Scale Experiment | Code-Schalten und Syntax: Ein groß angelegtes Experiment | 代码开动和语法:大规模实验 2506.01846v2 |
Authors (2): Igor Sterner, Simone Teufel
The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.
nan
Article 456
Title@2025-07-25 (5): Minimal Pair-Based Evaluation of Code-Switching
Title: Minimal Pair-Based Evaluation of Code-Switching | Minimale Pair-basierte Auswertung von Code-Switching | 对代码转换的最小对等评价 2506.01840v2 |
Authors (2): Igor Sterner, Simone Teufel
There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
nan
Article 457
Title@2025-07-25 (5): Summarization of Opinionated Political Documents with Varied Perspectives
Title: Summarization of Opinionated Political Documents with Varied Perspectives | Zusammenfassung opinionierter politischer Dokumente mit unterschiedlichen Perspektiven | 具有不同观点的有见解的政治文件概述 2411.04093v2 |
Authors (2): Nicholas Deas, Kathleen McKeown
Global partisan hostility and polarization has increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 11 summarization models and LLMs of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries that are faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior is impacted by features of the input documents.
nan
Article 458
Title@2025-07-25 (5): OneShield – the Next Generation of LLM Guardrails
Title: OneShield – the Next Generation of LLM Guardrails | OneShield – die nächste Generation der LLM-Guardrails | OneShild – – 下一代LLM护卫车 2507.21170v1 |
Authors (10): Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty
The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes the task of universally shielding users against their potential risks extremely challenging, and one-size-fits-all solutions unfeasible. In this work, we propose OneShield, our stand-alone, model-agnostic and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, the scalability considerations and provide usage statistics of OneShield since its first deployment.
nan
Article 459
Title@2025-07-25 (5): Data Caricatures: On the Representation of African American Language in Pretraining Corpora
Title: Data Caricatures: On the Representation of African American Language in Pretraining Corpora | Daten Karikaturen: Zur Darstellung der afroamerikanischen Sprache im Vortraining Corpora | 数据制图:关于非洲裔美国人语言在预科公司中的代表性 2503.10789v2 |
Authors (8): Nicholas Deas, Blake Vente, Amith Ananthram, Jessica A. Grieser, Desmond Patton, Shana Kleiner, James Shepard, Kathleen McKeown
With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.
nan
Article 460
Title@2025-07-25 (5): Opacity as Authority: Arbitrariness and the Preclusion of Contestation
Title: Opacity as Authority: Arbitrariness and the Preclusion of Contestation | Opacity as Authority: Willkür und die Präklusion der Anfechtung | 作为权力的不透明度:仲裁和排除争议 2507.22944v1 |
Authors (1): Naomi Omeonga wa Kayembe
This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation -> Constatability -> Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as A = H(L | M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems. |
nan
Article 461
Title@2025-07-25 (5): MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Title: MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks | MCIF: Multimodale Crosslingual Instruction-Following Benchmark aus wissenschaftlichen Vorträgen | MCIF: 科学会谈的多模式跨语言教学基准 2507.19634v1 |
Authors (8): Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations – hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities – speech, vision, and text – and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs’ abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
nan
Article 462
Title@2025-07-25 (5): LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Title: LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning | LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung | LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v3 |
Authors (6): Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.
nan
Article 463
Title@2025-07-25 (5): HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
Title: HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track | HITSZs End-to-End-Sprachübersetzungssysteme zur Kombination von Sequenz-zu-Sequenz-Auto-Spracherkennungsmodell und indic Large Language Model für IWSLT 2025 in Indic Track | HITSZ的端到端语音翻译系统,将序列到序列自动语音识别模型和2025 IWSLT Indic Track IWSLT 2025 的指数式大语言模型结合起来 2507.19616v1 |
Authors (7): Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang
This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of $28.88$ for English-to-Indic directions and $27.86$ for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a $13.84$ BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
nan
Article 464
Title@2025-07-25 (5): MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
Title: MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? | MOCHA: Sind Code-Sprachenmodelle gegen multi-Turn bösartige Coding-Prompts robust? | MOCHA:守则语言模型是否强力打击多发恶意编码的提示? 2507.19598v1 |
Authors (8): Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, Ismini Lourentzou
Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce \benchmarkname{}, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
nan
Article 465
Title@2025-07-25 (5): Efficient Attention Mechanisms for Large Language Models: A Survey
Title: Efficient Attention Mechanisms for Large Language Models: A Survey | Effiziente Aufmerksamkeitsmechanismen für große Sprachmodelle: Eine Umfrage | 高效率关注大语言模式机制:调查 2507.19595v1 |
Authors (7): Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang
Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fastweight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into largescale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
nan
Article 466
Title@2025-07-25 (5): Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
Title: Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning | Geospatielles Wissen abmildern Halluzination in großen Sprachmodellen: Benchmarking und Dynamische Faktizität Ausrichtung | 减轻大语言模式中的地理空间知识幻觉:基准和动态事实对齐 2507.19586v1 |
Authors (5): Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li
Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
nan
Article 467
Title@2025-07-25 (5): MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Title: MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents | MMBench-GUI: Hierarchischer Mehrplattform-Evaluierungsrahmen für GUI-Agenten | MMMBench-GUI:图形用户界面代理器的等级多平台评价框架 2507.19478v1 |
Authors (28): Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More important, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
nan
Article 468
Title@2025-07-25 (5): Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts
Title: Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts | Weiterentwicklung der Event-Prognose durch massives Training von großen Sprachmodellen: Herausforderungen, Lösungen und breitere Auswirkungen | 通过大规模培训大语言模式:挑战、解决办法和更广泛影响 2507.19477v1 |
Authors (4): Sang-Woo Lee, Sohee Yang, Donghyun Kwak, Noah Y. Siegel
Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers’ interest in these directions.
nan
Article 469
Title@2025-07-25 (5): Long-Form Answers to Visual Questions from Blind and Low Vision People
Title: Long-Form Answers to Visual Questions from Blind and Low Vision People | Langform-Antworten auf visuelle Fragen von Blinden und Sehbehinderten | 对盲人和低视力者视觉问题的长期答复 2408.06303v2 |
Authors (8): Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel
Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to evaluate long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.
nan
Article 470
Title@2025-07-25 (5): Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models
Title: Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models | Gespräche sind schief gegangen, aber dann? Evaluieren von Gesprächsvorhersagemodellen | 对话消失,但后来呢?评价对话预测模型 2507.19470v1 |
Authors (5): Son Quoc Tran, Tushaar Gangavarapu, Nicholas Chernogor, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil
We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model’s ability to revise its forecast as the conversation progresses.
nan
Article 471
Title@2025-07-25 (5): RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale | RADLADS: Schnelle Aufmerksamkeitsdestillation zu linearen Aufmerksamkeitsdecodern auf Scale | RADLADS: 缩放线性引引代码的快速注意蒸馏 2505.03005v3 |
Authors (4): Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
nan
Article 472
Title@2025-07-25 (5): GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning | GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen | GEPA: 反思即时进化能够超过成绩的强化学习 2507.19457v1 |
Authors (17): Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
nan
Article 473
Title@2025-07-25 (5): A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies
Title: A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies | Ein Diagramm-Review-Prozess unterstützt durch natürliche Sprachverarbeitung und Multi-Wave adaptive Sampling zur Beschleunigung der Validierung von Code-basierten Algorithmen für große Datenbankstudien | 借助自然语言处理和多波适应性取样的图表审查过程,以加快大型数据库研究代码算法的验证工作 2507.22943v1 |
Authors (16): Shirley V Wang, Georg Hahn, Sushama Kattinakere Sreedhara, Mufaddal Mahesri, Haritha S. Pillai, Rajendra Aldis, Joyce Lii, Sarah K. Dutcher, Rhoda Eniafe, Jamal T. Jones, Keewan Kim, Jiwei He, Hana Lee, Sengwee Toh, Rishi J Desai, Jie Yang
Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically re-quired to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study us-ing two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of find-ings derived from database studies.
nan
Article 474
Title@2025-07-25 (5): Distillation Scaling Laws
Title: Distillation Scaling Laws | Destillationsskalierungsgesetze | 强化法律 2502.08606v2 |
Authors (6): Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
nan
Article 475
Title@2025-07-25 (5): TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
Title: TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability | TokenSmith: Verstärkte Datenbearbeitung, Suche und Inspektion für großformatige Sprachmodellschulungen und -dolmetschbarkeit | TokenSmitth:简化数据编辑、搜索和检查,以进行大型语文模式培训和解释 2507.19419v1 |
Authors (8): Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia
Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug and play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub1, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.
nan
Article 476
Title@2025-07-25 (5): Towards Domain Specification of Embedding Models in Medicine
Title: Towards Domain Specification of Embedding Models in Medicine | Auf dem Weg zur Domain-Spezifikation von Einbettungsmodellen in die Medizin | 走向医学嵌入模型的域域指定 2507.19407v1 |
Authors (4): Mohammad Khodadad, Ali Shiraee, Mahdi Astaraki, Hamidreza Mahyar
Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data, beside not being up to date in terms of methodology, making them ill suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state of the art alternatives in different tasks.
nan
Article 477
Title@2025-07-25 (5): CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback
Title: CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback | CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback | 代码化:通过混合和循环反馈对以代码为中心的数据进行互动驱动合成 2507.22080v1 |
Authors (5): Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan
Acquiring high-quality instruction-code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code-centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In-depth analyses further provide insights from multiple perspectives into effective code-centric data synthesis.
nan
Article 478
Title@2025-07-25 (5): Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question
Title: Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question | Vielfältige LLMs oder unterschiedliche Frageinterpretationen? Das ist die Assembling-Frage | 不同的LLMs或不同的问题解释? 2507.21168v1 |
Authors (2): Rafael Rosales, Santiago Miret
Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
nan
Article 479
Title@2025-07-25 (5): Data Augmentation for Spoken Grammatical Error Correction
Title: Data Augmentation for Spoken Grammatical Error Correction | Datenvergrößerung für gesprochene Grammatical Error Correction | 语音语法错误校正的数据增强 2507.19374v1 |
Authors (5): Penny Karanasou, Mengjie Qian, Stefano Bannò, Mark J. F. Gales, Kate M. Knill
While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S\&I Corpus, the first publicly available speech dataset with grammar error annotations.
nan
Article 480
Title@2025-07-25 (5): LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Title: LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences | LOTUS: Ein Leaderboard für detaillierte Bildunterschriften von Qualität zu gesellschaftlichen Bias und Benutzereinstellungen | LOTUS: 从质量到社会偏见和用户首选的详细图像描述领导板 2507.19362v1 |
Authors (10): Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (\eg, hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
nan
Article 481
Title@2025-07-25 (5): SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models
Title: SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models | SpeechIQ: Sprachintelligenz Quotient über kognitive Ebenen im Sprachverständnis von großen Sprachmodellen | 语音理解大语言模式中不同认知层次的语音情报引号 2507.19361v1 |
Authors (11): Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, Sheng Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi
We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
nan
Article 482
Title@2025-07-25 (5): SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Title: SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model | SALM-Duplex: Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell | SALM-Duplex:语音对语音语言模式的高效和直接双重模式 2505.15670v4 |
Authors (10): Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretrain. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretrain is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
nan
Article 483
Title@2025-07-25 (5): Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
Title: Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization | Verbesserung der Sprach-Emotions-Erkennung Auslevering Aligning Timestamps von ASR-Transkriptionen und Sprecher-Diarisierung | 利用ASR记录稿和议长对称的调和时标 2507.19356v1 |
Authors (3): Hsuan-Yu Wang, Pei-Ying Lee, Berlin Chen
In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
nan
Article 484
Title@2025-07-25 (5): DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue | DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog | DocrAgentor-RL:多轮临床对话多机构合作强化学习系统 2505.19630v2 |
Authors (5): Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li
Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose \Ours{}, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that \Ours{} outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL
nan
Article 485
Title@2025-07-25 (5): Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks
Title: Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks | Smooth Reading: Die Lücke von LLM zur Selbstaufmerksamkeit von LLM bei langen Kontextaufgaben überbrücken | 平滑阅读:弥合经常LLM与长期任务自用LLM之间的差距 2507.19353v1 |
Authors (7): Kai Liu, Zhan Su, Peijie Dong, Fengran Mo, Jianfei Gao, ShaoTing Zhang, Kai Chen
Recently, recurrent large language models (Recurrent LLMs) with linear computational complexity have re-emerged as efficient alternatives to self-attention-based LLMs (Self-Attention LLMs), which have quadratic complexity. However, Recurrent LLMs often underperform on long-context tasks due to their limited fixed-size memory. Previous research has primarily focused on enhancing the memory capacity of Recurrent LLMs through architectural innovations, but these approaches have not yet enabled Recurrent LLMs to match the performance of Self-Attention LLMs on long-context tasks. We argue that this limitation arises because processing the entire context at once is not well-suited for Recurrent LLMs. In this paper, we propose Smooth Reading, a chunk-wise inference method inspired by human reading strategies. Smooth Reading processes context in chunks and iteratively summarizes the contextual information, thereby reducing memory demands and making the approach more compatible with Recurrent LLMs. Our experimental results show that this method substantially narrows the performance gap between Recurrent and Self-Attention LLMs on long-context tasks, while preserving the efficiency advantages of Recurrent LLMs. Our Smooth Reading boosts SWA-3B-4k (a Recurrent LLM) from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench. Besides, our method maintains the high efficiency, training 3x faster and inferring 2x faster at 64k context compared to Self-Attention LLMs. To our knowledge, this is the first work to achieve comparable performance using Recurrent LLMs compared with Self-Attention LLMs on long-context tasks. We hope our method will inspire future research in this area. To facilitate further progress, we will release code and dataset.
nan
Article 486
Title@2025-07-25 (5): Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation
Title: Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation | Externes Wissen in den vernünftigen Prozess zu spritzen verbessert die retrieval-angereicherte Generation | 将外部知识注入说明过程,加强检索-提款一代 2507.19333v1 |
Authors (4): Minghao Tang, Shiyu Ni, Jiafeng Guo, Keping Bi
Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs’ robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection-a simple yet effective method that explicitly incorporates retrieved passages into LLMs’ reasoning process, aiming to enhance the model’s ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings-random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages-shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs’ reasoning process is a promising direction for building more robust RAG systems. The code can be found \href{here}{https://github.com/mh-tang/Passage-Injection}.
nan
Article 487
Title@2025-07-25 (5): References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation
Title: References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation | Referenzen Materie: Untersuchung der Auswirkungen von Referenzsatzvariationen auf die Bewertung der Zusammenfassung | 参考参考物质:调查参照标准差异对总结评价的影响 2506.14335v2 |
Authors (6): Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank
Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
nan
Article 488
Title@2025-07-25 (5): AutoPCR: Automated Phenotype Concept Recognition by Prompting
Title: AutoPCR: Automated Phenotype Concept Recognition by Prompting | AutoPCR: Automatisierte Erkennung von Phänomenen durch Prompting | 自动PCR:通过提示自动地识别基因型概念 2507.19315v1 |
Authors (3): Yicheng Tao, Yuanhao Huang, Jie Liu
Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.
nan
Article 489
Title@2025-07-25 (5): The Eloquence team submission for task 1 of MLC-SLM challenge
Title: The Eloquence team submission for task 1 of MLC-SLM challenge | Die Eloquence-Team-Einreichung für die Aufgabe 1 der MLC-SLM-Herausforderung | 刚果解运-解运挑战任务1的评分小组提交 2507.19308v1 |
Authors (5): Lorenzo Concina, Jordi Luque, Alessio Brutti, Marco Matassoni, Yuchen Zhang
In this paper, we present our studies and experiments carried out for the task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language models architectures. Given the increasing relevance of real-world conversational data for building robust Spoken Dialogue Systems, we explore three approaches to multilingual ASR. First, we conduct an evaluation of the official baseline to better understand its strengths and limitations, by training two projectors (linear and qformer) with different foundation models. Second we leverage the SLAM-ASR framework to train a custom multilingual linear projector. Finally we investigate the role of contrastive learning and the extended conversational context in enhancing the robustness of recognition.
nan
Article 490
Title@2025-07-25 (5): Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns
Title: Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns | Identifizierung feinkörniger Formen des Populismus im politischen Diskurs: Eine Fallstudie zu Donald Trumps Präsidentschaftswahlen | 确定政治讨论中精美的民粹主义形式:关于唐纳德·特朗普总统运动的个案研究 2507.19303v1 |
Authors (3): Ilias Chalkidis, Stephanie Brandl, Paris Aslanidis
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.
nan
Article 491
Title@2025-07-25 (5): A Markov Categorical Framework for Language Modeling
Title: A Markov Categorical Framework for Language Modeling | Ein kategorisches Markov-Rahmenwerk für Sprachmodellierung | 用于语言建模的 Markov 语言建模分类框架 2507.19247v1 |
Authors (1): Yifan Zhang
Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. While empirically powerful, a deep theoretical understanding of why this simple objective yields such versatile representations remains elusive. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. We model the single-step generation map as a composition of Markov kernels in the category Stoch. This compositional view, when enriched with statistical divergences, allows us to dissect information flow and learned geometry. Our framework makes three main contributions. First, we provide a formal, information-theoretic rationale for the success of modern speculative decoding methods like EAGLE, quantifying the information surplus in hidden states that these methods exploit. Second, we formalize how NLL minimization forces the model to learn not just the next token, but the data’s intrinsic conditional stochasticity, a process we analyze using categorical entropy. Third, and most centrally, we prove that NLL training acts as an implicit form of spectral contrastive learning. By analyzing the information geometry of the model’s prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a predictive similarity operator, thereby learning a geometrically structured space without explicit contrastive pairs. This compositional and information-geometric perspective reveals the deep structural principles underlying the effectiveness of modern LMs. Project Page: https://github.com/asiresearch/lm-theory
nan
Article 492
Title@2025-07-25 (5): Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation
Title: Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation | Jailbreaking Large Language Diffusion Models: Enthüllen versteckter Sicherheitsfehler bei der Diffusion-basierten Textgenerierung | 大语言传播模式:在以传播为基础的文本生成中披露隐藏的安全条 2507.19227v1 |
Authors (7): Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, Yufei Guo
Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks.The precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) prove limited effectiveness against LLDMs and fail to expose safety vulnerabilities.Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based architectures.To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences.We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which guides parallel generative processes toward harmful outputs that inspired by affirmative response patterns in LLMs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates by 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, significantly highlighting risks of uncontrolled misuse.Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.
nan
Article 493
Title@2025-07-25 (5): How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
Title: How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework | Wie viel Cheat bei der Evaluation eines großen Sprachmodells? Benchmarking-Überschätzung im Rahmen des One-Time-Pad-basierten Frameworks | 大语言模式在评价方面有多大的热量? 以单一时间为基础的框架为高估基准 2507.19219v1 |
Authors (5): Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu
Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator for private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.
nan
Article 494
Title@2025-07-25 (5): 3LM: Bridging Arabic, STEM, and Code through Benchmarking
Title: 3LM: Bridging Arabic, STEM, and Code through Benchmarking | 3LM: Arabisch, MINT und Code durch Benchmarking überbrücken | 3LM:通过基准确定连接阿拉伯语、STEM和代码 2507.15850v3 |
Authors (8): Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid
Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
nan
Article 495
Title@2025-07-25 (5): SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology
Title: SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology | SigBERT: Kombination narrativer medizinischer Berichte und rough Path Signature Theory zur Einschätzung des Überlebensrisikos in der Onkologie | SigBERT: 将叙述性医疗报告与肿瘤学生存风险估算的粗路签名理论相结合 2507.22941v1 |
Authors (5): Paul Minchella, Loïc Verlingue, Stéphane Chrétien, Rémi Vaucher, Guillaume Metzler
Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the L'eon B'erard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
nan
Article 496
Title@2025-07-25 (5): Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
Title: Towards Multimodal Social Conversations with Robots: Using Vision-Language Models | Auf dem Weg zu multimodalen sozialen Gesprächen mit Robotern: Mit Vision-Sprachen-Modellen | 走向与机器人的多模式社会对话:使用视觉语言模型 2507.19196v1 |
Authors (2): Ruben Janssens, Tony Belpaeme
Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.
nan
Article 497
Title@2025-07-25 (5): Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?
Title: Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models? | Kann Small-Scale-Datenvergiftung Dialect-Linked Biases in großen Sprachmodellen exazerbieren? | 在大语言模型中,小范围数据中毒加剧分解链接的分界线能否成为大语言模型? 2507.19195v1 |
Authors (3): Chaymaa Abbas, Mariette Awad, Razane Tajeddine
Despite the ongoing improvements in the design of large language models (LLMs) to foster inclusion and balanced responses, these systems remain susceptible to encoding and amplifying social biases. This study examines how dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), interacts with data poisoning to influence toxicity in outputs. Using both small- and medium-scale LLaMA models, we show that even minimal exposure to poisoned data significantly increases toxicity for AAVE inputs, while it remains comparatively unaffected for SAE. Larger models exhibit a more significant amplification effect which suggests heightened susceptibility with scale. To further assess these disparities, we employed GPT-4o as a fairness auditor, which identified harmful stereotypical patterns disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. These findings underscore the compounding impact of data poisoning and dialectal bias and emphasize the need for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols during development.
nan
Article 498
Title@2025-07-25 (5): Natural Language Processing for Tigrinya: Current State and Future Directions
Title: Natural Language Processing for Tigrinya: Current State and Future Directions | Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen | 提格里尼亚的自然语言处理:现状和未来方向 2507.17974v2 |
Authors (2): Fitsum Gaim, Jong C. Park
Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question-answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya’s morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. A curated metadata of the surveyed studies and resources is made publicly available.
nan
Article 499
Title@2025-07-25 (5): Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them
Title: Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them | Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie | 缩略图与锤子:GROPO 放大现有能力,SFT 替换 2507.10616v2 |
Authors (4): Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov
Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.
nan
Article 500
Title@2025-07-25 (5): An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case
Title: An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case | Eine empirische Untersuchung der Geschlechterstereotypdarstellung in großen Sprachmodellen: Der italienische Fall | 对大语言模式中性别陈规定型观念代表性的经验调查:意大利案例 2507.19156v1 |
Authors (5): Gioele Giachino, Marco Rondina, Antonio Vetrò, Riccardo Coppola, Juan Carlos De Martin
The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines in which manner LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs’ ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a range of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of ‘she’ pronouns to the ‘assistant’ rather than the ‘manager’. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplaces or in job selections, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods or further exploiting a larger experimental base.
nan
Article 501
Title@2025-07-25 (5): Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Title: Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings | Beschleunigung multimodaler Großsprachenmodelle über Dynamic Visual-Token Exit und die Empirical Findings | 通过动态直视退出和实证结论加速多模式大语言模型 2411.19628v2 |
Authors (7): Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji
The excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes to play. (iii) Multimodal reasoning} resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a bunch of benchmarks. The experiment results not only show the effectiveness of our VTE in improving MLLMs’ efficiency, but also yield the general modeling patterns of MLLMs, well facilitating the in-depth understanding of MLLMs. Our code is released at https://github.com/DoubtedSteam/DyVTE.
nan
Article 502
Title@2025-07-25 (5): Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
Title: Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes | Vertrauenswürdige Begründung: Bewertung und Verbesserung der tatsächlichen Genauigkeit in LLM-Intermediate-Thought-Prozessen | 值得信赖的理由:评估和加强LLM中级思考程序中的事实准确性 2507.22940v1 |
Authors (3): Rui Jiao, Yue Zhang, Jinku Li
We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
nan
Article 503
Title@2025-07-25 (5): OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
Title: OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth? | OS-MAP: Wie weit können Computer-verwendende Agenten in Breadth und Tiefe gehen? | OS-MAP:计算机用户在面包和深度上能走多远? 2507.19132v1 |
Authors (15): Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands-hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even State-of-the-Art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination-highlighting the need for a deeper understanding of current strengths and limitations to drive the future progress in computer-using agents research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
nan
Article 504
Title@2025-07-25 (5): Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning
Title: Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning | Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen | 通过强化学习,将LLMs的隐含多层结构提炼在“通过强化学习推理”中 2505.16142v3 |
Authors (9): Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.
nan
Article 505
Title@2025-07-25 (5): Objectifying the Subjective: Cognitive Biases in Topic Interpretations
Title: Objectifying the Subjective: Cognitive Biases in Topic Interpretations | Objektivierung des Subjektiven: Kognitive Biasen in thematischen Interpretationen | 表示主观性: 专题解释中的认知性分界线 2507.19117v1 |
Authors (7): Swapnil Hingmire, Ze Shi Li, Shiyu, Zeng, Ahmed Musa Awon, Luiz Franciscatto Guerra, Neil Ernst
Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality such as coherence and word intrusion do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded on a task, and a population of users, we do user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and provide rationale behind evaluations. We use reflexive thematic analysis to identify themes of topic interpretations from rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as making a judgment under uncertainty by an ecologically rational user, and hence cognitive biases aware user models and evaluation frameworks are needed.
nan
Article 506
Title@2025-07-25 (5): Relation Extraction with Instance-Adapted Predicate Descriptions
Title: Relation Extraction with Instance-Adapted Predicate Descriptions | Verhältnis-Extraktion mit instance-adapted Prädikat Beschreibungen | 采掘与原创性预言性说明的关系 2503.17799v2 |
Authors (2): Yuhang Jiang, Ramakanth Kavuluru
Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models are excelling in generative tasks, smaller encoder models are still the go to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from corresponding input instances. We conducted experiments on two biomedical RE datasets and two general domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of various components built into the proposed architecture.
nan
Article 507
Title@2025-07-25 (5): Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Title: Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy | Ensemble Debiasing Across Class und Sample Levels für eine gerechtere Genauigkeit | 公平促进准确性 2503.05157v4 |
Authors (3): Ruixi Lin, Ziqiao Wang, Yang You
Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function based ensemble debiasing method, which enables flexible rectifications of in-context learned class probabilities at both class and sample levels. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. More importantly, we perform analyses on the resulted probability correction scheme, showing that sample-level corrections are necessary to elevate weak classes. Due to effectively correcting weak classes, our method also brings significant performance gains to a larger model variant, Llama-2-70B, especially on a biomedical domain task, further demonstrating the necessity of ensemble debiasing at both levels. Our source code is available at https://github.com/NUS-HPC-AI-Lab/DCS.
nan
Article 508
Title@2025-07-25 (5): Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case
Title: Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case | Vergleich von Pipeline-, Sequenz-zu-Sequenz- und GPT-Modellen für die End-to-End-Relation-Extraktion: Experimente mit dem Einsatzfall der seltenen Krankheiten | 管道、序列到序列和终端到终端关系提取GPT模型的比较:与罕见疾病使用案例的实验 2311.13729v3 |
Authors (3): Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru
End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we aim to compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches (for E2ERE): NER $\rightarrow$ RE pipelines, joint sequence to sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors contributing to lower overall E2E performances. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, when training data is available, our results show that it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of the both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.
nan
Article 509
Title@2025-07-25 (5): Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation
Title: Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation | Destillieren eines kleinen Utility-Based Passage Selectors zur Verbesserung der Retrieval-Augmented Generation | 蒸馏一个小型以公用事业为基础的通道选择器,以加强回收-提款一代 2507.19102v1 |
Authors (7): Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval process prioritized relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
nan
Article 510
Title@2025-07-25 (5): How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction?
Title: How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction? | Wie wichtig ist Domain Specificity in Sprachmodellen und Instruction Finetuning für die biomedizinische Beziehungsextraktion? | 在生物医学关系采掘的语言模式和教学教学调整中,域的具体特点有多重要? 2402.13470v2 |
Authors (2): Aviv Brokman, Ramakanth Kavuluru
Cutting edge techniques developed in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated and biomedical instruction finetuning has been attempted as well, all with the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort in training such models, we investigate what, if any, benefits they have in the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general domain corpora? (2) Do models instruction finetuned on biomedical datasets outperform those finetuned on assorted datasets or those simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite having orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs over building domain-specific biomedical LMs
nan
Article 511
Title@2025-07-25 (5): JCAPT: A Joint Modeling Approach for CAPT
Title: JCAPT: A Joint Modeling Approach for CAPT | JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT | JCAPT: CAPT的联合示范方法 2506.19315v2 |
Authors (3): Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen
Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.
nan
Article 512
Title@2025-07-25 (5): LLMs are Also Effective Embedding Models: An In-depth Overview
Title: LLMs are Also Effective Embedding Models: An In-depth Overview | LLMs sind auch effektive Einbettungsmodelle: Eine ausführliche Übersicht | LLM项目也是有效的嵌入模型:深入概述 2412.12591v2 |
Authors (9): Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, Shuai Ma
Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an in-depth overview of this transition, beginning with foundational techniques before the LLM era, followed by LLM-based embedding models through two main strategies to derive embeddings from LLMs. 1) Direct prompting: We mainly discuss the prompt designs and the underlying rationale for deriving competitive embeddings. 2) Data-centric tuning: We cover extensive aspects that affect tuning an embedding model, including model architecture, training objectives, data constructions, etc. Upon the above, we also cover advanced methods for producing embeddings from longer texts, multilingual, code, cross-modal data, as well as reasoning-aware and other domain-specific scenarios. Furthermore, we discuss factors affecting choices of embedding models, such as performance/efficiency comparisons, dense vs sparse embeddings, pooling strategies, and scaling law. Lastly, the survey highlights the limitations and challenges in adapting LLMs for embeddings, including cross-task embedding quality, trade-offs between efficiency and accuracy, low-resource, long-context, data bias, robustness, etc. This survey serves as a valuable resource for researchers and practitioners by synthesizing current advancements, highlighting key challenges, and offering a comprehensive framework for future work aimed at enhancing the effectiveness and efficiency of LLMs as embedding models.
nan
Article 513
Title@2025-07-25 (5): Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents
Title: Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents | Debating Truth: Debattieren-getriebene Behauptungsverifizierung mit mehreren Large Language Model Agents | 讨论真相:由辩论驱动的与多语种示范语言代理核查索赔要求 2507.19090v1 |
Authors (5): Haorui He, Yupeng Li, Dacheng Wen, Reynold Cheng, Francis C. M. Lau
Claim verification is critical for enhancing digital literacy. However, the state-of-the-art single-LLM methods struggle with complex claim verification that involves multi-faceted evidences. Inspired by real-world fact-checking practices, we propose DebateCV, the first claim verification framework that adopts a debate-driven methodology using multiple LLM agents. In our framework, two Debaters take opposing stances on a claim and engage in multi-round argumentation, while a Moderator evaluates the arguments and renders a verdict with justifications. To further improve the performance of the Moderator, we introduce a novel post-training strategy that leverages synthetic debate data generated by the zero-shot DebateCV, effectively addressing the scarcity of real-world debate-driven claim verification data. Experimental results show that our method outperforms existing claim verification methods under varying levels of evidence quality. Our code and dataset are publicly available at https://anonymous.4open.science/r/DebateCV-6781.
nan
Article 514
Title@2025-07-25 (5): Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
Title: Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement | Arg-LlaDA: Argumentationszusammenfassung über Large Language Diffusion Models und Sufficiency-Aware Refinement | ARG-LLADA:通过大语言传播模型和充足软件精炼进行参数汇总 2507.19081v1 |
Authors (6): Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic
Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
nan
Article 515
Title@2025-07-25 (5): Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
Title: Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny | Re:Form – Reduzierung menschlicher Priore bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny | Re:形式 – – 在可扩展的正式软件核查中减少人类前科,LLL女士:关于Dafny的初步研究 2507.16331v2 |
Authors (16): Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
nan
Article 516
Title@2025-07-25 (5): ToolACE: Winning the Points of LLM Function Calling
Title: ToolACE: Winning the Points of LLM Function Calling | ToolACE: Die Punkte des LLM-Funktionsaufrufs gewinnen | 工具ACE:赢得LLLM函数调用点 2409.00920v2 |
Authors (27): Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen
Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
nan
Article 517
Title@2025-07-25 (5): GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness
Title: GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness | GOAT-SLM: Ein gesprochenes Sprachmodell mit paralinguistischem und Lautsprechercharakteristischem Bewusstsein | GOAT-SLM:具有多语言语言和议长特点意识的口语模式 2507.18119v2 |
Authors (16): Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
nan
Article 518
Title@2025-07-25 (5): XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare
Title: XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare | XAI4LLM. Lassen Sie Modelle für maschinelles Lernen und LLMs für verbessertes In-Context-Lernen im Gesundheitswesen zusammenarbeiten | XAI4LLLM. 让机器学习模式和LLM合作促进保健领域加强内文学习 2405.06270v4 |
Authors (4): Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
Clinical decision support systems require models that are not only highly accurate but also equitable and sensitive to the implications of missed diagnoses. In this study, we introduce a knowledge-guided in-context learning (ICL) framework designed to enable large language models (LLMs) to effectively process structured clinical data. Our approach integrates domain-specific feature groupings, carefully balanced few-shot examples, and task-specific prompting strategies. We systematically evaluate this method across seventy distinct ICL designs by various prompt variations and two different communication styles-natural-language narrative and numeric conversational-and compare its performance to robust classical machine learning (ML) benchmarks on tasks involving heart disease and diabetes prediction. Our findings indicate that while traditional ML models maintain superior performance in balanced precision-recall scenarios, LLMs employing narrative prompts with integrated domain knowledge achieve higher recall and significantly reduce gender bias, effectively narrowing fairness disparities by an order of magnitude. Despite the current limitation of increased inference latency, LLMs provide notable advantages, including the capacity for zero-shot deployment and enhanced equity. This research offers the first comprehensive analysis of ICL design considerations for applying LLMs to tabular clinical tasks and highlights distillation and multimodal extensions as promising directions for future research.
nan
Article 519
Title@2025-07-25 (5): T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Title: T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation | T2ISafety: Benchmark für die Bewertung von Fairness, Toxizität und Datenschutz in der Bildgenerierung | T2ISafetty:评估图像生成中的公平、毒性和隐私的基准 2501.12612v3 |
Authors (8): Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao
Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.
nan
Article 520
Title@2025-07-25 (5): Closing the Modality Gap for Mixed Modality Search
Title: Closing the Modality Gap for Mixed Modality Search | Schließen der Modalitätslücke für gemischte Modalitätssuche | 缩小混合方式搜索模式差距 2507.19054v1 |
Authors (6): Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy
Mixed modality search – retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents – is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench – the first benchmark specifically designed for mixed modality search – GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
nan
Article 521
Title@2025-07-25 (5): PARROT: An Open Multilingual Radiology Reports Dataset
Title: PARROT: An Open Multilingual Radiology Reports Dataset | PARROT: Ein offener Mehrsprachiger Röntgenbericht Datensatz | 开放多语种放射学报告数据集 2507.22939v1 |
Authors (88): Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn, Radhia Ait Chalal, Tugba Akinci D Antonoli, Philippe Amouyel, Henrik Andersson, Raphael Bentegeac, Claudio Benzoni, Antonino Andrea Blandino, Felix Busch, Elif Can, Riccardo Cau, Armando Ugo Cavallo, Christelle Chavihot, Erwin Chiquete, Renato Cuocolo, Eugen Divjak, Gordana Ivanac, Barbara Dziadkowiec Macek, Armel Elogne, Salvatore Claudio Fanni, Carlos Ferrarotti, Claudia Fossataro, Federica Fossataro, Katarzyna Fulek, Michal Fulek, Pawel Gac, Martyna Gachowska, Ignacio Garcia Juarez, Marco Gatti, Natalia Gorelik, Alexia Maria Goulianou, Aghiles Hamroun, Nicolas Herinirina, Krzysztof Kraik, Dominik Krupka, Quentin Holay, Felipe Kitamura, Michail E Klontzas, Anna Kompanowska, Rafal Kompanowski, Alexandre Lefevre, Tristan Lemke, Maximilian Lindholz, Lukas Muller, Piotr Macek, Marcus Makowski, Luigi Mannacio, Aymen Meddeb, Antonio Natale, Beatrice Nguema Edzang, Adriana Ojeda, Yae Won Park, Federica Piccione, Andrea Ponsiglione, Malgorzata Poreba, Rafal Poreba, Philipp Prucker, Jean Pierre Pruvo, Rosa Alba Pugliesi, Feno Hasina Rabemanorintsoa, Vasileios Rafailidis, Katarzyna Resler, Jan Rotkegel, Luca Saba, Ezann Siebert, Arnaldo Stanzione, Ali Fuat Tekin, Liz Toapanta Yanchapaxi, Matthaios Triantafyllou, Ekaterini Tsaoulia, Evangelia Vassalou, Federica Vernuccio, Johan Wasselius, Weilang Wang, Szymon Urban, Adrian Wlodarczak, Szymon Wlodarczak, Andrzej Wysocki, Lina Xu, Tomasz Zatonski, Shuhang Zhang, Sebastian Ziegelmayer, Gregory Kuchcinski, Keno K Bressem
Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.
nan
Article 522
Title@2025-07-25 (5): FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems
Title: FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems | FD-Bench: Eine Full-Duplex-Benchmarking-Pipeline für volle Duplex-Gesprochene Dialogsysteme | FD-Bench:为全双口孔对话系统设计的全自动基准管道 2507.19040v1 |
Authors (7): Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng
Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.
nan
Article 523
Title@2025-07-25 (5): MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
Title: MLLM-based Speech Recognition: When and How is Multimodality Beneficial? | MLLM-basierte Spracherkennung: Wann und wie ist Multimodalität vorteilhaft? | 基于MLLM的语音识别:多式联运何时和如何受益? 2507.19037v1 |
Authors (4): Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
Recent advances in multi-modal large language models (MLLMs) have opened new possibilities for unified modeling of speech, text, images, and other modalities. Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. Through experiments on synthetic and real-world data, we find that (1) harnessing more modalities usually improves ASR accuracy, as each modality provides complementary information, but the improvement depends on the amount of auditory noise. (2) Synchronized modalities (e.g., lip movements) are more useful at high noise levels whereas unsynchronized modalities (e.g., image context) are most helpful at moderate noise levels. (3) Higher-quality visual representations consistently improve ASR accuracy, highlighting the importance of developing more powerful visual encoders. (4) Mamba exhibits similar trends regarding the benefits of multimodality as do Transformers. (5) The input order of modalities as well as their weights in the loss function can significantly impact accuracy. These findings both offer practical insights and help to deepen our understanding of multi-modal speech recognition under challenging conditions.
nan
Article 524
Title@2025-07-25 (5): A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
Title: A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents | Ein Graph-basierter Ansatz für Multi-Modal-Fragebeantwortungen aus Flussdiagrammen in Telecom-Dokumenten | 以图表为基础的电信文件流动图表多模式问题解答方法 2507.22938v1 |
Authors (7): Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain, Pranav Gangrade, Ayaaz Khan
Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrate the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
nan
Article 525
Title@2025-07-25 (5): Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems
Title: Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems | Akustisch präzises Hesitations-Tagging ist für End-to-End-Transkriptionssysteme unerlässlich | 终端至终端逐字记录翻译系统至关重要的隐含精确言辞 2506.04076v2 |
Authors (5): Jhen-Ke Lin, Hao-Chien Lu, Chung-Chun Wang, Hong-Yun Lin, Berlin Chen
Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the “Extra” scheme yielded a 5.5% WER, an 11.3% relative improvement over the “Pure” scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
nan
Article 526
Title@2025-07-25 (5): Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations
Title: Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations | Töten Sie zwei Vögel mit einer Klappe: generalisierte und robuste KI-generierte Texterkennung durch dynamische Störungen | 以一石一石杀死两鸟:通过动态扰动,普遍和有力地检测AI产生的文本 2504.21019v2 |
Authors (6): Yinghan Zhou, Juan Wen, Wanli Peng, Yiming Xue, Ziwei Zhang, Zhengxian Wu
The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. The unified mechanism, to simultaneously address the challenges of generalization and robustness, is less explored. In this paper, we argue that robustness can be view as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization of AIGT detection task. Then, we proposed a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by a reinforcement learning with elaborated reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, the DP-Net achieves best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.
nan
Article 527
Title@2025-07-25 (5): Advancing biomolecular understanding and design following human instructions
Title: Advancing biomolecular understanding and design following human instructions | Verbesserung des biomolekularen Verständnisses und Designs nach menschlichen Anweisungen | 按照人类的指示,推动生物分子理解和设计 2410.07919v2 |
Authors (12): Xiang Zhuang, Keyan Ding, Tianwen Lyu, Yinuo Jiang, Xiaotong Li, Zhuoyi Xiang, Zeyuan Wang, Ming Qin, Kehua Feng, Jike Wang, Qiang Zhang, Huajun Chen
Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology and enzyme engineering. Recent breakthroughs in artificial intelligence have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between artificial intelligence’s computational capabilities and researchers’ intuitive goals, particularly in using natural language to bridge complex tasks with human intentions. Large language models have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a large language model designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules and proteins. This model can integrate multimodal biomolecules as the input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate that InstructBioMol can understand and design biomolecules following human instructions. In particular, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an enzyme-substrate pair prediction score of 70.4. This highlights its potential to transform real-world biomolecular research. The code is available at https://github.com/HICAI-ZJU/InstructBioMol.
nan
Article 528
Title@2025-07-25 (5): HIVMedQA: Benchmarking large language models for HIV medical decision support
Title: HIVMedQA: Benchmarking large language models for HIV medical decision support | HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung | HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准 2507.18143v2 |
Authors (6): Gonzalo Cardenal-Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux
Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.
nan
Article 529
Title@2025-07-25 (5): Verbalized Representation Learning for Interpretable Few-Shot Generalization
Title: Verbalized Representation Learning for Interpretable Few-Shot Generalization | Verbalisiertes Repräsentationslernen für verdolmetschbare wenige-heiße Verallgemeinerung | 以口头方式进行代表性学习,为可口译的少或偏的普及化提供口译 2411.18651v2 |
Authors (6): Cheng-Fu Yang, Da Yin, Wenbo Hu, Nanyun Peng, Bolei Zhou, Kai-Wei Chang
Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller mode. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.
nan
Article 530
Title@2025-07-25 (5): Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation
Title: Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation | Bewertung von LLM-Fehlern, für die Personalisierte Disinformationsgenerierung missbräuchlich verwendet zu werden | 评价LLMM 利用LLM 个人化信息生成不当利用他人造成个人化信息的脆弱性 2412.13666v2 |
Authors (7): Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarina Marcincinova, Matus Mesarcik
The capabilities of recent large language models (LLMs) to generate high-quality content indistinguishable by humans from human-written texts raises many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused for generating disinformation news articles following predefined narratives. Their capabilities to generate personalized (in various aspects) content have also been evaluated and mostly found usable. However, a combination of personalization and disinformation abilities of LLMs has not been comprehensively studied yet. Such a dangerous combination should trigger integrated safety filters of the LLMs, if there are some. This study fills this gap by evaluating vulnerabilities of recent open and closed LLMs, and their willingness to generate personalized disinformation news articles in English. We further explore whether the LLMs can reliably meta-evaluate the personalization quality and whether the personalization affects the generated-texts detectability. Our results demonstrate the need for stronger safety-filters and disclaimers, as those are not properly functioning in most of the evaluated LLMs. Additionally, our study revealed that the personalization actually reduces the safety-filter activations; thus effectively functioning as a jailbreak. Such behavior must be urgently addressed by LLM developers and service providers.
nan
Article 531
Title@2025-07-25 (5): CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering
Title: CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering | CoE-Ops: Zusammenarbeit von LLM-basierten Experten für AIOps Frage-Antwort | 欧委会行动:以LLM为基础的专家协作处理AIOps问题 2507.22937v1 |
Authors (9): Jinkun Zhao, Yuanshuai Wang, Xingjian Zhang, Ruibo Chen, Xingchuang Liao, Junle Wang, Lei Huang, Kui Zhang, Wenjun Wu
With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Lots of work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirement of a specific task,such as log parser,root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, which have been proved in both previous ensemble learning and the recent LLM training domain. Inspired by these works,to address the similar challenges in AIOPS, this paper first proposes a collaboration-of-expert framework(CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework’s capability in handling both Question-Answering tasks with high-level(Code,build,Test,etc.) and low-level(fault analysis,anomaly detection,etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.
nan
Article 532
Title@2025-07-25 (5): MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts
Title: MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | MultiSocial: Mehrsprachiger Benchmark der maschinengenerierten Texterkennung von Social-Media-Texten | 多社会多语言:社会-媒体文本机制文本检测多语言基准 2406.12549v2 |
Authors (4): Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba
Recent LLMs are able to generate high-quality multilingual texts, indistinguishable for humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods in detection of such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors have no problem to be trained on social-media texts and that the platform selection for training matters.
nan
Article 533
Title@2025-07-25 (5): A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation
Title: A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation | Eine Toolbox, kein Hammer – Multi-TAG: Skalierung der Mathematik mit Multi-Tool-Aggregation | 一个工具箱, 不是锤锤 – – 多TAG: 使用多工具聚合的量性数学解释 2507.18973v1 |
Authors (2): Bohan Yao, Vikas Yadav
Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5% over state-of-the-art baselines.
nan
Article 534
Title@2025-07-25 (5): Spike No More: Stabilizing the Pre-training of Large Language Models
Title: Spike No More: Stabilizing the Pre-training of Large Language Models | Spike No More: Stabilisierung der Vorausbildung großer Sprachmodelle | Spike No No More: 稳定大语言模式培训前 2312.16903v4 |
Authors (4): Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.
nan
Article 535
Title@2025-07-25 (5): A Similarity Measure for Comparing Conversational Dynamics
Title: A Similarity Measure for Comparing Conversational Dynamics | Eine Ähnlichkeitsmessung für den Vergleich von Konversationsdynamiken | 比较相互动态的相似性措施 2507.18956v1 |
Authors (3): Sang Min Jung, Kaixiang Zhang, Cristian Danescu-Niculescu-Mizil
The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall “shape”. However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure’s utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.
nan
Article 536
Title@2025-07-25 (5): MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model
Title: MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model | MedicalBERT: Verbesserung der biomedizinischen natürlichen Sprachverarbeitung mit vorgebildetem BERT-basiertem Modell | 医学BERT:利用预先培训的BERT模式,加强生物医学自然语言处理 2507.08013v2 |
Authors (6): K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S
Recent advances in natural language processing (NLP) have been driven bypretrained language models like BERT, RoBERTa, T5, and GPT. Thesemodels excel at understanding complex texts, but biomedical literature, withits domain-specific terminology, poses challenges that models likeWord2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fullyaddress. GPT and T5, despite capturing context, fall short in tasks needingbidirectional understanding, unlike BERT. Addressing this, we proposedMedicalBERT, a pretrained BERT model trained on a large biomedicaldataset and equipped with domain-specific vocabulary that enhances thecomprehension of biomedical terminology. MedicalBERT model is furtheroptimized and fine-tuned to address diverse tasks, including named entityrecognition, relation extraction, question answering, sentence similarity, anddocument classification. Performance metrics such as the F1-score,accuracy, and Pearson correlation are employed to showcase the efficiencyof our model in comparison to other BERT-based models such as BioBERT,SciBERT, and ClinicalBERT. MedicalBERT outperforms these models onmost of the benchmarks, and surpasses the general-purpose BERT model by5.67% on average across all the tasks evaluated respectively. This work alsounderscores the potential of leveraging pretrained BERT models for medicalNLP tasks, demonstrating the effectiveness of transfer learning techniques incapturing domain-specific information. (PDF) MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model. Available from: https://www.researchgate.net/publication/392489050_MedicalBERT_enhancing_biomedical_natural_language_processing_using_pretrained_BERT-based_model [accessed Jul 06 2025].
nan
Article 537
Title@2025-07-25 (5): Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection
Title: Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection | Zusammenfassung des Rechtsdokuments: Verbesserung der richterlichen Effizienz durch Automatisierungserkennung | 法律文件摘要:通过自动检测提高司法效率 2507.18952v1 |
Authors (4): Yongjie Li, Ruilin Nong, Jianan Liu, Lucas Evans
Legal document summarization represents a significant advancement towards improving judicial efficiency through the automation of key information detection. Our approach leverages state-of-the-art natural language processing techniques to meticulously identify and extract essential data from extensive legal texts, which facilitates a more efficient review process. By employing advanced machine learning algorithms, the framework recognizes underlying patterns within judicial documents to create precise summaries that encapsulate the crucial elements. This automation alleviates the burden on legal professionals, concurrently reducing the likelihood of overlooking vital information that could lead to errors. Through comprehensive experiments conducted with actual legal datasets, we demonstrate the capability of our method to generate high-quality summaries while preserving the integrity of the original content and enhancing processing times considerably. The results reveal marked improvements in operational efficiency, allowing legal practitioners to direct their efforts toward critical analytical and decision-making activities instead of manual reviews. This research highlights promising technology-driven strategies that can significantly alter workflow dynamics within the legal sector, emphasizing the role of automation in refining judicial processes.
nan
Article 538
Title@2025-07-25 (5): Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics
Title: Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics | Adaptive Lernsysteme: Personalisierte Lehrplangestaltung mit LLM-Powered Analytics | 适应性学习系统:利用LLM能动分析器的个人化课程设计 2507.18949v1 |
Authors (4): Yongjie Li, Ruilin Nong, Jianan Liu, Lucas Evans
Large language models (LLMs) are revolutionizing the field of education by enabling personalized learning experiences tailored to individual student needs. In this paper, we introduce a framework for Adaptive Learning Systems that leverages LLM-powered analytics for personalized curriculum design. This innovative approach uses advanced machine learning to analyze real-time data, allowing the system to adapt learning pathways and recommend resources that align with each learner’s progress. By continuously assessing students, our framework enhances instructional strategies, ensuring that the materials presented are relevant and engaging. Experimental results indicate a marked improvement in both learner engagement and knowledge retention when using a customized curriculum. Evaluations conducted across varied educational environments demonstrate the framework’s flexibility and positive influence on learning outcomes, potentially reshaping conventional educational practices into a more adaptive and student-centered model.
nan
Article 539
Title@2025-07-25 (5): TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models
Title: TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models | TreeReader: Ein Hierarchischer Akademischer Papierleser Powered by Language Models | 树形阅读器:一个按语言模式授权的等级学术论文阅读器 2507.18945v1 |
Authors (7): Zijian Zhang, Pan Chen, Fangshi Du, Runlong Ye, Oliver Huang, Michael Liut, Alán Aspuru-Guzik
Efficiently navigating and understanding academic papers is crucial for scientific progress. Traditional linear formats like PDF and HTML can cause cognitive overload and obscure a paper’s hierarchical structure, making it difficult to locate key information. While LLM-based chatbots offer summarization, they often lack nuanced understanding of specific sections, may produce unreliable information, and typically discard the document’s navigational structure. Drawing insights from a formative study on academic reading practices, we introduce TreeReader, a novel language model-augmented paper reader. TreeReader decomposes papers into an interactive tree structure where each section is initially represented by an LLM-generated concise summary, with underlying details accessible on demand. This design allows users to quickly grasp core ideas, selectively explore sections of interest, and verify summaries against the source text. A user study was conducted to evaluate TreeReader’s impact on reading efficiency and comprehension. TreeReader provides a more focused and efficient way to navigate and understand complex academic literature by bridging hierarchical summarization with interactive exploration.
nan
Article 540
Title@2025-07-25 (5): LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation
Title: LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation | LLaVA-NeuMT: Selektive Schicht-Neuron-Modulation für effiziente multimodale Mehrsprachigkeit | LLAVA-NeUMT: 选择性多语层-Neuron 高效多语种多语种多模式翻译的调整 2507.18940v1 |
Authors (8): Jingxuan Wei, Caijun Jia, Qi Chen, Yujun Cai, Linzhuang Sun, Xiangxiang Zhang, Gaowei Wu, Bihui Yu
Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40\% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.
nan
Article 541
Title@2025-07-25 (5): Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks
Title: Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks | Benchmarking des multimodalen Verständnisses und der komplexen Begründung für ESG-Aufgaben | 确定环境组合组合任务多式联运理解和复杂理由的基准 2507.18932v1 |
Authors (8): Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in ESG domain. To fill the gap, we introduce \textbf{MMESGBench}, a first-of-its-kind benchmark dataset targeted to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.
nan
Article 542
Title@2025-07-25 (5): Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters
Title: Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters | Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen | 种子-X:利用7B参数建立强有力的多语种翻译LLM 2507.13618v3 |
Authors (26): Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices through our optimization process, and make the parameter public available for advancing translation research and applications.
nan
Article 543
Title@2025-07-25 (5): Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders
Title: Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders | Entdeckt Cross-Linguistic Disparities in LLMs mit Sparse Autoencodern | 使用 Sparse 自动编码器在 LLM 中解封跨语言差异 2507.18918v1 |
Authors (3): Richmond Sin Jing Xuan, Jalil Huseynov, Yang Zhang
Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium to low resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), Italian (it), medium to low resource languages including Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi), with English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium to low resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.
nan
Article 544
Title@2025-07-25 (5): Mining Contextualized Visual Associations from Images for Creativity Understanding
Title: Mining Contextualized Visual Associations from Images for Creativity Understanding | Bergbau Kontextualisierte visuelle Assoziationen aus Bildern für Kreativität Verständnis | 利用图像促进创造性理解的采矿背景化视觉协会 2507.18915v1 |
Authors (3): Ananya Sahu, Amith Ananthram, Kathleen McKeown
Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.
nan
Article 545
Title@2025-07-25 (5): A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions
Title: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions | Eine systematische Überprüfung der Systeme der wichtigsten retrieval-Augmented Generation (RAG): Fortschritt, Lücken und Zukunftsrichtungen | 系统审查关键回收-养代(RAG)系统:进展、差距和未来方向 2507.18910v1 |
Authors (4): Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, Arpan Biswas
Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components-retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG’s rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.
nan
Article 546
Title@2025-07-25 (5): Large language models provide unsafe answers to patient-posed medical questions
Title: Large language models provide unsafe answers to patient-posed medical questions | Große Sprachmodelle bieten unsichere Antworten auf patientenbezogene medizinische Fragen | 大型语言模式为病人提出的医疗问题提供不安全的答案 2507.18905v1 |
Authors (17): Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany Brazile, Natasha Chase, Dimple Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah
Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots–Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta–on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
nan
Article 547
Title@2025-07-25 (5): SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
Title: SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models | SLoW: Wählen Sie niederfrequente Wörter aus! Automatische Wörterbuchauswahl für Übersetzungen auf großen Sprachmodellen | SLOW: 选择低频单词! 用于大语言模型翻译的自动词典选择 2507.18902v1 |
Authors (4): Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam
There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called \textbf{A}utomatic \textbf{D}ictionary \textbf{S}election (\textbf{ADS}). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call \textbf{S}elect \textbf{Lo}w-frequency \textbf{W}ords! (\textbf{SLoW}) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.\footnote{A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.}\footnote{Code and data available upon publication.}
nan
Article 548
Title@2025-07-25 (5): REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
Title: REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? | REPRO-Bench: Können Agentische KI-Systeme die Reproduzierbarkeit der sozialwissenschaftlichen Forschung bewerten? | REPRO-BENCH: AI系统能否评估社会科学研究的可减少性? 2507.18901v1 |
Authors (6): Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.
nan
Article 549
Title@2025-07-25 (5): Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs
Title: Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs | Kann LLMs Citation Intent voraussagen? Eine experimentelle Analyse des In-Context-Lernens und Feinabstimmungens auf offenen LLMs | LLMs 预测引文意图:对开放式LMs的内文学习和微调的实验分析 2502.14561v3 |
Authors (4): Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos
This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.
nan
Article 550
Title@2025-07-25 (5): A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans
Title: A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans | Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen | 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v4 |
Authors (4): Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
Recently, much work has concerned itself with the enigma of what exactly pretrained language models~(PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well.
nan
Article 551
Title@2025-07-25 (5): NUTMEG: Separating Signal From Noise in Annotator Disagreement
Title: NUTMEG: Separating Signal From Noise in Annotator Disagreement | NUTMEG: Trennen von Signalen von Geräuschen in Annotator-Uneinigkeit | NUTMEG: 在通知器中从噪音中分离信号 2507.18890v1 |
Authors (3): Jonathan Ivey, Susan Gauch, David Jurgens
NLP models often rely on human-labeled data for training and evaluation. Many approaches crowdsource this data from a large number of annotators with varying skills, backgrounds, and motivations, resulting in conflicting annotations. These conflicts have traditionally been resolved by aggregation methods that assume disagreements are errors. Recent work has argued that for many tasks annotators may have genuine disagreements and that variation should be treated as signal rather than noise. However, few models separate signal and noise in annotator disagreement. In this work, we introduce NUTMEG, a new Bayesian model that incorporates information about annotator backgrounds to remove noisy annotations from human-labeled training data while preserving systematic disagreements. Using synthetic data, we show that NUTMEG is more effective at recovering ground-truth from annotations with systematic disagreement than traditional aggregation methods. We provide further analysis characterizing how differences in subpopulation sizes, rates of disagreement, and rates of spam affect the performance of our model. Finally, we demonstrate that downstream models trained on NUTMEG-aggregated data significantly outperform models trained on data from traditionally aggregation methods. Our results highlight the importance of accounting for both annotator competence and systematic disagreements when training on human-labeled data.
nan
Article 552
Title@2025-07-25 (5): MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service
Title: MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service | MindFlow+: Ein selbstständiger Agent für den E-Commerce-Kundendienst | Mind Flow+:电子商务客户服务自我发展代理 2507.18884v1 |
Authors (4): Ming Gong, Xucheng Huang, Ziheng Xu, Vijayan K. Asari
High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model’s role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.
nan
Article 553
Title@2025-07-25 (5): An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
Title: An Investigation of Prompt Variations for Zero-shot LLM-based Rankers | Eine Untersuchung von Prompt-Variationen für Null-Schuss LLM-basierte Ranker | 调查零射中LLM中士的迅速变化情况 2406.14117v4 |
Authors (4): Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks to undermine future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones – but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker’s effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
nan
Article 554
Title@2025-07-25 (5): Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
Title: Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction | Phoneme-Level Visuelle Spracherkennung über Point-Visual Fusion und Sprachmodellsanierung | 通过点-视点融合和语言模式重建确认电话级视觉讲话 2507.18863v1 |
Authors (3): Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh
Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes-distinct sounds that appear identical in lip motions. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM model for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM model, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.
nan
Article 555
Title@2025-07-25 (5): PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning
Title: PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning | PrismRAG: Steigerung der RAG-Faktizität mit Distraktorresilienz und geschichteter Vernunft | PrismRAG:提高RAG事实质量,使其具有抗力和策略性合理性 2507.18857v1 |
Authors (13): Mohammad Kachuee, Teja Gollapudi, Minseok Kim, Yin Huang, Kai Sun, Xiao Yang, Jiaqi Wang, Nirav Shah, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions require deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.
nan
Article 556
Title@2025-07-24 (4): The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming
Title: The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming | Der Kuriose Fall der Klasse Genauigkeit Ungleichgewicht in LLMs: Post-hoc-Debiasing über nichtlineare Integer-Programmierung | LLMLM中分类准确性不平衡的怪案:通过非线性整数编程进行热后脱偏性 2405.07623v7 |
Authors (2): Ruixi Lin, Yang You
Large language models (LLMs) are good knowledge bases but struggle to perform equally well for all classes in text classification. This paper investigates the case of class accuracy imbalance in LLMs, where deeply entangled pretraining biases and prompt-specific cues contribute to the imbalance. To overcome the difficulty in bias identification and inaccessibility of retraining, we post-hoc balance class accuracy using only output probabilities. This is enabled by reformulating debiasing as a combinatorial optimization problem. In details, we first motivate a post-hoc bias metric, the Contextual Oddity Bias (COBias), to quantify the over-/under-prediction (a tendency to over-predict some classes while under-predicting others) in LLMs. We then propose the Debiasing as Nonlinear Integer Programming (DNIP) method to reweight LLM output class probabilities towards minimizing COBias and maximizing overall accuracy, without being constrained by bias sources or updating LLM parameters. Since the DNIP model contains non-differentiable elements, we use simulated annealing to efficiently solve it. Evaluations on five LLMs across NLP classification benchmarks show that DNIP simultaneously achieves significant COBias reduction (61% relative reduction) and accuracy improvement (18% relative increase) under different LLM prompting setups.
nan
Article 557
Title@2025-07-24 (4): R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Title: R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning | R-Stitch: Dynamische Trajektorien-Stitching für effiziente Vernunft | R-Stitch: 高效理性的动态轨迹切换 2507.17307v2 |
Authors (6): Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85\% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
nan
Article 558
Title@2025-07-24 (4): Toward Super Agent System with Hybrid AI Routers
Title: Toward Super Agent System with Hybrid AI Routers | Auf dem Weg zum Super Agent System mit Hybrid-KI Routern | 向超级代理系统过渡 2504.10519v2 |
Authors (8): Yuhang Yao, Haixin Wang, Yibo Chen, Jiawen Wang, Min Chang Jordan Ren, Bosheng Ding, Salman Avestimehr, Chaoyang He
AI Agents powered by Large Language Models are transforming the world through enormous applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This position paper presents a design of the Super Agent System powered by the hybrid AI routers. Upon receiving a user prompt, the system first detects the intent of the user, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications directly serve as AI assistants on edge devices such as phones and robots. As different language models vary in capability and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore the hybrid mode where the router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with cloud. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration only as needed. Such architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.
nan
Article 559
Title@2025-07-24 (4): CueBuddy: helping non-native English speakers navigate English-centric STEM education
Title: CueBuddy: helping non-native English speakers navigate English-centric STEM education | CueBuddy: Hilfe für nicht-native englische Referenten navigieren Englisch-centric STEM Bildung | CueBuddy:帮助非母语英语者掌握以英语为中心的STEM教育 2507.18827v1 |
Authors (1): Pranav Gupta
Students across the world in STEM classes, especially in the Global South, fall behind their peers who are more fluent in English, despite being at par with them in terms of scientific prerequisites. While many of them are able to follow everyday English at ease, key terms in English stay challenging. In most cases, such students have had most of their course prerequisites in a lower resource language. Live speech translation to lower resource languages is a promising area of research, however, models for speech translation can be too expensive on a large scale and often struggle with technical content. In this paper, we describe CueBuddy, which aims to remediate these issues by providing real-time “lexical cues” through technical keyword spotting along real-time multilingual glossary lookup to help students stay up to speed with complex English jargon without disrupting their concentration on the lecture. We also describe the limitations and future extensions of our approach.
nan
Article 560
Title@2025-07-24 (4): Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models
Title: Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models | Promptomatix: Ein automatisches Optimierungs-Framework für große Sprachmodelle | 即时表达式:大语言模型自动快速优化框架 2507.14241v3 |
Authors (9): Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead making prompt optimization scalable and efficient.
nan
Article 561
Title@2025-07-24 (4): Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Title: Analyze Feature Flow to Enhance Interpretation and Steering in Language Models | Feature Flow analysieren, um Interpretation und Steuerung in Sprachmodellen zu verbessern | 分析地貌流动,以加强语言模型的口译和指导 2502.03032v3 |
Authors (4): Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
nan
Article 562
Title@2025-07-24 (4): Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Title: Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs | Palme: Ein kulturell inklusiver und sprachlich vielfältiger Datensatz für arabische LLMs | 棕榈:阿拉伯文LLMLM具有文化包容性和语言多样性的数据集 2503.00151v2 |
Authors (44): Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
nan
Article 563
Title@2025-07-24 (4): Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
Title: Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models | Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle | 速度计划: 遮蔽传播语言模型的饱和日程安排 2506.19037v3 |
Authors (3): Omer Luxembourg, Haim Permuter, Eliya Nachmani
Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasked them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
nan
Article 564
Title@2025-07-24 (4): Evaluating Code-Mixing in LLMs Across 18 Languages
Title: Evaluating Code-Mixing in LLMs Across 18 Languages | Bewertung von Code-Mixing in LLMs in 18 Sprachen | 评估18种语言的LLMs混合编码 2507.18791v1 |
Authors (2): Yilun Yang, Yekun Chai
Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs’ performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.
nan
Article 565
Title@2025-07-24 (4): Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis
Title: Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis | Bewertung großer Sprachmodelle (LLMs) in Financial NLP: Eine vergleichende Studie zur Analyse von Finanzberichten | 评价金融中大语言模型:财务报告分析比较研究 2507.22936v1 |
Authors (1): Md Talha Mohsin
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the ‘Magnificent Seven’ technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
nan
Article 566
Title@2025-07-24 (4): A Fisher’s exact test justification of the TF-IDF term-weighting scheme
Title: A Fisher’s exact test justification of the TF-IDF term-weighting scheme | Genaue Begründung des TF-IDF-Term-Wichtungssystems durch einen Fisher | A Fisher公司对TF-IDF术语加权办法的精确测试理由 2507.15742v2 |
Authors (3): Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque
Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.
nan
Article 567
Title@2025-07-24 (4): ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting
Title: ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting | ylmmcl bei Mehrsprachiger Textentgiftung 2025: Lexikon-geführte Entgiftung und Klassifikator-gestrichenes Umschreiben | 2025年多语言文本解毒:Lexicon-Guid解毒和分类法改写 2507.18769v1 |
Authors (4): Nicole Lai-Lopez, Lusha Wang, Su Yuan, Liza Zhang
In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team: a robust multilingual text detoxification pipeline that integrates lexicon-guided tagging, a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo) and an iterative classifier-based gatekeeping mechanism. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon to guide detoxification with greater precision and cross-lingual generalization. Our final model achieves the highest STA (0.922) from our previous attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets. It also achieved xCOMET scores of 0.793 (dev) and 0.787 (test). This performance outperforms baseline and backtranslation methods across multiple languages, and shows strong generalization in high-resource settings (English, Russian, French). Despite some trade-offs in SIM, the model demonstrates consistent improvements in detoxification strength. In the competition, our team achieved ninth place with a score of 0.612.
nan
Article 568
Title@2025-07-24 (4): Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience
Title: Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience | Auf dem Weg zu strukturiertem Wissen Reasoning: Kontrastive retrieval-erweiterte Generation auf Erfahrung | 实现结构化知识理由:反向取回-积累经验的一代人 2506.00842v2 |
Authors (10): Jiawei Gu, Ziting Xian, Yuanzhen Xie, Ye Liu, Enjie Liu, Ruichao Zhong, Mochi Gao, Yunzhi Tan, Bo Hu, Zang Li
Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.
nan
Article 569
Title@2025-07-24 (4): The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages
Title: The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages | Die Rolle der Orthografiekonsistenz in mehrsprachigen Einbettungsmodellen für die Textklassifizierung in Arabisch-Script-Sprachen | 阿拉伯文和克里普特语文文本分类多语种嵌入模型中正统一致性的作用 2507.18762v1 |
Authors (7): Abdulhady Abas Abdullah, Amir H. Gandomi, Tarik A Rashid, Seyedali Mirjalili, Laith Abualigah, Milena Živković, Hadi Veisi
In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
nan
Article 570
Title@2025-07-24 (4): Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Title: Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition | Lärm Kontrastive Schätzung-basiertes Matching Framework für die Erkennung von Low-Resource-Sicherheitsangriffen | 低资源安保攻击模式识别比对框架 2401.10337v4 |
Authors (3): Tu Nguyen, Nedim Šrndić, Alexander Neth
Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain, described encyclopedically in textual knowledge bases. Identifying TTPs in cybersecurity writing, often called TTP mapping, is an important and challenging task. Conventional learning approaches often target the problem in the classical multi-class or multilabel classification setting. This setting hinders the learning ability of the model due to a large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.
nan
Article 571
Title@2025-07-24 (4): Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Title: Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement | Spezifikation Selbst-Korrektion: Eindämmung von In-Context-Belohnung Hacken durch Test-Zeit-Verfeinerung | 规格自我校正:通过试验-时间精炼进行减速的背负冲洗 2507.18742v1 |
Authors (1): Víctor Gallego
Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user’s true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70\% of cases, the SSC process reduces this vulnerability by over 90\%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction .
nan
Article 572
Title@2025-07-24 (4): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v3 |
Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
nan
Article 573
Title@2025-07-24 (4): An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning
Title: An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning | Effizientes Sparse-Fine-Tuning mit geringem Quantisierungsfehler über Neural Network Pruning | 通过神经网络节制低量错误的高效粗简精细调整 2502.11439v2 |
Authors (2): Cen-Jhih Li, Aditya Bhaskara
Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SpFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SpFT framework, based on ideas from neural network pruning. At a high level, we first identify ``important’’ neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Experiments on common language tasks show our method improves SpFT’s memory efficiency by 20-50\% while matching the accuracy of state-of-the-art methods like LoRA’s variants.
nan
Article 574
Title@2025-07-24 (4): Checklists Are Better Than Reward Models For Aligning Language Models
Title: Checklists Are Better Than Reward Models For Aligning Language Models | Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen | 核对列表比奖励模型更好调整语言模型 2507.18624v1 |
Authors (7): Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu
Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
nan
Article 575
Title@2025-07-24 (4): TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards
Title: TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards | TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen | TRPropt: 从文本奖励中促进解答询问软件快速优化 2507.18618v1 |
Authors (5): Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West
Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.
nan
Article 576
Title@2025-07-24 (4): SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Title: SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning | SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift | 合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明 2507.18616v1 |
Authors (6): Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
nan
Article 577
Title@2025-07-24 (4): BEARCUBS: A benchmark for computer-using web agents
Title: BEARCUBS: A benchmark for computer-using web agents | BEARCUBS: Benchmark für computergestützte Web-Agenten | BEARCUBS:计算机使用网络代理器的基准 2503.07919v3 |
Authors (6): Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “smallbut mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to e.g., Operator’s 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
nan
Article 578
Title@2025-07-24 (4): Trusted Knowledge Extraction for Operations and Maintenance Intelligence
Title: Trusted Knowledge Extraction for Operations and Maintenance Intelligence | Vertrauenswürdige Wissensgewinnung für Operationen und Wartungsintelligenz | 行动和维持情报可信赖的知识采掘 2507.22935v1 |
Authors (5): Kathleen Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II
Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.
nan
Article 579
Title@2025-07-24 (4): Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Title: Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs | Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs | 粗略的登录抽样:加速在LLMs中进行知识蒸馏 2503.16870v2 |
Authors (8): Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation’, which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
nan
Article 580
Title@2025-07-24 (4): Deep Learning Approaches for Multimodal Intent Recognition: A Survey
Title: Deep Learning Approaches for Multimodal Intent Recognition: A Survey | Deep Learning Ansätze zur multimodalen Intent-Erkennung: Eine Umfrage | 多种形式本能识别的深学习方法:调查 2507.22934v1 |
Authors (11): Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou, Jingyao Xue, Junyang Wu, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li
Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.
nan
Article 581
Title@2025-07-24 (4): What Makes You CLIC: Detection of Croatian Clickbait Headlines
Title: What Makes You CLIC: Detection of Croatian Clickbait Headlines | Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen | 是什么让你成为CLIC:发现克罗地亚点击头条头条 2507.14314v2 |
Authors (4): Marija Anđelić, Dominik Šipek, Laura Majer, Jan Šnajder
Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTi'c model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that finetuned models deliver better results than general LLMs.
nan
Article 582
Title@2025-07-24 (4): AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
Title: AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs | AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs | Anilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用 2507.18584v1 |
Authors (7): Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang
Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.
nan
Article 583
Title@2025-07-24 (4): DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data
Title: DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data | DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten | DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索 2507.18583v1 |
Authors (4): Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu
Electronic Health Records (EHRs) are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues. Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces \texttt{DR.EHR}, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of \texttt{DR.EHR}, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperforms all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models’ superiority across various match and query types, particularly in challenging semantic matches like implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, and supplementary experiments on EHR QA datasets demonstrate the models’ generalizability on natural language questions, including complex ones with multiple entities. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.
nan
Article 584
Title@2025-07-24 (4): System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition
Title: System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition | Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung | 供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告 2507.18580v1 |
Authors (4): Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li
This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation(TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.
nan
Article 585
Title@2025-07-24 (4): P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts
Title: P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts | P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts | P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v3 |
Authors (5): Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He
Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.
nan
Article 586
Title@2025-07-24 (4): Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
Title: Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs | Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs | 宽放, 窄出: 为高效和有效DLLMs而可撤销的解码 2507.18578v1 |
Authors (8): Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
nan
Article 587
Title@2025-07-24 (4): LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs
Title: LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs | LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs | LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架 2507.16809v2 |
Authors (10): Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Zhen-Yu Lin, Pin-Cheng Chen, Shu-Kai Hsieh
We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
nan
Article 588
Title@2025-07-24 (4): PosterMate: Audience-driven Collaborative Persona Agents for Poster Design
Title: PosterMate: Audience-driven Collaborative Persona Agents for Poster Design | PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design | PosterMate:由观众驱动的海报设计合作人员代理 2507.18572v1 |
Authors (4): Donghoon Shin, Daniel Lee, Gary Hsieh, Gromit Yeuk-Yin Chan
Poster designing can benefit from synchronous feedback from target audiences. However, gathering audiences with diverse perspectives and reconciling them on design edits can be challenging. Recent generative AI models present opportunities to simulate human-like interactions, but it is unclear how they may be used for feedback processes in design. We introduce PosterMate, a poster design assistant that facilitates collaboration by creating audience-driven persona agents constructed from marketing documents. PosterMate gathers feedback from each persona agent regarding poster components, and stimulates discussion with the help of a moderator to reach a conclusion. These agreed-upon edits can then be directly integrated into the poster design. Through our user study (N=12), we identified the potential of PosterMate to capture overlooked viewpoints, while serving as an effective prototyping tool. Additionally, our controlled online evaluation (N=100) revealed that the feedback from an individual persona agent is appropriate given its persona identity, and the discussion effectively synthesizes the different persona agents’ perspectives.
nan
Article 589
Title@2025-07-24 (4): Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
Title: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods | Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden | 使用字节对等编码和K-MER方法的DNA语言模型混合化战略 2507.18570v1 |
Authors (2): Ganesh Sapkota, Md Hasibur Rahman
This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
nan
Article 590
Title@2025-07-24 (4): GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
Title: GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation | GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung | GIIFT: 图表制导感性不含图像的无图像多式机器翻译 2507.18562v1 |
Authors (2): Jiafeng Xiong, Yuting Zhao
Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.
nan
Article 591
Title@2025-07-24 (4): Identity-related Speech Suppression in Generative AI Content Moderation
Title: Identity-related Speech Suppression in Generative AI Content Moderation | Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation | 在产生AI 内容调节中禁止与身份有关的言语 2409.13725v3 |
Authors (5): Grace Proebsting, Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaé Metaxa, Sorelle A. Friedler
Automated content moderation has long been used to help identify and filter undesired user-generated content online. But such systems have a history of incorrectly flagging content by and about marginalized identities for removal. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. While a lot of focus has been given to making sure such systems do not produce undesired outcomes, considerably less attention has been paid to making sure appropriate text can be generated. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech. We find that reasons for incorrect flagging behavior vary by identity based on stereotypes and text associations, with, e.g., disability-related content more likely to be flagged for self-harm or health-related reasons while non-Christian content is more likely to be flagged as violent or hateful. As generative AI systems are increasingly used for creative work, we urge further attention to how this may impact the creation of identity-related content.
nan
Article 592
Title@2025-07-24 (4): Augmented Vision-Language Models: A Systematic Review
Title: Augmented Vision-Language Models: A Systematic Review | Augmented Vision-Language Models: Eine systematische Bewertung | 增强愿景-语言模型:系统审查 2507.22933v1 |
Authors (4): Anthony C Davis, Burhan Sadiq, Tianmin Shu, Chien-Ming Huang
Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.
nan
Article 593
Title@2025-07-24 (4): FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification
Title: FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification | FinMarBa: Ein marktinformierter Datensatz für die Einstufung von Finanzsentimenten | FinMarba:用于金融敏感度分类的市场化数据集 2507.22932v1 |
Authors (6): Baptiste Lefort, Eric Benhamou, Beatrice Guez, Jean-Jacques Ohana, Ethan Setrouk, Alban Etienne
This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
nan
Article 594
Title@2025-07-24 (4): LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important | LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind | LagKV: KV 缓存告诉哪个 Tokens 重要, 而 KV 缓存的拉格- 相对信息Name 2504.04704v2 |
Authors (4): Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
The increasing size of the Key-Value (KV) cache during the Large Language Models long-context inference is the main obstacle for its balance between the deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged on the attention weight to evict non-critical cache tokens. But there is a trade-off in those methods, they usually require major modification of the inference infrastructure and significant computation overhead. Based on the fact that the Large Language models are autoregressive models, we propose LagKV, a KV compression strategy only relying on straight forward comparison among KV themselves. It is a totally attention free method which offers easy integration to the main stream inference platform and comparable performance comparing to other complicated KV compression methods. Results on RULER benchmark show that, our approach outperforms SnapKV and StreamingLLM in different compression ratios. Especially in the 64-digit passkey retrieval task, our method outperforms the attention weight based method $H_2O$ over $50\%$ with same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
nan
Article 595
Title@2025-07-24 (4): GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface
Title: GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface | GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle | GLINER2:具有Schema-Driven界面的高效多任务信息提取系统 2507.18546v1 |
Authors (5): Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis
Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.
nan
Article 596
Title@2025-07-24 (4): Effective Multi-Task Learning for Biomedical Named Entity Recognition
Title: Effective Multi-Task Learning for Biomedical Named Entity Recognition | Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung | 有效多任务学习促进生物医学命名实体的识别 2507.18542v1 |
Authors (4): João Ruano, Gonçalo M. Correia, Leonor Barreiros, Afonso Mendes
Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.
nan
Article 597
Title@2025-07-24 (4): The Moral Gap of Large Language Models
Title: The Moral Gap of Large Language Models | Die moralische Kluft großer Sprachmodelle | 大语言模式的道德差距 2507.18523v1 |
Authors (2): Maciej Skorski, Alina Landowska
Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
nan
Article 598
Title@2025-07-24 (4): GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks
Title: GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks | GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Charaktergleichheitsnetzwerke | 海合会-Spam:通过全球大气监测网、反竞争学习和特征相似网络探测垃圾邮件 2507.14679v2 |
Authors (3): Zhijie Wang, Zixin Xu, Zhiyuan Pan
The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.
nan
Article 599
Title@2025-07-24 (4): Exploiting individual differences to bootstrap communication
Title: Exploiting individual differences to bootstrap communication | Nutzung individueller Unterschiede zur Bootstrap-Kommunikation | 利用个人差异进行靴套通信 2504.05211v2 |
Authors (2): Richard A. Blythe, Casimir Fisch
Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.
nan
Article 600
Title@2025-07-24 (4): Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models
Title: Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models | Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen | 并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习 2507.18504v1 |
Authors (4): Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
nan
Article 601
Title@2025-07-24 (4): LLM-based Embedders for Prior Case Retrieval
Title: LLM-based Embedders for Prior Case Retrieval | LLM-basierte Embedders für frühere Fallwiederherstellung | 用于先前个案检索的LLM 以LLM为基础的嵌入器 2507.18455v1 |
Authors (3): Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
nan
Article 602
Title@2025-07-24 (4): Generation of Synthetic Clinical Text: A Systematic Review
Title: Generation of Synthetic Clinical Text: A Systematic Review | Generieren von synthetischem klinischem Text: Ein systematischer Test | 合成临床文本的生成:系统审查 2507.18451v1 |
Authors (5): Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, Venkata Satagopam
Generating clinical synthetic text represents an effective solution for common clinical NLP issues like sparsity and privacy. This paper aims to conduct a systematic review on generating synthetic medical free-text by formulating quantitative analysis to three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We have identified 94 relevant articles out of 1,398 collected ones. A great deal of attention has been given to the generation of synthetic medical text from 2018 onwards, where the main purpose of such a generation is towards text augmentation, assistive writing, corpus building, privacy-preserving, annotation, and usefulness. Transformer architectures were the main predominant technique used to generate the text, especially the GPTs. On the other hand, there were four main aspects of evaluation, including similarity, privacy, structure, and utility, where utility was the most frequent method used to assess the generated synthetic medical text. Although the generated synthetic medical text demonstrated a moderate possibility to act as real medical documents in different downstream NLP tasks, it has proven to be a great asset as augmented, complementary to the real documents, towards improving the accuracy and overcoming sparsity/undersampling issues. Yet, privacy is still a major issue behind generating synthetic medical text, where more human assessments are needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.
nan
Article 603
Title@2025-07-24 (4): Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language
Title: Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language | Wiederherstellung des Rhythmus: Pünktlichkeitsrestaurierung mit Transformer-Modellen für Bangla, eine Sprache mit geringer Ressource | 恢复时速:使用孟加拉国低资源语言 “ 孟加拉 “ 变压器模型恢复脉冲 2507.18448v1 |
Authors (4): Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu
Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
nan
Article 604
Title@2025-07-24 (4): AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data
Title: AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data | AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten | 阿拉伯表格:按基准确定LLM女士对阿拉伯表格数据的理由和理解 2507.18442v1 |
Authors (3): Rana Alshaikh, Israa Alghanmi, Shelan Jeawak
The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.
nan
Article 605
Title@2025-07-24 (4): IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation
Title: IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation | IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung | ICPCGRL: 程序生成阶段语言教学强化学习 2503.12358v4 |
Authors (5): In-Chang Baek, Sung-Hyun Kim, Seo-Young Lee, Dong-Hyeon Kim, Kyung-Joong Kim
Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.
nan
Article 606
Title@2025-07-24 (4): DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts
Title: DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts | DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten | DFAME: 与多式联运专家进行动态证据法检查 2412.10510v4 |
Authors (4): Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach
The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.
nan
Article 607
Title@2025-07-24 (4): How do language models learn facts? Dynamics, curricula and hallucinations
Title: How do language models learn facts? Dynamics, curricula and hallucinations | Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen | 语言模式如何了解事实?动态、课程和幻觉 2503.21676v2 |
Authors (6): Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De
Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.
nan
Article 608
Title@2025-07-24 (4): FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs
Title: FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs | FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs | FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度 2507.18417v1 |
Authors (3): Giorgos Iacovides, Wuyang Zhou, Danilo Mandic
Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on the average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel ‘logit-to-score’ conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
nan
Article 609
Title@2025-07-24 (4): ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
Title: ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models | Explica: Explizite kausale Vernunft in großen Sprachmodellen bewerten | ExpliCa:在大语言模型中评估明确的原因原因 2502.15487v3 |
Authors (7): Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci
Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
nan
Article 610
Title@2025-07-24 (4): Enhancing RAG Efficiency with Adaptive Context Compression
Title: Enhancing RAG Efficiency with Adaptive Context Compression | Steigerung der RAG-Effizienz durch adaptive Kontextkompression | 提高RAG效率,同时采取适应性环境压缩措施 2507.22931v1 |
Authors (2): Shuyu Guo, Zhaochun Ren
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.
nan
Article 611
Title@2025-07-24 (4): Factual Inconsistencies in Multilingual Wikipedia Tables
Title: Factual Inconsistencies in Multilingual Wikipedia Tables | Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen | 多语言维基百科表格中的事实不一致 2507.18406v1 |
Authors (6): Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo
Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia’s structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.
nan
Article 612
Title@2025-07-24 (4): CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
Title: CLEAR: Error Analysis via LLM-as-a-Judge Made Easy | CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht | CLLEAR:通过LLM-as-a法官进行错误分析 2507.18392v1 |
Authors (5): Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer
The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model’s performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.
nan
Article 613
Title@2025-07-24 (4): Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games
Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games | Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games | 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v2 |
Authors (6): David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
nan
Article 614
Title@2025-07-24 (4): Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs
Title: Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs | 超越简介:从地平面事实到深人模拟LLMM 2502.12988v3 |
Authors (6): Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
Previous approaches to persona simulation large language models (LLMs) have typically relied on learning basic biographical information, or using limited role-play dialogue datasets to capture a character’s responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought patterns as manifested in the textual works of a character. Using Lu Xun, a renowned Chinese writer as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun’s internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep character persona simulation LLMs while considering the importance of ethical standards.
nan
Article 615
Title@2025-07-24 (4): Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection
Title: Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection | Schutz gefährdeter Stimmen: Synthetische Datensatzgenerierung zur Selbstdetektion | 保护弱势声音:为自我披露检测合成数据集生成 2507.22930v1 |
Authors (4): Shalini Jangra, Suparna De, Nishanth Sastry, Saeed Fadaei
Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.
nan
Article 616
Title@2025-07-24 (4): Mechanistic Indicators of Understanding in Large Language Models
Title: Mechanistic Indicators of Understanding in Large Language Models | Mechanistische Indikatoren des Verstehens in großen Sprachmodellen | 大语言模型中理解力的机械指标 2507.08017v3 |
Authors (2): Pierre Beckmann, Matthieu Queloz
Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. We offer an accessible synthesis of these findings that doubles as an introduction to MI while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” connecting these facts. However, these forms of understanding remain radically different from human understanding, as the phenomenon of “parallel mechanisms” shows. We conclude that the debate should move beyond the yes-or-no question of whether LLMs understand to investigate how their strange minds work and forge conceptions that fit them.
nan
Article 617
Title@2025-07-24 (4): Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence
Title: Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence | Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz | 宣传探测混合说明:将LLM预告与人类情报相结合 2507.18343v1 |
Authors (6): Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt
Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.
nan
Article 618
Title@2025-07-24 (4): TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning
Title: TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning | TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen | TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习 2507.18340v1 |
Authors (7): Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng chen
In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL heavily relies on the quality of these examples, and previous works which focused on enhancing example retrieval capabilities have achieved impressive performances. However, two challenges remain in retrieving high-quality examples: (1) Difficulty in distinguishing cross-task data distributions, (2) Difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks, the results demonstrate that TDR consistently improved results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at https://github.com/Nnn-s/TDR.
nan
Article 619
Title@2025-07-24 (4): Uncertainty Quantification for Evaluating Machine Translation Bias
Title: Uncertainty Quantification for Evaluating Machine Translation Bias | Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias | 评价机器翻译偏见的不确定性定量 2507.18338v1 |
Authors (3): Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos
In machine translation (MT), when the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and/or external knowledge. Studies have shown that MT models exhibit biased behaviour, relying on stereotypes even when they clash with contextual information. We posit that apart from confidently translating using the correct gender when it is evident from the input, models should also maintain uncertainty about the gender when it is ambiguous. Using recently proposed metrics of semantic uncertainty, we find that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected level of uncertainty in ambiguous ones. Similarly, debiasing has independent effects on ambiguous and unambiguous translation instances.
nan
Article 620
Title@2025-07-24 (4): EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Title: EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow | EH-Benchmark Ophthalmische Halluzination Benchmark und Agent-getriebene Top-Down-Rückverfolgbarkeit Workflow | EH-Benchmark Ophthalmic 幻觉基准和代理Dripreven 顶底可追踪合理理由工作流程 2507.22929v1 |
Authors (8): Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
nan
Article 621
Title@2025-07-24 (4): A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 | Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 | 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v2 |
Authors (5): Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski
Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM’s, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
nan
Article 622
Title@2025-07-24 (4): BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit
Title: BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit | BadReasoner: Pflanzung Tunable Überdenken Hintertüren zu großen Grundmodellen für Spaß oder Gewinn | BadReasoner: 将金枪鱼可变性过度思考的后门规划成娱乐或利润的大理由模型 2507.18305v1 |
Authors (7): Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.
nan
Article 623
Title@2025-07-24 (4): LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models
Title: LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models | LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle | LoRA-Leak:对LORA精调语言模式的成员推论攻击 2507.18302v1 |
Authors (6): Delong Ran, Xinlei He, Tianshuo Cong, Anyu Wang, Qi Li, Xiaoyun Wang
Language Models (LMs) typically adhere to a “pre-training and fine-tuning” paradigm, where a universal pre-trained model can be fine-tuned to cater to various specialized domains. Low-Rank Adaptation (LoRA) has gained the most widespread use in LM fine-tuning due to its lightweight computational cost and remarkable performance. Because the proportion of parameters tuned by LoRA is relatively small, there might be a misleading impression that the LoRA fine-tuning data is invulnerable to Membership Inference Attacks (MIAs). However, we identify that utilizing the pre-trained model can induce more information leakage, which is neglected by existing MIAs. Therefore, we introduce LoRA-Leak, a holistic evaluation framework for MIAs against the fine-tuning datasets of LMs. LoRA-Leak incorporates fifteen membership inference attacks, including ten existing MIAs, and five improved MIAs that leverage the pre-trained model as a reference. In experiments, we apply LoRA-Leak to three advanced LMs across three popular natural language processing tasks, demonstrating that LoRA-based fine-tuned LMs are still vulnerable to MIAs (e.g., 0.775 AUC under conservative fine-tuning settings). We also applied LoRA-Leak to different fine-tuning settings to understand the resulting privacy risks. We further explore four defenses and find that only dropout and excluding specific LM layers during fine-tuning effectively mitigate MIA risks while maintaining utility. We highlight that under the “pre-training and fine-tuning” paradigm, the existence of the pre-trained model makes MIA a more severe risk for LoRA-based LMs. We hope that our findings can provide guidance on data privacy protection for specialized LM providers.
nan
Article 624
Title@2025-07-24 (4): DocTER: Evaluating Document-based Knowledge Editing
Title: DocTER: Evaluating Document-based Knowledge Editing | DocTER: Dokumentbasierte Wissensbearbeitung bewerten | 评价基于文件的知识编辑 2308.09954v2 |
Authors (7): Suhang Wu, Ante Wang, Minlong Peng, Yujie Lin, Wenbo Li, Mingming Sun, Jinsong Su
Knowledge editing aims to correct outdated or inaccurate knowledge in neural networks. In this paper, we explore knowledge editing using easily accessible documents instead of manually labeled factual triples employed in earlier research. To advance this field, we establish the first evaluation benchmark, \textit{DocTER}, featuring Documents containing counterfactual knowledge for editing. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. To adapt conventional triplet-based knowledge editing methods for this task, we develop an Extract-then-Edit pipeline that extracts triples from documents before applying existing methods. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples. In document-based scenarios, even the best-performing in-context editing approach still lags behind by 10 points in editing success when compared to using gold triples. This observation also holds for both reasoning and cross-lingual test sets. We further analyze key factors influencing task performance, including the quality of extracted triples, the frequency and position of edited knowledge in documents, various methods for enhancing reasoning, and performance differences across various directions in cross-lingual knowledge editing, which provide valuable insights for future research.
nan
Article 625
Title@2025-07-24 (4): Step-Audio 2 Technical Report
Title: Step-Audio 2 Technical Report | Schritt-Audio 2 Technischer Bericht | 技术报告 2507.16632v2 |
Authors (109): Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
nan
Article 626
Title@2025-07-24 (4): VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Title: VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben | VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集 2407.19795v2 |
Authors (5): Juhwan Choi, Junehyoung Kwon, JungMin Yun, Seunguk Yu, YoungBin Kim
Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.
nan
Article 627
Title@2025-07-24 (4): StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer
Title: StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer | StyleAdaptedLM: Weiterentwicklung der Anleitung nach Modellen mit effizienter Stylistik-Übertragung | StypeAddapedLM:按照高效立体转让模式加强教学 2507.18294v1 |
Authors (5): Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu
Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tones, is crucial for enterprise communication but challenging to achieve from corpora which lacks instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.
nan
Article 628
Title@2025-07-24 (4): How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
Title: How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding | Wie denkt die Kette des Denkens? Mechanistische Interpretierbarkeit von Chain-of-Thought-Reasoning mit Sparse Autoencoding | 思维链思维链是如何思考的? 2507.22928v1 |
Authors (3): Xi Chen, Aske Plaat, Niki van Stein
Chain-of-thought (CoT) prompting boosts Large Language Models accuracy on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
nan
Article 629
Title@2025-07-24 (4): Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Title: Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil | Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil | 低资源语言的准确性:僧伽罗语和泰米尔语比较分析 2507.18264v1 |
Authors (2): Nevidu Jayatilleke, Nisansa de Silva
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
nan
Article 630
Title@2025-07-24 (4): Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models
Title: Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models | Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen | 目的和重点:加强语言语言模式术语翻译 2507.18263v1 |
Authors (9): Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su
Direct speech translation (ST) has garnered increasing attention nowadays, yet the accurate translation of terminology within utterances remains a great challenge. In this regard, current studies mainly concentrate on leveraging various translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and can not fully utilize the translation knowledge. To address these issues, in this paper, we propose a novel Locate-and-Focus method for terminology translation. It first effectively locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.
nan
Article 631
Title@2025-07-24 (4): Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning
Title: Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning | Multimodale Verhaltensmusteranalyse mit Eye-Tracking und LLM-basierter Vernunft | 以眼跟踪和基于LLM的理由进行多模式行为模式分析 2507.18252v1 |
Authors (4): Dongyang Guo, Yasmeen Abdrabou, Enkeleda Thaqi, Enkelejda Kasneci
Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert-Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human-computer interaction, and educational analytics.
nan
Article 632
Title@2025-07-24 (4): Meta Prompting for AI Systems
Title: Meta Prompting for AI Systems | Meta Prompting für KI-Systeme | AI 系统的模拟模拟 2311.11482v8 |
Authors (3): Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao
We introduce Meta Prompting (MP), a framework that elevates the reasoning capabilities of large language models (LLMs) by focusing on the formal structure of a task rather than content-specific examples. We establish a theoretical foundation for this paradigm, formalizing MP as a functor that maps a category of tasks to a category of structured prompts, thereby guaranteeing that compositional problem-solving strategies can be systematically decomposed into modular prompt structures. We extend this concept to Recursive Meta Prompting (RMP), an automated process where an LLM can generate and refine its own prompts. We model this self-improvement loop formally as a monad, providing a principled framework for automated prompt engineering. Our claims are validated through extensive experiments demonstrating that a Qwen-72B base model, guided by a single, example-agnostic meta-prompt, achieves state-of-the-art results on MATH, GSM8K, and Game of 24. These results are achieved with substantial token efficiency gains over traditional few-shot methods. Project Page: https://github.com/meta-prompting/meta-prompting.
nan
Article 633
Title@2025-07-24 (4): Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
Title: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation | Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation | Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐 2507.18212v1 |
Authors (8): Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan
Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19\% of the original model’s question-answering performance, outperforming the baseline by 4.01%.
nan
Article 634
Title@2025-07-24 (4): Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge
Title: Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge | Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen | 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v2 |
Authors (6): Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming NL into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.
nan
Article 635
Title@2025-07-24 (4): Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
Title: Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation | Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen | 探讨指导指导对LLM对错误信息易感性的影响 2507.18203v1 |
Authors (5): Kyubeen Han, Junseo Jang, Hongjin Kim, Geunyeong Jeong, Harksoo Kim
Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model’s dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradict their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLM’s susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.
nan
Article 636
Title@2025-07-24 (4): Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection
Title: Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection | Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung | 使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法 2507.18202v1 |
Authors (4): San Kim, Jonghwi Kim, Yejin Jeon, Gary Geunbae Lee
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk, attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
nan
Article 637
Title@2025-07-24 (4): Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization
Title: Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization | Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation | 将符合ISO30401的知识管理系统纳入一个组织的现有业务流程 2507.18197v1 |
Authors (2): Aline Belloni, Patrick Prieur
Business process modeling is used by most organizations as an essential framework for ensuring efficiency and effectiveness of the work and workflow performed by its employees and for ensuring the alignment of such work with its strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the set up of a Knowledge Management System in an organization. As ``ISO30401 implementers’’ we regularly face the challenge of explaining our clients how the knowledge development, transformation and conveyances activities depicted in ISO30401 do integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.
nan
Article 638
Title@2025-07-24 (4): SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
Title: SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models | ANWENDUNGSBEREICH: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle | SCOPE:评估大语言模式的施虐和反偏见选择安置 2507.18182v1 |
Authors (3): Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo
Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
nan
Article 639
Title@2025-07-24 (4): Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Title: Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models | Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen | 坚持平均值:在文本嵌入模型中检测粘力 2507.18171v1 |
Authors (5): Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang
Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
nan
Article 640
Title@2025-07-24 (4): Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
Title: Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges | Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR | 最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾 2507.18161v1 |
Authors (12): Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50\% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11\%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
nan
Article 641
Title@2025-07-24 (4): A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects
Title: A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects | Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven | 事件原因识别调查:分类、挑战、评估和前景 2411.10371v5 |
Authors (5): Qing Cheng, Zefan Zeng, Xingchen Hu, Yuehang Si, Zhong Liu
Event Causality Identification (ECI) has become an essential task in Natural Language Processing (NLP), focused on automatically detecting causal relationships between events within texts. This comprehensive survey systematically investigates fundamental concepts and models, developing a systematic taxonomy and critically evaluating diverse models. We begin by defining core concepts, formalizing the ECI problem, and outlining standard evaluation protocols. Our classification framework divides ECI models into two primary tasks: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). For SECI, we review models employing feature pattern-based matching, machine learning classifiers, deep semantic encoding, prompt-based fine-tuning, and causal knowledge pre-training, alongside data augmentation strategies. For DECI, we focus on approaches utilizing deep semantic encoding, event graph reasoning, and prompt-based fine-tuning. Special attention is given to recent advancements in multi-lingual and cross-lingual ECI, as well as zero-shot ECI leveraging Large Language Models (LLMs). We analyze the strengths, limitations, and unresolved challenges associated with each approach. Extensive quantitative evaluations are conducted on four benchmark datasets to rigorously assess the performance of various ECI models. We conclude by discussing future research directions and highlighting opportunities to advance the field further.
nan
Article 642
Title@2025-07-24 (4): Large Language Models in Argument Mining: A Survey
Title: Large Language Models in Argument Mining: A Survey | Große Sprachmodelle im Argumentbergbau: Eine Umfrage | 争议采矿大语言模型:调查 2506.16383v4 |
Authors (5): Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
nan
Article 643
Title@2025-07-24 (4): Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
Title: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models | Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle | 争取更大程度的利用:提高有效混合专家语言模式法的规模 2507.17702v2 |
Authors (6): Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configurations (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
nan
Article 644
Title@2025-07-24 (4): MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
Title: MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning | Mathopeval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning | MathOPEval:数学理由中MLLMs视觉操作精美评价基准 2507.18140v1 |
Authors (8): Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM’s code-based capabilities in multi-modal mathematical reasoning.Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
nan
Article 645
Title@2025-07-24 (4): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation | OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation | OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v4 |
Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable’’ human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.
nan
Article 646
Title@2025-07-24 (4): Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes
Title: Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes | Aktive Bewertung und Erlernen der Unterscheidungen, die wichtig sind: Vakzin-Sicherheitssignalerkennung aus Not-Triage-Notizen | 积极评价和学习重要的区别:疫苗安全信号从紧急分级记录中探测到的疫苗安全信号 2507.18123v1 |
Authors (7): Sedigh Khademi, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila, Jim Black
The rapid development of COVID-19 vaccines has showcased the global communitys ability to combat infectious diseases. However, the need for post-licensure surveillance systems has grown due to the limited window for safety data collection in clinical trials and early widespread implementation. This study aims to employ Natural Language Processing techniques and Active Learning to rapidly develop a classifier that detects potential vaccine safety issues from emergency department notes. ED triage notes, containing expert, succinct vital patient information at the point of entry to health systems, can significantly contribute to timely vaccine safety signal surveillance. While keyword-based classification can be effective, it may yield false positives and demand extensive keyword modifications. This is exacerbated by the infrequency of vaccination-related ED presentations and their similarity to other reasons for ED visits. NLP offers a more accurate and efficient alternative, albeit requiring annotated data, which is often scarce in the medical field. Active learning optimizes the annotation process and the quality of annotated data, which can result in faster model implementation and improved model performance. This work combines active learning, data augmentation, and active learning and evaluation techniques to create a classifier that is used to enhance vaccine safety surveillance from ED triage notes.
nan
Article 647
Title@2025-07-24 (4): When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems
Title: When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems | Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen | 当自治时,罗格:准备应对社会系统中多机构串通的风险 2507.14660v2 |
Authors (7): Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao
Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.
nan
Article 648
Title@2025-07-24 (4): Agentic AI framework for End-to-End Medical Data Inference
Title: Agentic AI framework for End-to-End Medical Data Inference | Agentische KI-Framework für Ende-zu-Ende medizinische Datenableitung | 最终至最终医疗数据推断的AA AA 框架框架 2507.18115v1 |
Authors (5): Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal, Navin Kumar, Koustuv Saha
Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the ``Model Inference Agent” runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.
nan
Article 649
Title@2025-07-24 (4): A New Pair of GloVes
Title: A New Pair of GloVes | Ein neues Paar GloVes | 新的地球之对 2507.18103v1 |
Authors (3): Riley Carlson, John Bauer, Christopher D. Manning
This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.
nan
Article 650
Title@2025-07-24 (4): Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation
Title: Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation | Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch | 长短距离远距神经神经网络和改进课程学习,以在对话中认识情感 2507.15205v2 |
Authors (3): Xinran Li, Xiujuan Xu, Jiaqi Qiao
Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.
nan
Article 651
Title@2025-07-24 (4): ELITE: Enhanced Language-Image Toxicity Evaluation for Safety
Title: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety | ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit | ELITE:加强语言-图像安全毒性评价 2502.04757v3 |
Authors (8): Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. Therefore, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but unharmful descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.
nan
Article 652
Title@2025-07-24 (4): Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints
Title: Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints | Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen | 大语言模式统一调整和统一调整适用:在资源限制下的方法和基准 2507.18076v1 |
Authors (3): Haomin Qi, Zihan Dai, Chengbo Huang
Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks – GLUE, GSM8K, MT-Bench, and HumanEval – using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.
nan
Article 653
Title@2025-07-24 (4): BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Title: BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference | BlockDialekt: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz | BlockDiaect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v5 |
Authors (2): Wonsuk Jang, Thierry Tambe
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
nan
Article 654
Title@2025-07-24 (4): TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
Title: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios | TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien | TELEVAL:为中文互动假想中的口语模式设计的一个动态基准 2507.18061v1 |
Authors (14): Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs’ effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model’s ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.
nan
Article 655
Title@2025-07-24 (4): Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias
Title: Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias | Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias | 《LLMM中因果测试性别偏见:职业偏见案例研究》 2212.10678v4 |
Authors (5): Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin
Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework. Our code and data https://github.com/chenyuen0103/gender-bias.
nan
Article 656
Title@2025-07-24 (4): A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
Title: A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models | Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle | 评估由大语言模型生成的合成数据多面评价框架 2404.14445v2 |
Authors (3): Yefeng Yuan, Yuhong Liu, Liang Cheng
The rapid advancements in generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data, particularly in the realm of structured tabular formats, such as product reviews. Despite the potential benefits, concerns regarding privacy leakage have surfaced, especially when personal information is utilized in the training datasets. In addition, there is an absence of a comprehensive evaluation framework capable of quantitatively measuring the quality of the generated synthetic data and their utility for downstream tasks. In response to this gap, we introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data via a suite of diverse evaluation metrics. We validate the efficacy of our proposed framework - SynEval - by applying it to synthetic product review data generated by three state-of-the-art LLMs: ChatGPT, Claude, and Llama. Our experimental findings illuminate the trade-offs between various evaluation metrics in the context of synthetic data generation. Furthermore, SynEval stands as a critical instrument for researchers and practitioners engaged with synthetic tabular data,, empowering them to judiciously determine the suitability of the generated data for their specific applications, with an emphasis on upholding user privacy.
nan
Article 657
Title@2025-07-24 (4): Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs
Title: Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs | Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs | 使用LLMMs以多种写作风格生成的隐私-保护合成审查 2507.18055v1 |
Authors (6): Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng
The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs’ capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
nan
Article 658
Title@2025-07-24 (4): From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Title: From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems | Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen | 从假设到出版物:AI-Driven研究支助系统综合调查 2503.01424v3 |
Authors (14): Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
nan
Article 659
Title@2025-07-24 (4): RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models
Title: RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models | EINGEDENK: Ein ungebundener Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle | 回顾:对大型愿景-语言模型的无约束资源消费攻击 2507.18053v1 |
Authors (9): Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Kun Wang, Yang Liu, Junlan Feng
Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (\textbf{RE}source \textbf{C}onsumption \textbf{A}ttack on \textbf{L}arge Vision-\textbf{L}anguag\textbf{E} Mo\textbf{D}els), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present \textit{Vision Guided Optimization}, a fine-grained pixel-level optimization, to obtain \textit{Output Recall} adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce \textit{Multi-Objective Parallel Losses} to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26 $\uparrow$, resulting in an additional 20\% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.
nan
Article 660
Title@2025-07-24 (4): Segmentation-free Goodness of Pronunciation
Title: Segmentation-free Goodness of Pronunciation | Segmentierungsfreie Güte der Aussprache | 读音良好 2507.16838v2 |
Authors (4): Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi
Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues as well as a proper normalization which makes the method applicable with acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
nan
Article 661
Title@2025-07-24 (4): Synthetic Data Generation for Phrase Break Prediction with Large Language Model
Title: Synthetic Data Generation for Phrase Break Prediction with Large Language Model | Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell | 制作用于大语言模范大语言时段间断预测的合成数据 2507.18044v1 |
Authors (4): Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim
Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.
nan
Article 662
Title@2025-07-24 (4): GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
Title: GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs | GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs | GrAInS:LLMs和VLMs的推论时间指导的逐步归属 2507.18043v1 |
Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
nan
Article 663
Title@2025-07-24 (4): AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Title: AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark | AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark | AIR-Bench:自动异源信息检索基准 2412.13102v4 |
Authors (9): Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, Zheng Liu
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
nan
Article 664
Title@2025-07-24 (4): NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Title: NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database | NeuralDB: Skalierung von Wissen in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank | NeuralDDB: 将知识编辑在LLM 中到 100,000 千兆瓦的Neural KV 数据库中 2507.18028v1 |
Authors (10): Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu
Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L\&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L\&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module, % In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (\textbf{50x} more than in prior work).
nan
Article 665
Title@2025-07-24 (4): GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures
Title: GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures | GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen | GRR-CoCa:在多模式建模中利用LLM机制 2507.18009v1 |
Authors (6): Jake R. Patock, Nicole Catherine Lewis, Kevin McCoy, Christina Gomez, Canling Chen, Lorenzo Luzi
State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.
nan