cs.CL @ 2025-06-01: 1359
-
00 05-29 (4) From Chat Logs to Collective Insights: Aggregative Question Answering Von Chat Logs zu Collective Insights: Aggregative Question Answering 从聊天日志到集体透视:聚合问题解答 2505.23765v1 -
01 05-29 MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence MMSI-Bench: Ein Benchmark für multi-Image-Spatial Intelligence MMSI-Bench:多图像空间智能基准 2505.23764v1 -
02 05-29 ZeroGUI: Automating Online GUI Learning at Zero Human Cost ZeroGUI: Automatisieren des Online-GUI-Lernens zu null menschlichen Kosten 零GUI: 实现零人成本在线用户界面学习自动化 2505.23762v1 -
03 05-29 Differential Information: An Information-Theoretic Perspective on Preference Optimization Differentialinformation: Eine informationstheoretische Perspektive zur Preference-Optimierung 差别信息:关于首选优化的信息理论观点 2505.23761v1 -
04 05-29 Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint Puzzlet von Puzzles: Wenn Vision-Language-Modelle keinen Hinweis aufnehmen können 由谜题拼取的谜题: 当视觉语言模型无法使用提示时 2505.23759v1 -
05 05-29 DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning DeepTheorem: Verbesserung des LLM-Reasonings für Theorembeweise durch natürliche Sprache und Reinforcement Learning DeepTheorem:通过自然语言和强化学习提升LLM的定理证明推理能力 2505.23754v1 -
06 05-29 ATLAS: Learning to Optimally Memorize the Context at Test Time ATLAS: Optimales Erlernen des Kontextes zur Testzeit ATLAS: 学习在测试时最充分记住上下文 2505.23735v1 -
07 05-29 Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time Begrenzte Rationalität für LLMs: Satisficing-Ausrichtung zur Inferenzzeit LLM的有限理性:推理时的满意化对齐 2505.23729v1 -
08 05-29 ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering ML-Agent: Verstärkung von LLM-Agenten für autonomes Machine-Learning-Engineering ML-代理:加强自主机器学习工程的LLM代理 2505.23723v1 -
09 05-29 Label-Guided In-Context Learning for Named Entity Recognition Labelgeführtes In-Context-Lernen für die benannte Entitätserkennung 为命名实体识别进行Label-Guided In-Context 学习 2505.23722v1 -
10 05-29 Length-Controlled Margin-Based Preference Optimization without Reference Model Längengesteuerte Margenbasierte Preference-Optimierung ohne Referenzmodell 无参考模型的优化 2502.14643v2 -
11 05-29 Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models Nehmen Sie die Prämisse nicht als gegeben hin: Bewertung der Fähigkeit großer Sprachmodelle zur Prämissenkritik 不要把前提视为理所当然:评估大语言模型的前提批判能力 2505.23715v1 -
12 05-29 SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods SenWiCh: Sense-Annotation von Low-Resource-Sprachen für WiC mit Hybrid-Methoden SenWiCh: 使用混合方法为WiC进行低资源语言的词义标注 2505.23714v1 -
13 05-29 SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models SocialMaze: Ein Benchmark für die Bewertung sozialer Vernunft in großen Sprachmodellen 社会领域:用大语言模式评价社会原因的基准 2505.23713v1 -
14 05-29 Neuro-symbolic Training for Reasoning over Spatial Language Neuro-symbolisches Training zur Vernunft über räumliche Sprache 以空间语言为借口的神经主义培训 2406.13828v3 -
15 05-29 Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM’s Math Capability Let’s Reason Formally: Natürlich-Formal Hybrid Reasoning verbessert LLMs Math Capability 让我们正式解释一下: 自然-正规混合理由提高LLM的数学能力 2505.23703v1 -
16 05-29 Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation Kann LLMs abstrakt über Math Word Probleme ohne CoT? Entwirren Abstrakte Formulierung von Arithmetik Computation 没有 CoT,LLMs 理学原理可以抽象地克服数学词问题吗? 2505.23701v1 -
17 05-29 VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos VF-Eval: Bewertung multimodaler LLMs zur Erzeugung von Feedback auf AIGC-Videos VF-Eval:评价多式LLMs,以生成对AIGC视频的反馈 2505.23693v1 -
18 05-29 Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models Kinderorientierte Sprache fördert nicht konsequent das Syntax-Lernen in Sprachmodellen 在语言模式中促进语法学习 2505.23689v1 -
19 05-29 Automatic classification of stop realisation with wav2vec2.0 Automatische Klassifizierung der Stop-Umsetzung mit wav2vec2.0 以 wav2vec2. 0 自动分类停止实现时间 2505.23688v1 -
20 05-29 GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents GSO: Herausfordernde Software-Optimierungsaufgaben zur Bewertung von SWE-Agenten GSO:评估SWE-Agents的有挑战的软件优化任务 2505.23671v1 -
21 05-29 LoLA: Low-Rank Linear Attention With Sparse Caching LoLA: Low-Rank Lineare Aufmerksamkeit mit Sparse Caching LoLA: 低兰克线性注意, 以粗糙的缓存 2505.23666v1 -
22 05-29 Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models Mehrsprachige Frage-Antworten in Low-Resource-Einstellungen: Ein Dzongkha-Englischer Benchmark für Stiftungsmodelle 低资源环境下的多语言问题解答:基础模型的Dzongkha-英语基准 2505.18638v2 -
23 05-29 ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions ToolHaystack: Stress-Testing Tool-Augmented Language Models in realistischen Langzeit-Interaktionen 工具 Haystack:现实长期互动中的压力测试工具增强语言模式 2505.23662v1 -
24 05-29 Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation Aktives Layer-Kontrastives Decodieren reduziert Halluzination bei der Generierung von Großsprachenmodellen 大型语言模式生成中活性多语言解层解码减少幻觉 2505.23657v1 -
25 05-29 ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs ARC: Argumentationsdarstellungs- und Coverage-Analyse für eine Null-Shot-Lang-Dokument-Zusammenfassung mit Instruktion nach LLMs ARC: “ 零张长文件摘要 “ 的参数代表性和覆盖面分析,在 “ LLM “ 之后指示 2505.23654v1 -
26 05-29 Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation Kleine Sprachmodelle: Architekturen, Techniken, Evaluation, Probleme und zukünftige Anpassung 小型语言模式:建筑、技术、评价、问题和未来适应 2505.19529v2 -
27 05-29 Are Reasoning Models More Prone to Hallucination? Sind vernünftigere Modelle eher halluzinierend? 理性模型更能让人产生幻觉吗? 2505.23646v1 -
28 05-29 Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives Position: Skalierung von LLM-Agenten erfordert asymptotische Analyse mit LLM-Primitiven 立场: 扩展 LLM 代理需要用 LLM 原语进行渐近分析 2502.04358v2 -
29 05-29 YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering YESciEval: Robuster LLM-as-a-Richter für die Beantwortung wissenschaftlicher Fragen YESciEval: 科学问题回答优异的LLM-as-a法官 2505.14279v2 -
30 05-29 Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education Menschliche Empathie als Encoder: KI-Assisted Depression Assessment in Special Education 人类的同情作为编码器:人工智能辅助的特殊教育中抑郁症评估 2505.23631v1 -
31 05-29 GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns GeNRe: Ein französisches Gender-Neutral-Rewriting-System mit kollektiven Substantiven GeNRe:使用集体名词的法语性别中立改写系统 2505.23630v1 -
32 05-29 AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图 2505.23628v1 -
33 05-29 RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning RULEBREAKERS: Herausfordernde LLMs an der Kreuzung zwischen formaler Logik und menschlicher Vernunft RULEBREAKERS: 在正式逻辑和类似人类的理由之间的十字路口挑战LLMs 2410.16502v2 -
34 05-29 Characterizing the Expressivity of Transformer Language Models Charakterisierung der Expressivität von Transformer-Sprachmodellen 描述变换语言模式的表达性 2505.23623v1 -
35 05-29 Table-R1: Inference-Time Scaling for Table Reasoning Tabelle-R1: Inferenz-Zeit-Skalierung für Tabellenveranlagung 表-R1:表格理由推理的推断时间尺度 2505.23621v1 -
36 05-29 EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation EXIT: Context-Aware Extractive Compression zur Verbesserung der Retrieval-Augmented Generation EXIT: 为加强回流-提款一代而实行的背景软件抽取压缩 2412.12559v3 -
37 05-29 Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Satori-SWE: Evolutionäre Test-Zeit-Skalierung für probeneffiziente Software-Engineering Satori-SWE:样本高效软件工程的进化测试-时间尺度 2505.23604v1 -
38 05-29 STeCa: Step-level Trajectory Calibration for LLM Agent Learning STeCa: Schritt-Level-Trajektorienkalibrierung für LLM Agent Learning STeCa:LLM代理学习的职级轨迹校准 2502.14276v2 -
39 05-29 X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents X-TURING: Auf dem Weg zu einem verbesserten und effizienten Turing-Test für Langzeit-Dialogagenten X-TURING:争取对长期对话代理进行强化和高效率的图灵测试 2408.09853v2 -
40 05-29 Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles Jigsaw-R1: Eine Studie über regelbasiertes Visuelles Verstärkungslernen mit Puzzle-Puzzles Jigsaw-R1:用Jigsaw谜语进行基于规则的视觉强化学习研究 2505.23590v1 -
41 05-29 On-Policy RL with Optimal Reward Baseline On-Policy RL mit optimaler Prämienbasis 具有最佳回报基准的 政策性RL 2505.23585v1 -
42 05-29 Multi-Domain Explainability of Preferences Multi-Domain-Erklärbarkeit von Präferenzen 优惠的多功能可解释性 2505.20088v2 -
43 05-29 Evaluating AI capabilities in detecting conspiracy theories on YouTube Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube 评价人工智能在YouTube上发现阴谋论的能力 2505.23570v1 -
44 05-29 Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models Segment Policy Optimization: Effektive Segment-Level-Kreditvergabe in RL für große Sprachmodelle 政策优化优化:大语言模式RL中有效的分部一级信用分配 2505.23564v1 -
45 05-29 LEXam: Benchmarking Legal Reasoning on 340 Law Exams LEXam: Benchmarking der rechtlichen Begründung von 340 Rechtsprüfungen LEXam:340项法律考试的法律依据基准 2505.12864v2 -
46 05-29 Understanding Refusal in Language Models with Sparse Autoencoders Ablehnung in Sprachmodellen mit Sparse Autoencodern verstehen 使用 sparse 自动解析器理解语言模式中的拒绝拒绝模式 2505.23556v1 -
47 05-29 Enhancing Automated Interpretability with Output-Centric Feature Descriptions Verbesserte Automatisierte Dolmetschbarkeit mit Output-Centric-Feature-Beschreibungen 加强自动解释与产出中心特点描述的可解释性 2501.08319v2 -
48 05-29 Translation in the Wild Übersetzung in der Wildnis 《野生》翻译 2505.23548v1 -
49 05-29 Probability-Consistent Preference Optimization for Enhanced LLM Reasoning Wahrscheinlichkeitskonsistente Preference-Optimierung für verbesserte LLM-Reasoning 增强 LLM 理由说明的优化 2505.23540v1 -
50 05-29 Fast Large Language Model Collaborative Decoding via Speculation Schnelles Large Language Model Kollaboratives Decodieren über Spekulation 通过投机进行快速大语言合作示范模式 2502.01662v2 -
51 05-29 CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification CLaC bei SemEval-2025 Task 6: Ein Multi-Architektur-Ansatz für die Verifikation von Unternehmensumweltversprechen SemEval-2025任务6:公司环境承诺核查的多建筑方法 2505.23538v1 -
52 05-29 Domain-Aware Tensor Network Structure Search Domain-Aware Tensor Netzwerkstruktur Suche 域- 软件显示器网络网络结构搜索 2505.23537v1 -
53 05-29 Joint Localization and Activation Editing for Low-Resource Fine-Tuning Gemeinsame Lokalisierungs- und Aktivierungsbearbeitung für Low-Resource Fine-Tuning 低资源微调联合定位和启动编辑 2502.01179v4 -
54 05-29 Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents Auf dem Weg zu logisch klingender natürlicher Sprache mit logisch-erweiterten Sprachmodell-Agenten 与逻辑增强语言示范代理商一道,争取实现逻辑合理自然语言合理 2408.16081v2 -
55 05-29 Hijacking Large Language Models via Adversarial In-Context Learning Entführen von großen Sprachmodellen über das adversarische In-Context-Lernen 通过对抗性内书学习劫持大语言模式 2311.09948v3 -
56 05-29 Identity resolution of software metadata using Large Language Models Identitätsauflösung von Software-Metadaten mit großen Sprachmodellen 使用大语言模式的软件元数据的识别分辨率 2505.23500v1 -
57 05-29 Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking Diagnose und Bewältigung von Pitfalls in KG-RAG-Datensätzen: Zu zuverlässigerem Benchmarking 分析和处理KG-RAG数据集的缺陷:争取更可靠的基准 2505.23495v1 -
58 05-29 Spoken Language Modeling with Duration-Penalized Self-Supervised Units Gesprochene Sprachmodellierung mit Dauer-Penalisierten Selbstüberwachten Einheiten 长期惩罚性自督单位的口语模拟模式 2505.23494v1 -
59 05-29 R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation R2I-Bench: 基准推理-驱动生成文本到图像 2505.23493v1 -
60 05-29 Learning to Poison Large Language Models for Downstream Manipulation Große Sprachmodelle für Downstream-Manipulation zu vergiften 学习下游操作毒物大语言模式 2402.13459v3 -
61 05-29 Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu In-Context Machine Translation für Low-Resource-Sprachen verstehen: Eine Fallstudie zu Mandschu 理解低资源语言的文内机翻译:关于满字的个案研究 2502.11862v2 -
62 05-29 Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions Firm oder Fickle? Bewertung großer Sprachmodelle Konsistenz in sequenziellen Interaktionen 公司或Fickle?评估大语言模型在序列相互作用中的一致性 2503.22353v2 -
63 05-29 Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt Überdenken in der langen Kette des Denkens aus der Perspektive des Selbstzweifels 从自杜卜特的视角重新思考长期思维链中的过度思考问题 2505.23480v1 -
64 05-29 Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons Bewertung der Leistungsfähigkeit und Fragilität großer Sprachmodelle auf der Selbsteinschätzung für neurologische Chirurgen 评价神经外科医生自我评估大语言模型的性能和脆弱性 2505.23477v1 -
65 05-29 Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns Socratic-PRMBench: Benchmarking-Prozess-Reward-Modelle mit systematischen Begründungsmustern Socratic-PRMBench:有系统说明理由模式的基准进程奖励模式 2505.23474v1 -
66 05-29 BenchmarkCards: Large Language Model and Risk Reporting BenchmarkCards: Großes Sprachmodell und Risikoberichterstattung 基准目录:大语言模式和风险报告 2410.12974v2 -
67 05-29 Agentic Knowledgeable Self-awareness Agentisch sachkundiges Selbstbewußtsein A. 动态知识自觉意识 2504.03553v2 -
68 05-29 UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions UAQFact: Bewertung der tatsächlichen Wissensnutzung von LLMs auf unbeantwortbaren Fragen UAQFact:评估关于无法回答问题LLMs的实情知识利用情况 2505.23461v1 -
69 05-29 Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models Teilen und Erobern: Eine hybride Strategie besiegt multimodale große Sprachmodelle 差异和征服:混合战略失败 多种多模式大语言模式 2412.16555v3 -
70 05-29 GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning GSQ-Tuning: Group-Shared Exponents integer in einer voll quantifizierten Schulung für LLMs On-Device-Fine-Tuning GSQ-Tuning:为LLM设备端微调提供全面量化训练的组共享指数整数 2502.12913v3 -
71 05-29 CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning CodePMP: Skalierbares Präferenzmodell Vorschulung für großsprachliche Modellaufklärung 守则PMP:可缩放的特惠模式大语言示范理由预培训模式 2410.02229v2 -
72 05-29 Rethinking Regularization Methods for Knowledge Graph Completion Überdenken von Regularisierungsmethoden für Wissensgraphenvervollständigung 重新思考知识图完成正规化方法 2505.23442v1 -
73 05-29 DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? DeepSeek vs. o3-mini: Wie gut können LLMs mit Vernunft bewerten MT und Zusammenfassung? DeepSeek对 o3-mini:推理型LLMs评价MT和摘要的能力如何? 2504.08120v2 -
74 05-29 LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding LLM als Effektiver Streaming-Prozessor: Überbrückung von Streaming-Batch-Mismatches mit Gruppenpositionskodierung LLM 有效流化处理程序: 将流流-批量错误与群居位置编码连接起来 2505.16983v2 -
75 05-29 SPRI: Aligning Large Language Models with Context-Situated Principles SPRI: Ausrichtung großer Sprachmodelle mit kontext-situierten Prinzipien SPRI:使大语言模式与上下文原则相一致 2502.03397v2 -
76 05-29 DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation DynaCode: Dynamischer Code Benchmark für die Bewertung großer Sprachmodelle in der Codegenerierung DynaCode:在代码生成过程中评价大语言模型的动态复杂度感知代码基准 2503.10452v2 -
77 05-29 Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition Probeneffiziente menschliche Bewertung großer Sprachmodelle durch maximalen Diskrepanzwettbewerb 通过最大差异竞争对大语言模式进行抽样有效人力评价 2404.08008v2 -
78 05-29 The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence Das Warmup-Dilemma: Wie sich Lernratenstrategien auf die Konvergenz von Sprach-Text-Modellen auswirken 暖化困境:学习速率战略如何影响演讲到文字模式模式汇合 2505.23420v1 -
79 05-29 SWE-bench Goes Live! SWE-bench geht live! SWE-bench 上线了! 2505.23419v1 -
80 05-29 On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists On-Device Collaborative Language Modeling über eine Mischung aus Generalisten und Spezialisten 通过通识主义者和专家混合组合的在线合作语言建模 2409.13931v4 -
81 05-29 LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline LLMs können qualitativ hochwertige Simultane Machine Translation so effizient wie Offline erreichen LLM Can 能够像离线那样高效率地实现高质量同声机翻译 2504.09570v2 -
82 05-29 From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs Von Parametern zu Prompts: Den Factuality Gap zwischen fein getunen LLMs verstehen und abschwächen 从参数到提示:了解并缩小微量贷款商之间的实际质量差距 2505.23410v1 -
83 05-29 EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse EFIM: Effizientes Servieren von LLMs zur Erfüllung von Aufgaben mit verbesserter KV Cache Reuse EFIM:以改进的KV缓存再利用高效率地为完成任务的LLMs服务 2505.21889v2 -
84 05-29 VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining VietASR: Erzielen von vietnamesischen ASR auf Branchenebene mit 50-Stunden-Daten und großformatigen Sprachvorschulungen 越南:在越南工业一级实现有50小时标签数据和大型演讲预科培训的有50小时标签的数据的越南ASR 2505.21527v2 -
85 05-29 Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models Adaptive Jailbreaking-Strategien basierend auf dem semantischen Verständnis von Fähigkeiten großer Sprachmodelle 基于大语言模型的语义理解能力 2505.23404v1 -
86 05-29 Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms Re-Ranking mit großen Sprachmodellen zur Minderung der Exposition gegenüber schädlichen Inhalten auf Social Media-Plattformen 利用大型语言模式,在社交媒体平台上减少接触有害内容 2501.13977v3 -
87 05-29 DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding DREAM: Entwurf mit raffinierten Target-Features und Entropie-Adaptive Cross-Attention Fusion für multimodale spekulative Dekodierung DREAM: 与改良目标特征和多模式投机下限的 Entropy-Adpy-Adpic 交叉注意聚变一起起草 2505.19201v2 -
88 05-29 ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation ReflectionCoder: Aus Reflexionssequenz lernen für verbesserte Einmal-Code-Generierung 思考编码:从强化一次性代码生成的反思序列中学习 2405.17057v2 -
89 05-29 BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages BRIGHTER: Die Lücke in menschlich annotierten Datensätzen zur Text-Emotionserkennung für 28 Sprachen überbrücken BRIGHTER:消除28种语言在人工标注的文本情感识别数据集方面的差距 2502.11926v4 -
90 05-29 GWQ: Gradient-Aware Weight Quantization for Large Language Models GWQ: Gradient-Aware Weight Quantization für große Sprachmodelle GWQ: 大语言模型的渐变软件重量 2411.00850v4 -
91 05-29 Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation Threading the Needle: Rewebing Chain-of-Thought Begründung zu erklären, Human Label Variation 针线串列: 重新编织尝试链 解释人类标签变化的原因 2505.23368v1 -
92 05-29 Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs Graph of Records: Steigerung der retrieval Augmented Generation für Langkontext-Zusammenfassung mit Graphen 记录图图:用图表进行长文本摘要的推进检索增量生成器 2410.11001v2 -
93 05-29 Discriminative Policy Optimization for Token-Level Reward Models Diskriminative Politikoptimierung für Token-Level-Reward-Modelle 词元级奖励模型的判别式策略优化 2505.23363v1 -
94 05-29 Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability Sind Generative Modelle unterbewusst? Bessere Qualitätsschätzung mit erhöhter Modellwahrscheinlichkeit 产生型号是否缺乏自信?更好的质量估算与促进型号的模型概率 2502.11115v2 -
95 05-29 mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus mOSCAR: Ein großer mehrsprachiger und multimodaler Korpus auf Dokumentebene mOSCAR: 大型多语种和多模式文件级语料库 2406.08707v2 -
96 05-29 Nosey: Open-source hardware for acoustic nasalance Nosey: Open-Source-Hardware für akustische Nasalance 鼻鼻:用于音响鼻鼻腔的开源硬件 2505.23339v1 -
97 05-29 Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors Weder Stochastic Parroting noch AGI: LLMs lösen Aufgaben durch kontextorientierte Extrapolation von Trainingsdaten Priors 既不是蒸蒸碎剖析,也不是AGI:通过根据培训数据前期进行的背景差异外推法解解解任务LLMs Solve任务 2505.23323v1 -
98 05-29 DReSD: Dense Retrieval for Speculative Decoding DReSD: Dense Retrieval für spekulative Dekodierung DReSD: 用于推测解码的稠密检索 2502.15572v2 -
99 05-29 Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO Proximalisierte Preference-Optimierung für unterschiedliche Feedback-Typen: Eine zerlegte Perspektive auf DPO 多种反馈类型的近端化偏好优化:对DPO的分解视角 2505.23316v1 -
100 05-29 Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments Verbesserung der Genauigkeit der Markerbewertung durch ordinelles Vertrauensmodellierung in Bildungsbewertungen 通过在教育评估中建立常规信任模型,加强标标码的准确度 2505.23315v1 -
101 05-29 Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction Datensatz-Featurierung: Enthüllen natürlicher Sprach-Features durch unüberwachte Daten-Rekonstruktion Dataset Featuriz化:通过未受监督的数据重建发现自然语言特征 2502.17541v2 -
102 05-29 Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs Generalisierte Category Discovery in Event-Centric Kontexten: Latent Pattern Mining mit LLMs 事件发生时发现的情况:利用LLMM公司进行原型采矿 2505.23304v1 -
103 05-29 Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs Dateneffiziente Meta-Modelle zur Auswertung kontextbasierter Fragen und Antworten in LLMs 评价LLMM基于背景的问答的元模型 2505.23299v1 -
104 05-29 EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian EmoBench-UA: Ein Benchmark-Datensatz für Emotionserkennung im Ukrainischen EmoBench-UA:乌克兰语情感检测基准数据集 2505.23297v1 -
105 05-29 How Does Response Length Affect Long-Form Factuality Wie wirkt sich die Response-Länge auf die Langform-Faktizität aus? 反应时间长度如何影响长期事实质量 2505.23295v1 -
106 05-29 Multi-Modal Framing Analysis of News Multi-Modal Framing Analyse der Nachrichten 新闻多模式结构分析 2503.20960v3 -
107 05-29 ScEdit: Script-based Assessment of Knowledge Editing ScEdit: Script-basierte Bewertung von Wissensbearbeitung ScEdit: 基于脚本的知识编辑评估 2505.23291v1 -
108 05-29 Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency Unsicherheit Quantifizierung für LLMs durch Minimum Bayes Risiko: Vertrauensüberbrückung und Konsistenz 通过最低贝谷风险对LLMs的不确定性量化: 建立互信和一致性 2502.04964v4 -
109 05-29 MathArena: Evaluating LLMs on Uncontaminated Math Competitions MathArena: Bewertung von LLMs auf nicht kontaminierten Math-Wettbewerben MathArena:评估未受污染数学竞赛的LLMs 2505.23281v1 -
110 05-29 Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective Sentinel: Aufmerksamkeitsprobierung von Proxy-Modellen für LLM-Kontextkompression mit verstehender Perspektive 哨兵:注意从理解角度观察LLM背景压缩的代理模型 2505.23277v1 -
111 05-29 The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text Der arabische KI-Fingerabdruck: Stylometrische Analyse und Erkennung von großen Sprachmodellen Text 阿拉伯文 AI 指纹:大语言模型文本的文体计量分析和探测 2505.23276v1 -
112 05-29 BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes BioVL-QR: Egozentrischer biochemischer Vision- und Sprachdatensatz mit Micro-QR-Codes BioVL-QR:使用Micro QR码的第一人称视角生物化学视觉和语言数据集 2404.03161v3 -
113 05-29 Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs Entfernt Machine Unlearning wirklich Modellwissen? Ein Rahmen für die Prüfung von Unlearning in LLMs 机器取消学习是否真正删除了示范知识? 审计框架是否在LLMM中取消学习? 2505.23270v1 -
114 05-29 Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem? Token Pruning in multimodalen großen Sprachmodellen: Lösen wir das richtige Problem? 在多式大语言模式中的 Token Prurning:我们是否解决了正确的问题? 2502.11501v2 -
115 05-29 A Reality Check on Context Utilisation for Retrieval-Augmented Generation Ein Realitätscheck auf Kontext-Auslastung für retrieval-Augmented Generation 关于回收-提款人一代的上下文利用情况的现实检查 2412.17031v2 -
116 05-29 Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs Strukturverstärkte Protein-Instruktions-Tuning: Auf dem Weg zu einem allgemeinen Protein-Verständnis mit LLMs 结构强化的蛋白质指导指示图示:争取与LLMs达成一般用途的蛋白性了解 2410.03553v3 -
117 05-29 Skywork Open Reasoner 1 Technical Report Skywork Open Reasoner 1 Technischer Bericht ” 天窗开放理由1 “ 技术报告 2505.22312v2 -
118 05-29 Tensor Product Attention Is All You Need Tensor Produkt-Achtung ist alles, was Sie brauchen 色素产品 关注是所有你需要的 2501.06425v4 -
119 05-29 Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers Automatische Konstruktion mehrerer Klassifizierungsdimensionen für die Verwaltung von Ansätzen in wissenschaftlichen Papieren 科学文件中管理方法的多重分类方面自动构建 2505.23252v1 -
120 05-29 SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents SOTOPIA-$Ω$:面向社会代理的动态策略注入学习和社会指令遵循评估 2502.15538v3 -
121 05-29 Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts Autonome Datenauswahl mit Zero-shot Generative Klassifikatoren für mathematische Texte 具有数学文本零光生成分类器的自动数据选择 2402.07625v6 -
122 05-29 ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering ChartMind: Ein umfassender Benchmark für komplexe multimodale Chart-Fragebeantwortung 图表Mind:复杂现实世界多式联运图表问题回答综合基准 2505.23242v1 -
123 05-29 PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts PolyMath: Mathematische Vernunft in multilingualen Kontexten bewerten 多语制:多语种背景下的数学理由评估 2504.18428v3 -
124 05-29 Pandora’s Box or Aladdin’s Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models Pandora’s Box oder Aladdin’s Lampe: Eine umfassende Analyse, die die Rolle des RAG-Geräuschs in großen Sprachmodellen aufzeigt Pandora的盒子或Aladdin的灯光:全面分析RAG噪音在大语言模型中的作用 2408.13533v3 -
125 05-29 MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration MCTSr-Zero: Selbstreflektierende Psychologische Beratung Dialoge Generation über Prinzipien und Adaptive Exploration MCTSr-Zero:通过原则和适应性探索进行自我反思心理辅导对话生成 2505.23229v1 -
126 05-29 HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model HiDe-LLaVA: Hierarchische Entkopplung zur kontinuierlichen Instruktionstuning von multimodalen Großsprachenmodellen HiDe-LLaVA:多式大语言模式连续教学制导的等级脱钩 2503.12941v2 -
127 05-29 Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage Bidirektionale Ketten von Gedanken- und Belohnungsmechanismen zusammenführen Eine Methode zur Verbesserung von Frage-Antwort-Fähigkeiten von großen Sprachmodellen für chinesisches immaterielles Kulturerbe 利用思想和奖赏机制的双向双向两向链 提高中国非物质文化遗产大语言模式的回答问题能力的方法 2505.08167v3 -
128 05-29 Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking Reasoning-to-Defend: Sicherheitsbewusste Reasoning kann große Sprachmodelle von Jailbreaking verteidigen 理由到理由:安全意识理由能够捍卫从破室中使用大语言的模型 2502.12970v2 -
129 05-29 DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models DiagnosisArena: Benchmarking Diagnostic Reasoning für große Sprachmodelle DiagnosisArena:大语言模型诊断推理基准 2505.14107v4 -
130 05-29 MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung MMBoundary:通过推理步骤置信度校准提高MLLM知识边界认识 2505.23224v1 -
131 05-29 KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search KBQA-o1: Agentische Wissensdatenbank Frage beantworten mit Monte Carlo Baumsuche KBQA- o1: 用于蒙特卡洛树搜索的代理知识库问题解答 2501.18922v3 -
132 05-29 Reducing Tool Hallucination via Reliability Alignment Reduzieren der Werkzeughalluzination durch Zuverlässigkeitsanpassung 通过可靠性调整减少工具幻觉 2412.04141v3 -
133 05-29 Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces Verbesserung der parallelen Programmleistung mit LLM-Optimierern über Agent-System-Schnittstellen 通过代理-系统接口改进与LLM优化器的平行方案绩效 2410.15625v3 -
134 05-29 System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts System-1.5 Reasoning: Traversal in Sprach- und Latentenräumen mit dynamischen Shortcuts 系统-1.5 理由:具有动态快捷键的语言和隐藏空间的变化 2505.18962v2 -
135 05-29 FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning FCMR: Robuste Bewertung der finanziellen Cross-Modal Multi-Hop Reasoning FCMR: 对跨模式、多渠道金融理由的有力评价 2412.12567v3 -
136 05-29 Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection Multimodale Inverse Aufmerksamkeit Netzwerk mit Intrinsic Discriminant Feature Exploitation für gefälschte Nachrichten Erkennung 多式反向关注网络,利用内在差异性地貌特征利用假新闻探测 2502.01699v2 -
137 05-29 BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning BioProBench: Umfassender Datensatz und Benchmark im Biologischen Protokoll Verständnis und Vernunft BioProBench:生物议定书理解和理由的综合数据集和基准 2505.07889v2 -
138 05-29 $T^5Score$: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets $T^5Score$: Eine Methode zur automatischen Bewertung der Qualität von LLM Generated Multi-Document Topic Sets $T^5Score$:自动评估LLM生成的多文件专题集质量的方法 2407.17390v3 -
139 05-29 ExpeTrans: LLMs Are Experiential Transfer Learners ExpeTrans: LLMs sind erfahrungsbasierte Transfer-Lerner ExpeTrans: LLMs 是经验性转移学习者 2505.23191v1 -
140 05-29 Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration Erfahrungsübergreifendes Lernen auf LLM-basierter Multi-Agent-Kollaboration 关于基于LLM的多机构合作的跨任务跨任务经验学习 2505.23187v1 -
141 05-29 Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement Unüberwachte Bewertung auf Word-Level-Qualität für maschinelle Übersetzung durch die Linse der Annotatoren (Dis)Vereinbarung 未经监督的通过标注员的镜头进行机器翻译的字级质量估计 2505.23183v1 -
142 05-29 Improving Continual Pre-training Through Seamless Data Packing Verbesserung der kontinuierlichen Vorschulung durch nahtloses Datenpaket 通过无缝无缝数据包装改进持续培训前培训 2505.22018v2 -
143 05-29 Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification Infinite-Instruct: Synthesizing Scaling Code instruction Daten mit bidirektionaler Synthese und statischer Verifikation 无限指令:以双向合成和静态核查将缩放码指示数据与双向合成和静态核查结合起来 2505.23177v1 -
144 05-29 Map&Make: Schema Guided Text to Table Generation Map&Make: Schema-Leittext zur Tabellenerstellung Map&Make: 模式引导的文本到表格生成 2505.23174v1 -
145 05-29 ZIPA: A family of efficient models for multilingual phone recognition ZIPA: Eine Familie von effizienten Modellen für mehrsprachige Telefonerkennung ZIPA:一套有效的多语言电话识别模式 2505.23170v1 -
146 05-29 Tell, Don’t Show: Leveraging Language Models’ Abstractive Retellings to Model Literary Themes Tell, Don’t Show: Die abstrakten Retellings von Sprachmodellen nutzen, um literarische Themen zu modellieren Tell, don’t show: 利用语言模型对示范文学主题的抽象引用 2505.23166v1 -
147 05-29 Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach Temporale Beziehungsextraktion in klinischen Texten: Ein Span-basierter Graph Transformer-Ansatz 临床文本中的时间关系抽取时间关系:基于泛泛面的图形变形器方法 2503.18085v2 -
148 05-29 Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs Zu konsequent, um zu erkennen: Eine Studie über selbstkonsistente Fehler in LLMs 过于一致而无法检测:LLM中自洽错误的研究 2505.17656v2 -
149 05-29 Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models Cross-Domain Zweisprachige Lexikoninduktion über vortrainierte Sprachmodelle 通过预训练语言模型进行跨域双语词典归纳 2505.23146v1 -
150 05-29 ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation Parammute: Unterdrückende wissenskritische FFNs für treue retrieval-erweiterte Generation 分量:制止知识-关键FFFF,以用于忠实检索-养殖一代 2502.15543v2 -
151 05-29 Enhancing Large Language Models’ Machine Translation via Dynamic Focus Anchoring Verbesserung der maschinellen Übersetzung großer Sprachmodelle durch Dynamic Focus Anchoring 通过动态焦点锚定增强大语言模型的机器翻译 2505.23140v1 -
152 05-29 CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction CLEME2.0: Auf dem Weg zur Interpretierbaren Bewertung durch Entwirren von Edits für die Korrektur von Grammatikfehlern CLEME2.0:通过拆分文体错误校正的编辑版实现可解释性评价 2407.00934v2 -
153 05-29 Learning to Reason under Off-Policy Guidance Unter außerpolitischer Anleitung zur Vernunft lernen 根据非政策指导学习理由 2504.14945v4 -
154 05-29 EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models EarthSE: Ein Benchmark für die Bewertung der wissenschaftlichen Explorationsfähigkeit der Erde für große Sprachmodelle EarthSE:大语言模型地球科学探索能力基准评估 2505.17139v2 -
155 05-29 Jailbreaking to Jailbreak Jailbreaking zum Jailbreak 以越狱实现越狱 2502.09638v2 -
156 05-29 REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space REVS: Unlearning Sensible Information in Language Models via Rank Editing im Vokabelfeld REVS:通过词汇空间排行编辑在语言模型中学习敏感信息 2406.09325v5 -
157 05-29 GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning GETReason: Bildkontext-Extraktion durch Hierarchische Multi-Agenten-Reasoning verbessern GetReason:通过等级式多机构代理理由加强图像背景采掘 2505.21863v2 -
158 05-29 LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data LongFaith: Verbesserung der Langkontext-Reasonierung in LLMs mit treuen synthetischen Daten 长面:利用忠实合成数据加强LLMs中的长方理由 2502.12583v2 -
159 05-29 Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context Human-Readable Adversarial Prompts: Eine Untersuchung von LLM-Fehlern mit situationsbezogenem Kontext 人类可以读取的反向提示:利用情况背景调查LLM脆弱性 2412.16359v3 -
160 05-29 PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics PBEBench: Ein mehrstufiges Programmieren nach Beispielen, inspiriert von historischer Linguistik PBEBench:根据历史语言推导的多层次方案拟定工作 2505.23126v1 -
161 05-29 CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark CASS: Nvidia zu AMD Transpilation mit Daten, Modellen und Benchmark CASS: Nvidia 到AMD 传输数据、模型和基准 2505.16968v3 -
162 05-29 Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging Verbesserung des Brain-to-Image-Reconstructions durch feinkörnige Text-Bridging 通过完善的文本连接改进脑到图像重建 2505.22150v2 -
163 05-29 ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v1 -
164 05-29 Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios Elicit und Enhance: Multimodale Reasoning in medizinischen Szenarien fördern 明确和强化:推进医疗假想中的多式联运理由 2505.23118v1 -
165 05-29 Learning to Reason from Feedback at Test-Time Von Feedback bei Test-Time zur Vernunft lernen 从测试时的反馈中学习到理由 2502.15771v2 -
166 05-29 Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data Datensatzkartographie für großsprachliche Modellausrichtung: Mapping und Diagnose von Präferenzdaten 用于大语言模型对齐的数据集制图:绘图和诊断优先数据 2505.23114v1 -
167 05-29 C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation C$^2$LEVA: Auf dem Weg zu einer umfassenden und kontaminationsfreien Sprachmodellbewertung C$^2$LEVA:迈向全面且无污染的语言模型评估 2412.04947v3 -
168 05-29 FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article FutureGen: LLM-RAG Ansatz zur Generierung der zukünftigen Arbeit des wissenschaftlichen Artikels FutureGen:LLM-RAG 产生科学条款未来工作的方法 2503.16561v2 -
169 05-29 LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study LLM trifft Szenegraph: Können große Sprachmodelle Szenengraphen verstehen und generieren? Eine Benchmark- und Empirische Studie LLM 满足景象图:大语言模型能够理解和产生景象图吗? 基准和经验研究 2505.19510v2 -
170 05-29 Generating Diverse Training Samples for Relation Extraction with Large Language Models Erzeugen von unterschiedlichen Trainingsbeispielen für die Beziehungsextraktion mit großen Sprachmodellen 生成多种培训样本,用于与大语言模式的抽取关系 2505.23108v1 -
171 05-29 Can We Predict Performance of Large Models across Vision-Language Tasks? Können wir die Leistung großer Modelle über Vision-Language-Aufgaben hinweg voraussagen? 我们能否预测大型模型在愿景-语言任务中的绩效? 2410.10112v2 -
172 05-29 Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models Automatische Übertragung für LLM-Tiers: Kosten- und Genauigkeitsoptimierung in großen Sprachmodellen LLM Tiers 自动传输: 优化大语言模型的成本和准确度 2505.20921v2 -
173 05-29 RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models RepCali: Hocheffizientes Feintuning über Darstellungskalibrierung im Latent Space für vortrainierte Sprachmodelle RepCali:为预培训语言模型在冷藏空间进行高效的精微微调 Via代表比例校准 2505.08463v2 -
174 05-29 SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation SimGRAG: Nutzung ähnlicher Subgraphen für Wissensgraphen Driven Retrieval-Augmented Generation SimGRAG: 利用知识图形驱动回溯源的类似子集 2412.15272v2 -
175 05-29 MAP: Revisiting Weight Decomposition for Low-Rank Adaptation MAP: Überdenken der Gewichtszerlegung für Low-Rank-Anpassung MAP:重新审视低秩适应的权重分解 2505.23094v1 -
176 05-29 Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen Infi-MMMR:通过在多模式小型语言模式中分阶段强化学习,以课程为基础解锁多模式原因 2505.23091v1 -
177 05-29 Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport Document-Level Text Generierung mit minimalen Bayes Risikodekodierung mit optimalem Transport 采用最佳运输方式,以文件水平生成具有最低比值风险解码的文本 2505.23078v1 -
178 05-29 Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation Kontextualisierte automatische Spracherkennung mit dynamischer Vokabelvorhersage und Aktivierung 具有动态词汇预测和启动功能的实用自动语音识别 2505.23077v1 -
179 05-29 Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts Shortcut-verbundene Experten-Parallelität für die Beschleunigung von Mixture-of-Experts 用于加速混合专家模型的捷径连接专家并行 2404.05019v3 -
180 05-29 SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models SORSA: Singuläre Werte und Orthonormale Regularisierte Singuläre Vektoren Anpassung großer Sprachmodelle SORSA: 单项价值和正正正的正规化的单项矢量,以适应大语言模式 2409.00055v6 -
181 05-29 SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services SNS-Bench-VL: Benchmarking multimodaler Großsprachenmodelle in Social Networking Services SNS-Bench-VL:确定社会联网服务中多式大语言模式基准 2505.23065v1 -
182 05-29 GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation GIVE: Strukturierte Begründung großer Sprachmodelle mit Wissensgrafik inspirierte Veracity-Extrapolation 特具:大语言模式结构原因说明,以知识图激发的多才多艺外推法 2410.08475v3 -
183 05-29 Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design Spekulative Dekodierung trifft auf Quantisierung: Kompatibilitätsbewertung und Hierarchisches Framework Design 投机性下限符合量化:兼容性评价和等级框架设计 2505.22179v2 -
184 05-29 Self-Correcting Code Generation Using Small Language Models Selbstkorrekte Code-Generierung mit kleinen Sprachmodellen 使用小型语言模式自行校正代码生成 2505.23060v1 -
185 05-29 Be.FM: Open Foundation Models for Human Behavior Be.FM: Open Foundation Modelle für menschliches Verhalten BeFM: 人类行为开放基础模型 2505.23058v1 -
186 05-29 OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics OrionBench: Ein Benchmark für Diagramm- und Mensch-erkennbare Objekterkennung in Infografiken Orion Bunch:图表和人类可识别的在信息图中探测物体的基准 2505.17473v3 -
187 05-29 Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation Destill CLIP (DCLIP): Bild-Text-Retrieval durch Cross-Modal Transformer-Destillation verbessern 蒸馏 CLIP (DCLIP): 通过跨模式变异器蒸馏加强图像- 文本回收 2505.21549v2 -
188 05-29 Query Routing for Retrieval-Augmented Language Models Abfrage-Routing für Retrieval-Augmented Language-Modelle 查询检索推荐语言模型的查询路径 2505.23052v1 -
189 05-29 DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration DenoiseRotator: Verbesserung der Beschneidungsfestigkeit für LLMs durch Bedeutungskonzentration DenoisRotator:通过重视浓度提高LLMs的稳健力 2505.23049v1 -
190 05-29 Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines Instruction-Tuning LLMs für die Ereignisextraktion mit Annotationsrichtlinien 使用标注指南对LLM进行指令微调以用于事件抽取 2502.16377v2 -
191 05-29 FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems FlexDuo: Ein Pluggable-System zur Ermöglichung von Full-Duplex-Fähigkeiten in Sprachdialogsystemen FlexDuo:一个促进语音对话系统全面灵活能力的插件系统 2502.13472v2 -
192 05-29 NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables NeedleInATable: Erforschen von Langkontext-Kapazität von großen Sprachmodellen zu langstrukturierten Tabellen 针线表:探索长结构表格中大语言模型的长文能力 2504.06560v2 -
193 05-29 Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation Cross-modal RAG: Sub-dimensionale Retrieval-Augmented Text-to-Image Generation 跨模式RAG:次二维检索增强的文本到图像生成 2505.21956v2 -
194 05-29 TailorSQL: An NL2SQL System Tailored to Your Query Workload TailorSQL: Ein NL2SQL-System, das auf Ihre Abfrage-Workloads zugeschnitten ist 定制SQL: 适合您查询工作量的 NL2SQL 系统 2505.23039v1 -
195 05-29 EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models EL4NER: Ensemble Lernen für die benannte Entity-Erkennung über mehrere kleine Parameter große Sprachmodelle EL4NER:通过多小口径大语言模型进行命名实体识别的结合学习 2505.23038v1 -
196 05-29 Improving Multilingual Social Media Insights: Aspect-based Comment Analysis Mehrsprachige Social Media-Insights verbessern: Aspect-based Comment Analysis 改进多语种社会媒体透视:基于背景的评论分析 2505.23037v1 -
197 05-29 LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization LoRA-MGPO: Doppelabstieg in der Low-Rank-Anpassung durch Momentum-geführte Perturbierungs-Optimierung abmildern LoRA-MGPO:通过动力调节-受控渗透优化,减少低辐射适应中的双重来源 2502.14538v2 -
198 05-29 Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse Machine-Facing English: Definition eines hybriden Registers, geformt von Human-AI Diskurs 面向机器的英语: 定义由人类-AI 论文构成的混合登记册 2505.23035v1 -
199 05-29 Exploring the Limitations of Mamba in COPY and CoT Reasoning Erforschung der Grenzen von Mamba in COPY und CoT Reasoning 探索COPY和COT理由解释中Mamba的局限性 2410.03810v3 -
200 05-29 AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge AntiLeakBench: Datenkontamination durch automatisches Konstruieren von Benchmarks mit aktualisiertem Real-World-Wissen verhindern 防止泄漏:利用最新现实世界知识自动建立基准,防止数据污染 2412.13670v2 -
201 05-29 On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs Über das Risiko der Beweisverschmutzung für bösartige Social Text Detection in der Ära der LLMs 关于在LLMM公司时代对恶性社会文本进行侦破的证据污染风险 2410.12600v2 -
202 05-29 Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset Können moderne NLP-Systeme zuverlässig Röntgenuntersuchungen im Brustkorb annotieren? Eine Pre-Purchase-Bewertung und vergleichende Untersuchung von Lösungen von AWS, Google, Azure, John Snow Labs und Open-Source-Modellen auf einem unabhängigen Kinderdatensatz 现代NLP系统能否可靠地说明胸前射电测量? 对AWS、Google、Azure、John Snow实验室和独立儿科数据集开放来源模型的解决方案进行采购前评估和比较研究 2505.23030v1 -
203 05-29 Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Spac Enthüllen visual-semantischer psycholinguistischer Eigenschaften aus der Verteilungsstruktur von Texteinbettung Spac 从文字嵌入的文本分布结构中隐藏的视觉-语言心理语言属性 2505.23029v1 -
204 05-29 Context Robust Knowledge Editing for Language Models Kontext Robuste Wissensbearbeitung für Sprachmodelle 语言模型的上下文强力知识编辑 2505.23026v1 -
205 05-29 AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models AgentAlign: Navigieren der Sicherheitsausrichtung im Wechsel von Informativ zu Agentischen Großsprachenmodellen 代理对齐: 导航从信息型转向大语言型的移动中的安全对齐 2505.23020v1 -
206 05-29 SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models SciHorizon: Benchmarking von KI-für-Science Readiness von wissenschaftlichen Daten zu großen Sprachmodellen SciHorizon:将AI-SciHorizon科学准备程度从科学数据基准确定为大语言模式 2503.13503v3 -
207 05-29 Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages Mehrsprachiger Encoder weiß mehr als Sie realisieren: Geteilte Gewichte Vortraining für extrem ressourcenarme Sprachen 多语种编码器者比你所认识的要多得多: 极低资源语言的共有重力预培训 2502.10852v2 -
208 05-29 Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models Ermittlung von Stealthy Backdoor-Proben auf Basis von Intra-Klasse-Abstand für große Sprachmodelle 检测基于大语言模型班级内部距离的隐形后门样本 2505.23015v1 -
209 05-29 BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models BA-LoRA: Bias-Alleviating Low-Rank Anpassung an Mitigate Katastrophische Vererbung in großen Sprachmodellen BA-LORA:在大语言模型中,对减轻灾害传承的低率适应 2408.04556v5 -
210 05-29 Synthetic Document Question Answering in Hungarian Synthetische Dokument-Frage-Antworten auf Ungarisch 匈牙利语的合成文件问题解答 2505.23008v1 -
211 05-29 A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs Ein praktischer Ansatz zum Aufbau produktionsreifer Conversational Agents mit Workflow-Graphen 利用工作流图构建生产级对话代理的实用方法 2505.23006v1 -
212 05-29 Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation Kette der geerdeten Ziele: Überbrückungsprozess und zielorientiertes Prompting für die Codegenerierung 基本目标链链:搭桥进程和以目标为导向的促进代码生成 2501.13978v2 -
213 05-29 What’s In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models Was ist auf Ihrem Gebiet? Mapping Wissenschaftliche Forschung mit Wissensgraphen und großen Sprachmodellen 你的领域是什么?用知识图和大语言模型绘制科学研究图。 2503.09894v2 -
214 05-29 DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors DyePack: Wahrscheinlich Flagging Test Set Kontamination in LLMs Verwendung von Backdoors DyePack: 使用后门的LLMs中可被证实的挂旗试验设置污染 2505.23001v1 -
215 05-29 Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation Prüfen Sie in der Grafik: Entity Disambiguation Enhancement für komplexe Claim-Verifikation mit interaktiver Graphendarstellung 校验格中:实体对复杂索赔核实与交互式图表代表的分歧增强 2505.22993v1 -
216 05-29 Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition Pangu Embedded: Effizienter Dual-System LLM Reasoner mit Metakognition Pangu Embedded:具有元认知的高效双系统LLM推理器 2505.22375v2 -
217 05-29 Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems Agent-UniRAG: Ein trainingables Open-Source LLM Agent Framework für unified Retrieval-Augmented Generation Systems Agent-UniRAG: 一个可培训的开放源码的LLM Agent Form for United Retreval-Augsing System(统一回收-提款发电系统框架) 2505.22571v2 -
218 05-29 Frankentext: Stitching random text fragments into long-form narratives Frankentext: Zufällige Textfragmente zu langformigen Erzählungen heften Frankentext: 将随机文本片断成长式叙述 2505.18128v2 -
219 05-29 Theoretical guarantees on the best-of-n alignment policy Theoretische Garantien für die optimale Ausrichtungspolitik 关于最佳协调政策理论保障 2401.01879v3 -
220 05-29 Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs Business as Rulesual: Benchmark und Rahmen für Business Rule Flow Modellierung mit LLMs 业务作为规则:与LLMs建立商业规则流动模式的基准和框架 2505.18542v2 -
221 05-29 Exploring Scaling Laws for EHR Foundation Models Erforschung von Skalierungsgesetzen für EHR-Stiftungsmodelle 探索EHR基金会模式的扩展法律 2505.22964v1 -
222 05-29 ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind ToMAP: Training Gegner-Bewusst LLM überzeugt mit Theorie des Geistes ToMAP:培训有思想理论的对抗者软件软件LLM 2505.22961v1 -
223 05-29 LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements LLM-basierte HSE Compliance Assessment: Benchmark, Performance und Advancements 基于LLM的HSE合规评估:基准、业绩和进步 2505.22959v1 -
224 05-29 Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View Enthüllen von Umweltauswirkungen von großsprachigen Modellen: Eine funktionale Einheitsansicht 大型语文服务模式的不懈环境影响:职能单位观点 2502.11256v2 -
225 05-29 CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance CodeSteer: Symbolisch-Augmentierte Sprachmodelle über Code/Text Anleitung CodeSteer:通过代码/文本指导的符号增强语言模型 2502.04350v2 -
226 05-29 LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments LLMs for argument Mining: Detection, Extraction, and Relationship Classification of pre-defined argumentments in Online Kommentare 辩论采矿的LLMs:在线评论中预先界定的论据的探测、提取和关系分类 2505.22956v1 -
227 05-29 Understanding Bias Reinforcement in LLM Agents Debate Verständnis der Bias-Verstärkung in LLM-Agenten-Debatte 了解LLLM代理商的强化申请 2503.16814v2 -
228 05-29 StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs StrucSum: Graph-strukturierte Begründung für lange Dokumentextraktionszusammenfassung mit LLMs StrucSum: 长文件提取摘要的图表结构化原因与LLMs 2505.22950v1 -
229 05-28 (3) NegVQA: Can Vision Language Models Understand Negation? NegVQA: Können Visions-Sprachmodelle Negation verstehen? NegVQA:视觉语言模式能理解差吗? 2505.22946v1 -
230 05-28 OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature OWL: Über die Weltliteratur testet Cross-Lingual Recall von gemerkten Texten OWL: 通过世界文学对记忆文字进行相互最后回顾 2505.22945v1 -
231 05-28 Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates Können LLMs CLIP täuschen? Benchmarking Adversarial Compositionalität der vortrainierten multimodalen Darstellung über Textaktualisierungen LLM能否欺骗CLIP?通过文本更新对预训练多模态表示的对抗组合性进行基准测试 2505.22943v1 -
232 05-28 WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning WorkForceAgent-R1: Förderung der Reasoning-Fähigkeit von LLM-basierten Web-Agenten durch Verstärkungs-Lernen WorkForceAgent-R1:通过强化学习激励基于LLM的网络代理的推理能力 2505.22942v1 -
233 05-28 Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs Verbesserung der QA-Effizienz mit DistilBERT: Feintuning und Schlussfolgerung auf mobilen Intel-CPUs 提高利用dittplBERT提高QA效率:移动 Intel CPU的精密查询和推断 2505.22937v1 -
234 05-28 Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging Unraveling LoRA Interferenz: Orthogonale Subräume für robuste Modellzusammenführung 开放 LoRA 干涉度: 用于强力模型合并的正弦形子空间 2505.22934v1 -
235 05-28 K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction K-Paths: Begründung über Graphenpfade für Drogenrepurposing und Drogeninteraktionsvorhersage K-Paths: 以图解路径为依据进行药物再定位和药物相互作用预测 2502.13344v3 -
236 05-28 How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias Wie Transformer lernen Regelmäßige Spracherkennung: Eine theoretische Studie über Trainingsdynamik und Implizite Bias 变换人如何学习常规语言识别:关于培训动态和隐含偏见的理论研究 2505.00926v3 -
237 05-28 Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning Verbesserung der Schlussfolgerungen auf Studienebene aus klinischen Studienpapieren über RL-basierte numerische Begründung 通过基于RL的数值推理从临床试验文件中提高研究水平的推论 2505.22928v1 -
238 05-28 Structured Memory Mechanisms for Stable Context Representation in Large Language Models Strukturierte Speichermechanismen für stabile Kontextdarstellung in großen Sprachmodellen 在大语言模式中建立结构化内存机制,以稳定地代表大语言模式 2505.22921v1 -
239 05-28 ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room ER-REASON: Ein Benchmark-Datensatz für LLM-basierte klinische Vernunft in der Notaufnahme ER-REASON:应急室以LLM为基础的临床原因基准数据集 2505.22919v1 -
240 05-28 Talent or Luck? Evaluating Attribution Bias in Large Language Models Talent oder Glück? Bewertung der Attribution Bias in großen Sprachmodellen 人才或幸运?评价大语言模式中的可归责偏见 2505.22910v1 -
241 05-28 Conversational Alignment with Artificial Intelligence in Context Conversational Alignment mit Künstlicher Intelligenz im Kontext 与现场人工智能的对调 2505.22907v1 -
242 05-28 VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models VIGNETTE: Sozial geerdete Bias-Evaluierung für Vision-Language-Modelle VIGNETTE:社会基础的愿景-语言模型的偏见评价 2505.22897v1 -
243 05-28 When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy Wenn Modelle Grund in Ihrer Sprache: Kontrollieren Denken Trace Language kommt auf Kosten der Genauigkeit 当模型在您语言中的原因:控制思考追踪语言以准确性为代价时 2505.22888v1 -
244 05-28 Enhancing Retrieval for ESGLLM via ESG-CID – A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS Verbesserung der Retrieval für ESGLLM über ESG-CID – Ein Disclosure Content Index Finetuning Datensatz für die Mapping GRI und ESRS 通过ESG-CID – – 用于测绘GRI和ESRS的披露内容指数微调数据集,加强ESGLLM的检索 2503.10674v2 -
245 05-28 GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification GateNLP bei SemEval-2025 Aufgabe 10: Hierarchisches Drei-Schritt-Prompting für mehrsprachige Narrative Klassifizierung GateNLP在SemEval-2025任务10:用于多语言叙事分类的分层三步提示 2505.22867v1 -
246 05-28 Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge Große Sprachmodelle für Depressionserkennung in gesprochener Sprache Integrieren Psychologisches Wissen 口语结合心理知识中承认抑郁症的大语言模式 2505.22863v1 -
247 05-28 NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding NGPU-LM: GPU-beschleunigtes N-Gram-Sprachenmodell für Kontext-Biasing in Greedy ASR-Dekodierung NGPU-LM: 加速GPU-加速型N-Gram语语模式,用于在贪婪ASR标记中进行背景切换 2505.22857v1 -
248 05-28 LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference LiTEx: Eine linguistische Taxonomie von Erklärungen zum Verständnis von Inner-Label-Variation in natürlicher Sprach-Inferenz LiTEx:用语言对解释进行分类,以了解在标内对自然语言推断的变异的理解 2505.22848v1 -
249 05-28 ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts ASTPrompter: Präferenzorientiertes Automatisiertes Sprachmodell Red-Teaming zur Generierung von Low-Perplexity-Unsicheren Prompts ASTPrompter:为产生低重复性不安全提示而建立首选统一自动语言示范红队 2407.09447v4 -
250 05-28 Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation Bayesian Attention Mechanism: Ein probabilistisches Framework für die Positionskodierung und Kontextlängen-Extrapolation Bayesian注意机制:定位编码和背景长度外推概率框架 2505.22842v1 -
251 05-28 The Aloe Family Recipe for Open and Specialized Healthcare LLMs Das Aloe-Familienrezept für offene und spezialisierte LLMs im Gesundheitswesen 开放和专门保健的Aloe家庭食堂 2505.04388v2 -
252 05-28 What Has Been Lost with Synthetic Evaluation? Was wurde mit synthetischer Bewertung verloren? 合成评价失去了什么? 2505.22830v1 -
253 05-28 Self-Critique and Refinement for Faithful Natural Language Explanations Selbst-Kritik und Raffinesse für treue natürliche Spracherklärungen 忠实自然语言自我简化和完善解释 2505.22823v1 -
254 05-28 Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model Vergleich menschlicher und KI-Rater-Effekte mit dem Multi-Facet-Rasch-Modell 使用多面 Rasch 模型比较人类和AI Rater效应 2505.18486v2 -
255 05-28 Toward universal steering and monitoring of AI models Zur universellen Steuerung und Überwachung von KI-Modellen 实现对AI 模式的普遍指导和监测 2502.03708v2 -
256 05-28 First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay Erste Schritte auf dem Weg zu mithörenden LLM-Agenten: Eine Fallstudie mit Dungeons & Dragons Gameplay 迈向旁听式LLM代理的第一步:以Dungeons & Dragons游戏为例的案例研究 2505.22809v1 -
257 05-28 Towards a More Generalized Approach in Open Relation Extraction Auf dem Weg zu einem allgemeineren Ansatz bei der Förderung offener Beziehungen 争取在开放关系采掘中采取更加普遍的做法 2505.22801v1 -
258 05-28 Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning Instruct-SkillMix: Eine leistungsstarke Pipeline für LLM Instruction Tuning 指令- SkillMix: 用于LLM 指令导导图的强大管道 2408.14774v4 -
259 05-28 SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains SequentialBreak: Große Sprachmodelle können durch Einbetten von Jailbreak Prompts in Sequential Prompt Chains ausgeblendet werden 顺序式布雷克:大语言模型可以通过将破狱线索嵌入顺序式提示链来蒙骗大语言模型 2411.06426v3 -
260 05-28 Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory Kulturelle Bewertungen von Vision-Sprachen-Modellen haben viel von der Kulturtheorie zu lernen 展望-语言模式的文化评价有许多可学习的文化理论 2505.22793v1 -
261 05-28 Can Large Language Models Match the Conclusions of Systematic Reviews? Können große Sprachmodelle mit den Schlussfolgerungen systematischer Bewertungen übereinstimmen? 大语言模型能否与系统审查的结论相匹配? 2505.22787v1 -
262 05-28 MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators MEDAL: Ein Rahmen für Benchmarking von LLMs als mehrsprachige Open-Domain Chatbots und Dialogevaluatoren MEDAL:多语言开放域聊天和对话评价员对LLMs进行基准评估的框架 2505.22777v1 -
263 05-28 GraphNarrator: Generating Textual Explanations for Graph Neural Networks GraphNarrator: Erzeugen von Texterklärungen für Graph Neuronale Netzwerke 图示记录器:生成图形神经网络的文字解释 2410.15268v2 -
264 05-28 Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages Zählen von Bäumen: Eine baumbankgetriebene Erforschung syntaktischer Variationen in Sprache und Schrift über Sprachen hinweg 计数树:在树库驱动下探索不同语言的言语和书写方式的口语和书写方式差异 2505.22774v1 -
265 05-28 Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems Automatisierte Bewertung der Annotationen von automatisierten Feedback-Systemen 自动反馈系统自动读取系统输入说明 2505.22771v1 -
266 05-28 Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction Brauchen wir noch menschliche Annotatoren? Prompting große Sprachmodelle für Aspect Sentiment Quad Prediction 我们还需要人工标注员吗?提示大语言模型进行方面情感四元组预测 2502.13044v3 -
267 05-28 A Survey of Uncertainty Estimation Methods on Large Language Models Eine Übersicht über Methoden der Unsicherheitsschätzung bei großen Sprachmodellen 大语言模型不确定性估算方法调查 2503.00172v2 -
268 05-28 StressTest: Can YOUR Speech LM Handle the Stress? StressTest: Kann Ihre Rede LM mit dem Stress umgehen? 压力测试:你的演讲能解决压力吗? 2505.22765v1 -
269 05-28 Decomposed Opinion Summarization with Verified Aspect-Aware Modules Zerlegte Meinungszusammenfassung mit verifizierten Aspect-Aware-Modulen 与经核查的光谱软件模块拆解的意见摘要 2501.17191v3 -
270 05-28 Resolving Lexical Bias in Model Editing Lösung Lexischer Bias in der Modellbearbeitung 解析示范编辑中的法理偏见 2408.10411v3 -
271 05-28 FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian FAMA: Das erste großformatige Open-Science-Sprechstiftungsmodell für Englisch und Italienisch FAMA:英语和意大利语第一个大型开放科学演讲基金会模型 2505.22759v1 -
272 05-28 FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference FlashFormer: Ganzmodell-Kernel für effiziente Low-Batch-Inferenz FlashFormer: 用于高效低批量推断的全模块内核 2505.22758v1 -
273 05-28 Pre-Training Curriculum for Multi-Token Prediction in Language Models Pre-Training Curriculum für Multi-Token-Vorhersage in Sprachmodellen 语言模式多肯预测培训前课程 2505.22757v1 -
274 05-28 Decomposing Elements of Problem Solving: What “Math” Does RL Teach? Zersetzende Elemente der Problemlösung: Was “Math” lehrt RL? 问题解决的分解要素:RL教什么“马思”? 2505.22756v1 -
275 05-28 VideoRAG: Retrieval-Augmented Generation over Video Corpus VideoRAG: Retrieval-Augmented Generation über Video Corpus VideoRAG: 利用视频公司回收的原始一代 2501.05874v3 -
276 05-28 Climate Finance Bench Klimafinanzierungs-Benchmark 气候融资基准 2505.22752v1 -
277 05-28 AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models AutoL2S: Auto-Lang-Short-Reasoning für effiziente große Sprachmodelle 自动L2S:高效大语言模式的自动长期短期理由 2505.22662v1 -
278 05-28 GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning GuessArena: Raten Sie, wer ich bin? Ein selbstadaptives Framework zur Bewertung von LLMs in Domain-spezifischem Wissen und Vernunft GuessArena:猜猜我是谁? 评估特定知识和理由领域LMLM的自我激励框架 2505.22661v1 -
279 05-28 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model 3DLLM-Mem: Langzeit-Raum-Temporal-Speicher für körpereigenes 3D-Großsprachmodell 3DLLM-Mem:3D大语言模型内嵌成的3D大语言长期空间-时间记忆 2505.22657v1 -
280 05-28 VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v1 -
281 05-28 Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents Position: Ungewissheitsquantifizierung braucht eine Neubewertung für großsprachige Modellagenten 位置:大语言示范物剂的不确定性量化需求评估 2505.22655v1 -
282 05-28 The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason Das Klettern schnitzt Weisheit tiefer als der Gipfel: Über die lärmenden Belohnungen im Lernen zur Vernunft 攀爬的雕刻比首脑会议更深的智慧:学习理性的吵闹奖励 2505.22653v1 -
283 05-28 Sherlock: Self-Correcting Reasoning in Vision-Language Models Sherlock: Selbstkorrekte Vernunft in Vision-Sprachen-Modellen 夏洛克:视觉语言模型中的自我校正理由 2505.22651v1 -
284 05-28 Training Language Models to Generate Quality Code with Program Analysis Feedback Schulung von Sprachmodellen zur Generierung von Qualitätscodes mit Feedback zur Programmanalyse 具有方案分析反馈的产生质量守则培训语言模式 2505.22704v1 -
285 05-28 WebDancer: Towards Autonomous Information Seeking Agency WebDancer: Auf dem Weg zu einer autonomen Informationsagentur WebDancer:走向自主信息搜索机构 2505.22648v1 -
286 05-28 Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese Charakterisierung von Bias: Benchmarking von großen Sprachmodellen in vereinfachter versus traditionellem Chinesisch 区分偏见:将大型语言模式与传统中文相比的简化程度基准化 2505.22645v1 -
287 05-28 Learning Composable Chains-of-Thought Komposierbare Ketten lernen-von-Gedanken 学习综合研究链 2505.22635v1 -
288 05-28 Spatial Knowledge Graph-Guided Multimodal Synthesis Raumwissen Graph-geführte multimodale Synthese 空间知识图表辅助多模式合成 2505.22633v1 -
289 05-28 Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs Stochastische Chamäleons: Halluzinationen durch irrelevanten Kontext offenbaren klassenbasierte (Fehl-)Verallgemeinerung in LLMs 随机变色龙:无关上下文幻觉揭示LLM中基于类别的(错误)泛化 2505.22630v1 -
290 05-28 Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions Chain-of-Talkers (CoTalk): Schnelle menschliche Anmerkung von Dense Image Captions 谈话链(CoTalk):人类对高密度图像描述的快速记号 2505.22627v1 -
291 05-28 Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding 快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v1 -
292 05-28 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Der Entropie-Mechanismus des Verstärkten Lernens für sinnvolle Sprachmodelle 理由语言模式强化学习的全英机制 2505.22617v1 -
293 05-28 Bridging Supervised Learning and Reinforcement Learning in Math Reasoning Bridging Supervised Learning und Verstärkung Lernen in Mathe-Reasoning 在数学原因方面的受监督学习和强化学习架桥 2505.18116v2 -
294 05-28 RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction RICO: Verbesserung der Genauigkeit und Vollständigkeit in der Bildrekapitulation durch visuelle Rekonstruktion RICO:通过视觉重建提高图像剪辑的准确性和完整性 2505.22613v1 -
295 05-28 Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations Personalisiertes Kausalgraph-Reasoning für LLMs: Eine Fallstudie zu Ernährungsempfehlungen LLM的个性化因果图推理:关于饮食建议的案例研究 2503.00134v2 -
296 05-28 AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling AutoElicit: Mit großen Sprachmodellen für vorausschauende Modellierung von Expertenvoraussagen 自动:在预测模拟中使用大语言模型,供专家使用 2411.17284v5 -
297 05-28 SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement Synworld: 用于改进制剂行动知识的虚拟情景合成 2504.03561v2 -
298 05-28 Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning Self-Error-Instruct: Verallgemeinern von Fehlern für LLMs Mathematische Begründung 自错误教学法: 数学理由LLMs 的错误一般化 2505.22591v1 -
299 05-28 Precise In-Parameter Concept Erasure in Large Language Models Präzise In-Parameter-Konzeptlöschung in großen Sprachmodellen 大语言模型中精确的在写法内概念破损 2505.22586v1 -
300 05-28 ReLearn: Unlearning via Learning for Large Language Models ReLearn: Verlernen durch Lernen für große Sprachmodelle ReLearn:通过学习实现大语言模型的遗忘 2502.11190v3 -
301 05-28 Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts Weniger, aber besser: Effiziente Mehrsprachige Erweiterung für LLMs über schichtweise Mixture-of-Experts 减少但更好:通过多层混合技术高效率地多语种扩展LLMs 2505.22582v1 -
302 05-28 Fusion Steering: Prompt-Specific Activation Control Fusionssteuerung: Prompt-spezifische Aktivierungskontrolle 融合指导:即时具体活动控制 2505.22572v1 -
303 05-28 TLUE: A Tibetan Language Understanding Evaluation Benchmark TLUE: Ein Benchmark für die Bewertung der tibetischen Sprache TLUE:西藏语言理解评估基准 2503.12051v3 -
304 05-28 Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings Denken große Sprachmodelle wie das Gehirn? Sentence-Level-Evidenz aus fMRI und Hierarchischen Einbettungen 大语言模型是否像大脑一样思考? 来自fMRI和层次化嵌入的句子级证据 2505.22563v1 -
305 05-28 Preference Adaptive and Sequential Text-to-Image Generation Präferenz Adaptive und sequentielle Text-zu-Bild-Generierung 适应性和顺序性文字到图像生成 2412.10419v2 -
306 05-28 ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM ClaimPKG: Verbesserung der Claim-Verifikation durch Pseudo-Subgraphen-Generation mit leichtgewichtiger Spezial-LLM ClaimPKG: 通过使用轻量级专用LLM的伪子图生成加强声明核查 2505.22552v1 -
307 05-28 Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs Emotion-o1: Adaptive lange Begründung für emotionales Verständnis in LLMs 情感-o1:在LLMs中为情感理解提供适应性长的理由 2505.22548v1 -
308 05-28 Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments Moderating Harm: Benchmarking von großen Sprachmodellen für Cyberbullying Detection in YouTube Kommentare 在YouTube评论中为网络欺凌检测的大语言模型制定基准 2505.18927v2 -
309 05-28 Thinking with Generated Images Mit generierten Bildern denken 与生成图像一起思考 2505.22525v1 -
310 05-28 SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond SynLogic: Synthesizing verifizierbare reasoning data at scale for Learning Logical Reasoning and Beyond 协同Logic:在学习逻辑理由及以后的尺度上综合可核实的理由数据 2505.19641v3 -
311 05-28 Multi-MLLM Knowledge Distillation for Out-of-Context News Detection Multi-MLLM-Wissensdestillation für Out-of-Context-Nachrichten-Erkennung 用于脱离语境新闻检测的多MLLM知识蒸馏 2505.22517v1 -
312 05-28 Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations Vernunft ist nicht alles, was Sie brauchen: Prüfung LLMs für Multi-Turn Mental Health Conversations 理由并非你所需要的全部:多发性心理健康对话的检查长 2505.20201v2 -
313 05-28 Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models Closed-Form Training Dynamics Reveal Erlernte Funktionen und lineare Struktur in Word2Vec-ähnlichen Modellen 类似Word2Vec 模型中的封闭形式培训动态观测发现特性和线形结构 2502.09863v2 -
314 05-28 EvolveSearch: An Iterative Self-Evolving Search Agent EvolveSearch: Ein iterativer, sich selbst entwickelnder Suchagent EvolveSearch: 动态自我演变搜索代理 2505.22501v1 -
315 05-28 Nonlinear second-order dynamics describe labial constriction trajectories across languages and contexts Nichtlineare Dynamiken der zweiten Ordnung beschreiben labiale Constriction-Trajektorien über Sprachen und Kontexte hinweg 非线性第二序列动态描述不同语言和背景的实验室收缩轨迹 2410.08351v3 -
316 05-28 Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks Positionale Fragilität in LLMs: Wie Offset-Effekte unser Verständnis von Gedächtnisrisiken verändern LLMM中的位置易碎性:如何重塑抵消效应,我们如何理解记忆风险 2505.13171v2 -
317 05-28 AdvAgent: Controllable Blackbox Red-teaming on Web Agents AdvAgent: Kontrollierbare Blackbox Red-Teaming auf Web-Agenten 助理:在网络代理上可控黑箱红队 2410.17401v3 -
318 05-28 Effective Context in Neural Speech Models Effektiver Kontext in neuralen Sprachmodellen 神经语音模式的有效背景 2505.22487v1 -
319 05-28 How Do LLMs Perform Two-Hop Reasoning in Context? Wie führen LLMs Zwei-Hop-Reasoning im Kontext durch? LLMs如何在上下文中执行双重理由? 2502.13913v2 -
320 05-28 FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation FitCF: Ein Framework für die automatische Feature-Importanz-geführte kontrafaktische Beispielgenerierung FitCF: 自动地物、重要引导反事实实例生成框架 2501.00777v3 -
321 05-28 ConKE: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning ConKE: Konzeptualisierung - Augmented Knowledge Editing in großen Sprachmodellen für Commonsense Reasoning CONKE: 常识理由大语言模型中概念化-增强的知识编辑 2412.11418v2 -
322 05-28 Fostering Video Reasoning via Next-Event Prediction Förderung von Video-Reasoning durch Next-Event-Vorhersage 通过下一个晚上的预测促进视频宣传 2505.22457v1 -
323 05-28 Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO Unüberwachte Nachschulung für Multi-Modal LLM Reasoning via GRPO 通过GRPO对多模态LLM推理进行无监督后训练 2505.22453v1 -
324 05-28 Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts Gender-Neutral Große Sprachmodelle für medizinische Anwendungen: Reduzierung von Bias in PubMed Abstracts 医疗应用的性别-新大语言性别模式:在普布迈德摘要中减少偏见 2501.06365v2 -
325 05-28 RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning RAG-Zeval: Auf dem Weg zu einer robusten und interpretierbaren Bewertung von RAG-Antworten durch regelgeführte End-to-End-Relation RAG-Zeval:努力通过最终至最终规则引导理由对RAG对策进行强力和解释性评价 2505.22430v1 -
326 05-28 AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy AstroVisBench: Ein Code-Bench für wissenschaftliche Computing und Visualisierung in der Astronomie AstroVisbench:天文科学计算和可视化标准 2505.20538v2 -
327 05-28 Token embeddings violate the manifold hypothesis Token-Einbettungen verletzen die mannigfaltige Hypothese 托肯嵌入违反多重假设 2504.01002v2 -
328 05-28 Scaling Reasoning without Attention Skalierung ohne Aufmerksamkeit 无人注意的调整理由 2505.22425v1 -
329 05-28 Mitigating Overthinking in Large Reasoning Models via Manifold Steering Überdenken in großen Reasoning-Modellen durch Manifold Steering verringern 通过流形引导减轻大型推理模型中的过度思考 2505.22411v1 -
330 05-28 Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring Jenseits von externen Monitoren: Verbesserung der Transparenz von großen Sprachmodellen für eine einfachere Überwachung 外部监测之外的外部监测:提高大语言模型的透明度,促进更易监测 2502.05242v2 -
331 05-28 GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM GOAT-TTS: Expressive und realistische Sprachgenerierung über eine Dual-Branch LLM GOAT-TTS:通过双层LLM, 表达和现实的发声 2504.12339v2 -
332 05-28 Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space Breaking the Ceiling: Das Potenzial von Jailbreak-Angriffen durch Erweiterung des Strategieraums erkunden 打破上限:通过扩大战略空间探索越狱袭击的可能性 2505.21277v2 -
333 05-28 Which Demographics do LLMs Default to During Annotation? Welche Demographien haben LLMs während der Annotation voreingestellt? 在批注期间,LLMs会默认给哪些人种? 2410.08820v3 -
334 05-28 LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High LLMs kämpfen, um falsche Annahmen zurückzuweisen, wenn Fehlinformationsstakes hoch sind LLM LLM 努力拒绝错误信息摄入量高时的假假设 2505.22354v1 -
335 05-28 Explicit Learning and the LLM in Machine Translation Explizites Lernen und das LLM in maschineller Übersetzung 计算机翻译方面的明确学习和LLM 2503.09454v3 -
336 05-28 Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning Was hält Set Matters für LLM Unlearning auf? Eine Fallstudie über Entity Unlearning 哪些保留LLM 重新学习的设置事项? 关于实体重新学习的案例研究 2502.11441v3 -
337 05-28 Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance Semantische Veränderung in Slowenien nachvollziehen: Ein neuartiger Datensatz und optimaler transportbasierter Abstand 跟踪斯洛文尼亚语语语义变化:新数据集和最佳运输距离 2402.16596v2 -
338 05-28 Text2Grad: Reinforcement Learning from Natural Language Feedback Text2Grad: Stärkung des Lernens aus natürlicher Sprache Feedback Text2Grad:从自然语言反馈中加强学习 2505.22338v1 -
339 05-28 Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start Multimodale Reasoning durch verstärktes Lernen mit kaltem Start fördern 通过 “ 冷起 “ 的强化学习推进多模式理由 2505.22334v1 -
340 05-28 LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models LLMs denken, aber nicht in Ihrem Fluss: Reasoning-Level-Personalisierung für Black-Box große Sprachmodelle LLM会思考,但不按你的思路:黑盒大语言模型的推理级个性化 2505.21082v2 -
341 05-28 Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering Prompt-basierte Persönlichkeit Profiling: Verstärkung Lernen für Relevanz Filtern 即时个人特征分析:加强学习促进相关性过滤 2409.04122v2 -
342 05-28 NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment NLP für soziales Gut: Eine Übersicht über Herausforderungen, Chancen und verantwortungsvolle Umsetzung NLP 社会公益:挑战、机会和负责任的部署调查 2505.22327v1 -
343 05-28 Advancing Expert Specialization for Better MoE Förderung der Experten-Spezialisierung für bessere MoE 推进专家专业化以改进MoE 2505.22323v1 -
344 05-28 Core Context Aware Transformers for Long Context Language Modeling Core Context Aware Transformers für lange Kontext-Sprachenmodellierung 长语语言建模核心认知变型器 2412.12465v2 -
345 05-28 Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation 对回收增加的一代输出进行实况调查的不确定性量化 2505.21072v2 -
346 05-28 If Pigs Could Fly… Can LLMs Logically Reason Through Counterfactuals? Wenn Schweine fliegen könnten… können LLMs logischerweise durch Gegenfakten denken? 如果猪能飞… LLM能否通过反事实进行逻辑推理? 2505.22318v1 -
347 05-28 MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections MUDDFormer: Breaking Residual Engpässe in Transformatoren über Multiway Dynamic Dense Connections MUDDFormer:通过多路动态感应连接在变形器中打破残余瓶颈 2502.12170v2 -
348 05-28 Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching Kann Code-Switched Texts einen Wissensschalter in LLMs aktivieren? Eine Fallstudie zum Englisch-Koreanischen Code-Switching 密码转换的文本能否激活LLML中的知识开关? 关于英朝法典转换的案例研究 2410.18436v2 -
349 05-28 LLäMmlein: Compact and Competitive German-Only Language Models from Scratch LLäMmlein: Kompakte und wettbewerbsfähige deutschsprachige Sprachmodelle von Scratch LläMmlein:来自斯克拉奇的契约和竞争性独德语言模式 2411.11171v4 -
350 05-28 Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing Adaptive Entgiftung: Schutz der allgemeinen Fähigkeiten von LLMs durch Toxicity-Aware Knowledge Editing 适应性解毒:通过毒理学知识编辑来保护长效虫的一般能力 2505.22298v1 -
351 05-28 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training 360-LLaMA-Factory: Plug & Play-Sequenz-Parallelität für langes Nachtraining 360-LLaMA-Factory: 长期培训之后的插件和播放序列平行主义 2505.22296v1 -
352 05-28 Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond Light-R1: Curriculum SFT, DPO und RL für Long COT aus Scratch und darüber hinaus Light-R1:SFT、DPO和RL课程,用于Scratch及以后的长期COT 2503.10460v4 -
353 05-28 Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs Kompensieren für Daten mit Vernunft: Low-Resource-Maschinenübersetzung mit LLMs 以合理理由补偿数据:低资源机器翻译与LLMM 2505.22293v1 -
354 05-28 Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling Das Unlösbare neu denken: Wenn In-Context Search Test-Time Scaling trifft 重新思考无法解答的问题: 当 In-Ctext 搜索遇到测试时间缩放时 2505.22290v1 -
355 05-28 Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review Natürliche Sprachverarbeitung zur Unterstützung der evidenzbasierten Medizin: Eine Bewertung 支持循证医学的自然语言处理:范围审查 2505.22280v1 -
356 05-28 Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration 利用语言代理框架中的双重进程理论促进实时同时人类-AI合作 2502.11882v5 -
357 05-28 Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead Kartierung der Landschaft der afrikanischen NLP: Mapping Progress and Shaping the Road Ahead 绘制非洲全国土地规划方案景观图:绘制进展图和绘制前面的道路图 2505.21315v2 -
358 05-28 PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy PreP-OCR: Eine komplette Pipeline für die Wiederherstellung von Dokumentenbildern und verbesserte OCR-Genauigkeit PreP-OCR:一个完整的恢复文件图像和增强OCR准确性管道 2505.20429v2 -
359 05-28 Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages Umfassende Bewertung der Lexikalen Normalisierung: Grenzen-Bewusste Ansätze für ungesegmentierte Sprachen 综合评价词汇正常化:未分语言的边界意识方法 2505.22273v1 -
360 05-28 Reward Generalization in RLHF: A Topological Perspective Lohnverallgemeinerung in RLHF: Eine topologische Perspektive RLHF的奖励普遍化:地形学观点 2402.10184v7 -
361 05-28 Odysseus Navigates the Sirens’ Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation Odysseus navigiert das Lied der Sirenen: Dynamische Fokusdekodierung für die faktuelle und vielfältige Open-Ended Text Generation Odysseus 导航《锡伦斯之歌:事实和多样化的不限名额文本生成的动态焦点解码》 2503.08057v2 -
362 05-28 AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments KI für die Klimafinanzierung: Agentische Retrieval- und Multi-Step-Gründung für Frühwarnsystem-Investitionen AI 气候融资:预警系统投资的 “ 恢复 “ 和 “ 多重理由 “ 2504.05104v2 -
363 05-28 Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models Testzeit-Impfung: Ein universelles Abwehr-Rahmenwerk gegen Jailbreaks für (Multimodale) große Sprachmodelle 测试时免疫:针对(多模态)大语言模型越狱的通用防御框架 2505.22271v1 -
364 05-28 Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL Denken lernen: Adaptive Reasoning in R1-Style-Modellen über Multi-Stage RL gestalten 学习思考何时思考:通过多级 RL 在 R1- 标准模型中塑造适应性理性 2505.10832v2 -
365 05-28 MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps MRT bei SemEval-2025 Task 8: Maximierung der Erholung von Tischen mit mehreren Schritten SemEval-2025 MRT 任务8:最大限度地从有多个步骤的表格中复苏 2505.22264v1 -
366 05-28 Something’s Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks Irgendetwas ist Fishy In The Data Lake: Eine kritische Neubewertung der Tabelle Union Suche Benchmarks “数据湖中的鱼:对表格联合搜索基准的重要重新评估” 2505.21329v2 -
367 05-28 Train Sparse Autoencoders Efficiently by Utilizing Features Correlation Bahnsparse Autoencoder effizient durch die Nutzung von Funktionen Korrelation 通过使用地物关联, 高效地列列“ 分散的自动编译器” 。 2505.22255v1 -
368 05-28 Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition Bewertung von LLMs in Speech wird oft abgeflacht: Testset Kontaminierung in großen Sprachmodellen für die Spracherkennung 对演讲中LLMs的评价经常是片断的:在大语言语音识别模型中测试设置污染 2505.22251v1 -
369 05-28 Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices Bewertung kompakter LLMs für iberische Zero-Shot-Sprachaufgaben auf Endbenutzer-Geräten 评估用于终端用户设备上零样本伊比利亚语言任务的紧凑型LLM 2504.03312v2 -
370 05-28 Overcoming Non-monotonicity in Transducer-based Streaming Generation Überwindung der Nichtmonotonizität in der Transducer-basierten Streaming-Generation 克服基于基于跨国公司的溪流一代中的非分子性 2411.17170v2 -
371 05-28 On Provable Length and Compositional Generalization Über beweisbare Längen- und kompositorische Verallgemeinerung 关于可证明的长度泛化与组合泛化 2402.04875v6 -
372 05-28 BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain BioHopR: Ein Benchmark für Multi-Hop, Multi-Answer Reasoning in der biomedizinischen Domäne BioHopR:生物医学领域多层次、多层次原因基准 2505.22240v1 -
373 05-28 A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity Eine sprachlich motivierte Analyse der intonationalen Phrasierung in Text-to-Speech-Systemen: Lücken in der syntaktischen Sensibilität offenbaren 以语言动机动动分析从文字到语音系统中的国与国之间的内对文到语音系统中的图片分析:在同步感应方面消除差距 2505.22236v1 -
374 05-28 Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models Qualität Across-Sprachen beurteilen: Ein mehrsprachiger Ansatz zur Vorschulung von Datenfiltern mit Sprachmodellen 判断各语文的质量:采用多种语文办法,利用语言模式进行培训前数据过滤 2505.22232v1 -
375 05-28 Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis Advancing Hearing Assessment: ASR-basierter frequenzspezifischer Sprachtest zur Diagnose von Presbycusis 推进听力评估:基于AR的诊断预视能力频率特定语音测试 2505.22231v1 -
376 05-28 Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks Ausgewogene Berechnungslast und Darstellungsexpressivität in parallelen Hybrid-Neuralen Netzwerken 在平行混合神经网络中平衡计算负载和代表表达式 2505.19472v2 -
377 05-28 Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection Kontinuierliche Selbstverbesserung von großen Sprachmodellen durch Test-Zeit-Training mit Verifier-getriebener Probenauswahl 通过测试时间培训不断自我改进大语言模型,并进行验证-驱动抽样选择 2505.19475v2 -
378 05-28 You Do Not Fully Utilize Transformer’s Representation Capacity Sie nicht voll nutzen Transformer-Repräsentanz Kapazität 您没有充分利用变换器的代表能力 2502.09245v2 -
379 05-28 Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning Anpassung vorgebildeter Sprachmodelle für die Klassifizierung von Zitationen über selbstüberwachtes kontrastives Lernen 调整通过自我监督反竞争学习的招录分类的训练前语言模式 2505.14471v2 -
380 05-28 Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation Look & Mark: Leveraging Radiologe Eye Fixations und Bounding Boxen in multimodalen großen Sprachmodellen für die Erzeugung von Röntgenberichten im Brustkorb Look & Mark: 将辐射学家眼修补和检查框用于胸前X光报告生成的多模式大语言模型中 2505.22222v1 -
381 05-28 Advancing Sequential Numerical Prediction in Autoregressive Models Advancing Sequential Numerical Prediction in Autoregressive Modelle 自动递减模型中推进序列序号预测 2505.13077v2 -
382 05-28 The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants Die Avengers: Ein einfaches Rezept für die Vereinigung kleinerer Sprachmodelle, um proprietäre Riesen herauszufordern 《复仇者:将小型语言模式联合起来挑战产权巨人挑战小型语言模式的简单食谱》 2505.19797v2 -
383 05-28 On the Within-class Variation Issue in Alzheimer’s Disease Detection Zur klasseninternen Variationsfrage bei der Alzheimer-Erkennung 阿尔茨海默氏氏病检测的 类内变化变化问题 2409.16322v2 -
384 05-28 Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity Pangu Pro MoE: Mischung aus gruppierten Experten für effiziente Sparsamkeit Pangu Pro MoE:高效公平问题专家组混合 2505.21411v2 -
385 05-28 Pitfalls of Rule- and Model-based Verifiers – A Case Study on Mathematical Reasoning Pitfalls of Rule- and Model-based Verifiers – Eine Fallstudie zur mathematischen Begründung 规则和基于示范的验证符咒 – – 关于数学理由的个案研究 2505.22203v1 -
386 05-28 Let’s Predict Sentence by Sentence Lasst uns Satz für Satz vorhersagen 让我们逐句预测 2505.22202v1 -
387 05-28 Machine Translation Models are Zero-Shot Detectors of Translation Direction Maschinelle Übersetzungsmodelle sind Null-Schuss-Detektoren der Übersetzungsrichtung 机器翻译模型是翻译方向零热探测器 2401.06769v4 -
388 05-28 ClonEval: An Open Voice Cloning Benchmark ClonEval: Eine offene Stimme Klon-Benchmark ClonEval: 开放语音克隆基准 2504.20581v2 -
389 05-28 PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims PEDANTIC: Ein Datensatz für die automatische Prüfung der Wirksamkeit von Patentansprüchen PEDANTIC: 自动审查专利索赔的缺陷数据集 2505.21342v2 -
390 05-28 Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon Breaking the Cloak! Enthüllung der chinesischen verhüllten Toxizität mit Homophon Graph und giftigem Lexikon 破解衣物! 中华便衣毒物与同声图和毒毒词汇结合 2505.22184v1 -
391 05-28 TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation TabXEval: Warum ist das ein schlechter Tisch? Eine eXhaustive Rubrik für die Tabellenbewertung TabXEval: 为什么这是一张糟糕的桌子? 用于表格评价的 e Xhaustive Rubric 2505.22176v1 -
392 05-28 Reverse Preference Optimization for Complex Instruction Following Reverse-Preference-Optimierung für komplexe Instruktionen 复杂指令的逆偏偏优化 2505.22172v1 -
393 05-28 ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments ZuverlässigEval: Ein Rezept für die stochastische LLM-Bewertung über die Methode der Momente 可靠有效:通过瞬间方法进行沙尘暴 LLM评价的食谱 2505.22169v1 -
394 05-28 Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search Tempest: Autonomes Multi-Turn-Jailbreaking von großen Sprachmodellen mit Baumsuche 暴风:利用树木搜索的大型语言模型的多发自动破获多语监狱 2503.10619v5 -
395 05-28 Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes Kontinuierliche und diskrete Diffusion mit nicht gleichzeitigen Diffusionsprozessen 与非平行扩散进程一起进行连续和分解的不连续和分解文本传播 2505.22165v1 -
396 05-28 Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy Stratifizierte selektive Probenahme für Instruction Tuning mit dedizierter Scoring-Strategie 使用专用 Scoring 战略进行教学指示指示的分批选择性抽样 2505.22157v1 -
397 05-28 Towards Practical Defect-Focused Automated Code Review Auf dem Weg zu einer praktischen fehlerorientierten automatisierten Code-Überprüfung 走向实际失效-受污染的自动编码审查 2505.17928v2 -
398 05-28 InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing InComeS: Integration von Kompressions- und Auswahlmechanismen in LLMs für effiziente Modellbearbeitung 因果:将压缩和甄选机制纳入高效模式编辑LLMLM 2505.22156v1 -
399 05-28 Incentivizing Strong Reasoning from Weak Supervision Starke Vernunft von schwacher Aufsicht anregen 以弱监管为强力理由的激励 2505.20072v2 -
400 05-28 Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language Flexible Werkzeugauswahl durch Low-dimensionale Attributausrichtung von Vision und Sprache 通过视力和语言的低维属性一致进行灵活工具选择 2505.22146v1 -
401 05-28 LLMs Reproduce Stereotypes of Sexual and Gender Minorities LLMs reproduzieren Stereotypen sexueller und geschlechtsspezifischer Minderheiten LLMs 重塑对性和性别少数群体的陈规定型观念 2501.05926v2 -
402 05-28 EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning EPO: Explizite politische Optimierung der strategischen Vernunft in LLMs durch Verstärkungslernen EPO: 通过强化学习,在LLMs中明确政策优化战略理由 2502.12486v6 -
403 05-28 Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments Begrenzte Verallgemeinerbarkeit im Argumentbergbau: State-of-The-Art-Modelle lernen Datensätze, keine Argumente 《争议采矿业的限制性通用性:国家与艺术中的模式学习数据集,非论据》 2505.22137v1 -
404 05-28 RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding RAD: Redundanz-Bewusst-Destillation für Hybridmodelle über selbstspekulative Decodierung RAD: 通过自投机代号为混合模型进行再利用-软件蒸馏 2505.22135v1 -
405 05-28 EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning EULER: Verbesserung der vernünftigen Fähigkeit großer Sprachmodelle durch fehlerinduziertes Lernen EULER:通过错误引起的学习提高大语言模式的理性能力 2505.22131v1 -
406 05-28 Towards Achieving Concept Completeness for Textual Concept Bottleneck Models Auf dem Weg zur Verwirklichung des Konzepts Vollständigkeit für textuelle Konzepte Engpassmodelle 实现文本概念瓶颈模式概念完整性 2502.11100v3 -
407 05-28 LoKI: Low-damage Knowledge Implanting of Large Language Models LoKI: Low-Damage Knowledge Implanting von großen Sprachmodellen LoKI: 低损害知识植入大语言模型 2505.22120v1 -
408 05-28 Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: Eine Geschichte von zwei Ansätzen 多语种和跨语种检索实况调查索赔:两种方法的故事 2505.22118v1 -
409 05-28 Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model Multimodale Vorhersage von Sparse Intraoperativen Hypotonieereignissen durch Sprachmodell 以语言模式为动力的草散的不合作和不连续活动多式预报 2505.22116v1 -
410 05-28 Mitigating Text Toxicity with Counterfactual Generation Eindämmung der Texttoxizität mit kontrafaktischer Generierung 减少毒剂毒性,同时防止产生事实上的产生 2405.09948v3 -
411 05-28 CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature CHIMERA: Eine Wissensbasis der Ideenrekombination in der wissenschaftlichen Literatur CHIMERA:科学文献中思想再融合的知识库 2505.20779v2 -
412 05-28 THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models THINK-Bench: Bewertung der Denkeffizienz und Chain-of-Thought-Qualität großer Reasoning-Modelle THINK-Bench:评估大型推理模型的思维效率与思维链质量 2505.22113v1 -
413 05-28 Redundancy Principles for MLLMs Benchmarks Redundanzgrundsätze für MLLM-Benchmarks MLLLMs基准标准的裁员原则 2501.13953v2 -
414 05-28 Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge Generatives Rahmenwerk für personalisierte Überzeugung: Aufschluss über Kausal-, Gegen- und Latentenwissen 个性化观察分析的生成框架:推断因果关系、反事实和隐藏知识 2504.13904v2 -
415 05-28 Curse of High Dimensionality Issue in Transformer for Long-context Modeling Fluch der Hochdimensionalitätsfrage im Transformer für die Langkontextmodellierung 变异器中高多维度问题的诅咒,用于长期建模 2505.22107v1 -
416 05-28 Visuospatial Cognitive Assistant Visuospatial Cognitive Assistant 活性呼吸空间感知助理 2505.12312v3 -
417 05-28 Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding Deep Video Discovery: Agentische Suche mit Tool-Nutzung für Langzeit-Video-Verständnis 深视频发现: 用于远程视频理解的工具的 Agric 搜索 2505.18079v2 -
418 05-28 Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts Auf dem Weg zur Visuospatialen Kognition durch hierarchische Fusion von visuellen Experten 争取通过视觉专家的等级化融合实现纵向空间聚合 2505.12363v3 -
419 05-28 MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models MemOS: Ein Betriebssystem für die speichergesteigerte Generation (MAG) in großen Sprachmodellen MemOS:大语言模型中记忆增强生成操作系统(MAG) 2505.22101v1 -
420 05-28 K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor K-COMP: Retrieval-Augmented Medical Domain Frage beantworten mit wissensinjizierten Kompressor K- COMP: 以知识输入压缩器回答问题 2501.13567v3 -
421 05-28 Enhancing Target-unspecific Tasks through a Features Matrix Verbesserung von Ziel-unspezifischen Aufgaben durch eine Features Matrix 通过特征矩阵,加强针对特定目标的任务 2505.03414v4 -
422 05-28 Knowledge Base Construction for Knowledge-Augmented Text-to-SQL Knowledge Base Construction für wissensbasierte Text-zu-SQL 知识强化文字到SQL知识基础建设 2505.22096v1 -
423 05-28 Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning Lernen, Abfragen über Wissensdatenbanken zu routen, um schrittweise retrieval-augmented reasoning 学习如何通过不同知识库的路径查询,以逐步检索推荐理由 2505.22095v1 -
424 05-28 Visual Cues Support Robust Turn-taking Prediction in Noise Visuelle Queues unterstützen robuste Turn-Take Vorhersage in Lärm 视觉剖面支持强力转动噪音预测 2505.22088v1 -
425 05-28 Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations Domain-spezifisches Pruning von großen Mixture-of-Experts-Modellen mit nur wenigen Demonstrationen 大型混合型专家模型的域特定情景,少发示范 2504.06792v2 -
426 05-28 LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation LongReD: Degradierung von Langtext-Großen Sprachmodellen durch Restaurationsdestillation LongReD:通过恢复蒸馏减少长长长大语言模型的短期退化 2502.07365v3 -
427 05-28 ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation ArgInstruct: Spezialisiertes Instruktions-Feintuning für Computerargumentation ArgInstruct: 面向计算论辩的专门指令微调 2505.22076v1 -
428 05-28 GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking GraphCheck: Langfristige Textbarrieren mit extrahiertem Wissen durchbrechen Graph-Powered Fact-Checking 图表检查:利用提取知识图示根据事实进行实况调查打破长期文本障碍 2502.16514v4 -
429 05-28 PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models PRMBench: Ein feinkörniger und anspruchsvoller Benchmark für Prozess-Level-Reward-Modelle PRMBench:面向过程级奖励模型的细粒度且具挑战性的基准 2501.03124v4 -
430 05-28 Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO Jenseits der Pfadauswahl: Bessere LLMs für wissenschaftliche Information Extraktion mit MimicSFT und Relevanz und Regel-induziert (R$^2$)GRPO 超越路径选择:利用 MimicSFT 以及相关性与规则诱导的 (R$^2$)GRPO 改进用于科学信息抽取的LLMs 2505.22068v1 -
431 05-28 LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation LINGOLY-TOO: Entwirren von Vernunft aus Wissen mit templatisierter Orthografie-Verschleißung LINGOLY-TOO: 脱离与电磁矫形模糊学知识脱钩的原因 2503.02972v5 -
432 05-28 Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks Walk&Retrieve: Einfache und dennoch effektive Null-Schuss-Erzeugung durch Knowledge Graph Walks 漫步检索: 简单但有效的零光检索通过知识图表漫步生成 2505.16849v2 -
433 05-28 Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home? Schutz der Privatsphäre von Retrieval-Daten gegen Mitgliedschaft Inferenz Angriffe: Ist diese Frage zu nah zu Hause? 保护检索数据隐私,防止成员推断攻击:这个查询是否离家太近? 2505.22061v1 -
434 05-28 A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment Eine umfassende Umfrage in LLM(-Agent) Full Stack Sicherheit: Daten, Schulung und Bereitstellung 用LLLM(-代理)全堆安全:数据、培训和部署进行的全面调查 2504.15585v3 -
435 05-28 Benchmarking LLMs’ Swarm intelligence Benchmarking der Swarm-Intelligenz der LLM 基准确定LLLMs的Swarm情报 2505.04364v3 -
436 05-28 WiseMind: Recontextualizing AI with a Knowledge-Guided, Theory-Informed Multi-Agent Framework for Instrumental and Humanistic Benefits WiseMind: Rekontextualisieren von KI mit einem wissensorientierten, theorieinformierten Multi-Agenten-Rahmenwerk für instrumentelle und humanistische Vorteile Wisemind: 重新将AI与知识指导、理论化的多机构工具与人文效益多机构框架重新翻版 2502.20689v2 -
437 05-28 Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v3 -
438 05-28 Voice Adaptation for Swiss German Sprachanpassung für Schweizer Deutsch 瑞士德语语音改造 2505.22054v1 -
439 05-28 CoSER: Coordinating LLM-Based Persona Simulation of Established Roles CoSER: Koordinierung der LLM-basierten Persona-Simulation etablierter Rollen CoSER: 协调基于LLM的既定角色人格模拟 2502.09082v2 -
440 05-28 In-context Language Learning for Endangered Languages in Speech Recognition Im Zusammenhang mit dem Sprachenlernen für gefährdete Sprachen in der Spracherkennung 在语音识别中为濒危语言进行内通语言学习 2505.20445v2 -
441 05-28 KaFT: Knowledge-aware Fine-tuning for Boosting LLMs’ Domain-specific Question-Answering Performance KaFT: Knowledge-aware Feinabstimmung zur Steigerung der Domain-spezifischen Frage-Antwort-Leistung von LLMs KaFT: 提高LLMs特定领域问答性能的知识感知微调 2505.15480v2 -
442 05-28 Revisiting In-Context Learning with Long Context Language Models Das In-Context-Lernen mit langen Kontext-Sprachmodellen 以长方语言模式重新研究内文学习 2412.16926v3 -
443 05-28 FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis FCKT: Feinkörniger Cross-Task-Wissenstransfer mit semantischem Kontrast-Lernen für gezielte Stimmungsanalyse FCKT: 精细的跨任务知识转让,通过语义对抗学习进行有针对性的感应分析 2505.21040v2 -
444 05-28 Wolf Hidden in Sheep’s Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models Wolf versteckte sich in Schafsgesprächen: Auf dem Weg zu harmlosen datenbasierten Hintertürangriffen für Jailbreaking Large Language Models 隐藏在羊羊的谈话中的狼:为破碎大语言模范破碎的监狱进行无恶意的以数据为基础的后门攻击 2505.17601v2 -
445 05-28 Jailbreak Distillation: Renewable Safety Benchmarking Jailbreak Destillation: Benchmarking für erneuerbare Sicherheit 蒸馏:可再生能源安全基准 2505.22037v1 -
446 05-28 Inference-time Alignment in Continuous Space Inferenz-Zeit-Ausrichtung im Dauerraum 连续空间的推推-时间对齐 2505.20081v2 -
447 05-28 Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game Feinkörnige und thematische Bewertung von LLMs im Social Deduction Game 社会下社会游戏LLMs的精细和专题评价 2408.09946v2 -
448 05-28 Shaping Shared Languages: Human and Large Language Models’ Inductive Biases in Emergent Communication Shaping Shared Languages: Induktive Biase von menschlichen und großen Sprachmodellen in Emergent Communication 塑造共同语言:新兴交流中的人类和大语言模型的感性偏见 2503.04395v2 -
449 05-28 VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning VRAG-RL: Empower Vision-Perception-Based RAG für visuell reiches Informationsverständnis über iteratives Reasoning mit Verstärkungslernen VRAG-RL: 通过强化学习的迭代理由,增强基于愿景-观点的RAG, 以便通过强化学习获得视觉上丰富的信息了解 2505.22019v1 -
450 05-28 CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models CoThink: Token-Efficient Reasoning über Instruct Models Guiding Reasoning Models CoThink: 通过指令模型引导推理模型实现节省词元的推理 2505.22017v1 -
451 05-28 Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains Domaino1s: Leitende LLM-Gründung für erklärbare Antworten in High-Stakes-Domains 域1:在高占用域中解释可解答案的 指导性LLM 2501.14431v2 -
452 05-28 CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models CogniBench: Ein gesetzlich inspirierter Rahmen und Datensatz zur Bewertung der kognitiven Treue großer Sprachmodelle CogniBench:评估大语言模型认知性信仰的受法律启发的框架和数据集 2505.20767v2 -
453 05-28 Faster and Better LLMs via Latency-Aware Test-Time Scaling Schnellere und bessere LLMs über Latency-Aware Test-Time Scaling 通过远程智能测试时间缩放,更快和更好LLMs 2505.19634v3 -
454 05-28 Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance Legal Assist KI: Nutzung von Transformer-basiertem Modell für effektive Rechtshilfe AI:利用基于变换器的有效法律援助模式 2505.22003v1 -
455 05-28 Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations Vergleich von Moralwerten in westlichen englischsprachigen Gesellschaften und LLMs mit Word Associations 比较西英语社会道德价值和LLMs与文字协会 2505.19674v2 -
456 05-28 Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate Gefunden in Übersetzung: Mehrsprachige LLM-Konsistenz so einfach wie übersetzen dann bewerten 在翻译中找到: 测量多语种LLM一致性, 简单如翻译,然后评价 2505.21999v1 -
457 05-28 Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data Leveraging Interview-informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data 利用访谈形成的LLMs参与示范调查应对措施:从AI光学和人类数据中比较洞察力 2505.21997v1 -
458 05-28 A Checks-and-Balances Framework for Context-Aware Ethical AI Alignment Ein Checks-and-Balances-Framework für kontext-aware Ethische AI Alignment 上下文软件道德操守统一校验和平衡框架 2502.00136v3 -
459 05-28 How to Synthesize Text Data without Model Collapse? Wie können Sie Textdaten ohne Modellkollaps synthesieren? 如何在没有模式折叠的情况下合成文本数据 ? 2412.14689v3 -
460 05-28 Learning Compositional Behaviors from Demonstration and Language Kompositionsverhalten aus Demonstration und Sprache lernen 学习示范和语言的构成行为 2505.21981v1 -
461 05-28 Sun-Shine: A Foundation Large Language Model for Tibetan Culture and Heritage Sun-Shine: Ein großes Sprachmodell der Stiftung für tibetische Kultur und Kulturerbe 阳光:西藏文化和遗产大语言模式基金会 2503.18288v3 -
462 05-28 Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset Perle: Ein multimodaler kulturbewusster arabischer Unterrichtsdatensatz 珍珠:多式文化-知识阿拉伯文教学数据集 2505.21979v1 -
463 05-28 Advancing Reasoning in Large Language Models: Promising Methods and Approaches Reasoning in großen Sprachmodellen fördern: Promising Methods and Approaches 大语言模式的推进理由:有希望的方法和办法 2502.03671v2 -
464 05-28 Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models Graph-beschränkte Vernunft: Treue Vernunft auf Wissensgraphen mit großen Sprachmodellen 受图表限制的理由:关于大语言模型知识图的忠实理由 2410.13080v2 -
465 05-28 Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA Erfahrung Retrieval-Augmentation mit elektronischen Gesundheitsakten ermöglicht genaue Entladung QA 使用电子健康记录使准确释放QA能够准确释放的经验回收-升级 2503.17933v2 -
466 05-28 Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack Die Bedrohung sehen: Schwachstellen in Visions-Sprachenmodellen für feindliche Angriffe 目睹威胁:视觉-语言模型对对抗性攻击的脆弱性 2505.21967v1 -
467 05-28 Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing Heterogene Token-Übertragung in LLM-Wissensbearbeitung abmildern 减轻LLLM知识编辑中变异式 Tok 超称 2502.00602v2 -
468 05-28 MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing MapStory: LLM-Powered Text-Driven Map Animation Prototyping mit Human-in-the-Loop-Editing 地图片断: 由LLM 授权的文本驱动地图动画动画与在 Loop 用户编译 2505.21966v1 -
469 05-28 UI-Evol: Automatic Knowledge Evolving for Computer Use Agents UI-Evol: Automatisches Knowledge Evolving für Computer Use Agents UI-Evol:计算机使用代理自动知识演化 2505.21964v1 -
470 05-28 LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents LaMDAgent: Autonomer Rahmen für die Post-Training-Pipeline-Optimierung über LLM-Agenten LaMDAgent:通过LLM智能体优化后训练流程的自主框架 2505.21963v1 -
471 05-28 EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles EnsemW2S: Verbesserung der Schwach-zu-Strong-Verallgemeinerung mit großsprachigen Modellensembles EnsemW2S:用大语言模型组合加强弱至强的通用化 2505.21959v1 -
472 05-28 Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning Lösung von Wissenskonflikten in der bereichsspezifischen Datenauswahl: Eine Fallstudie zur medizinischen Instruktions-Tuning 解决特定领域数据选择方面的知识冲突:关于医疗指示调整的个案研究 2505.21958v1 -
473 05-28 VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning für die Sprachverarbeitung VQ-CTAP: 处理发言的跨模式精细序列代表性学习 2408.05758v2 -
474 05-28 Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation Testzeitskalierung mit wiederholter Probenahme verbessert die Mehrsprachigkeitsgenerierung 具有重复抽样的测试时间缩放改进多语种文本的生成 2505.21941v1 -
475 05-28 RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering RISE: Grundlegende Verbesserung durch iterative Selbst-Exploration in der Multi-Hop-Fragebeantwortung RISE: 多呼问答问答中通过迭代自我探索提高合理性 2505.21940v1 -
476 05-28 EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios EduBench: Ein umfassender Benchmarking-Datensatz zur Bewertung großer Sprachmodelle in unterschiedlichen Bildungsszenarien EduBench:评估不同教育情景中大语言模型的综合基准数据集 2505.16160v3 -
477 05-28 Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages 印度语文化上可调适的可调适文化语言专题翻译 2505.21937v1 -
478 05-28 RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments RedTeamCUA: Realistisches Adversarial Testen von Computer-Use-Agenten in hybriden Web-OS-Umgebungen RedTeamCUA:在混合Web-OS环境中对计算机使用智能体进行真实的对抗测试 2505.21936v1 -
479 05-28 Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets Effizientes Ensemble für die Feinabstimmung von Sprachmodellen auf mehreren Datensätzen 多个数据集微调语言模型高效组合组合 2505.21930v1 -
480 05-28 Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems Personalitätsbewusste Studentensimulation für gesprächsorientierte intelligente Tutoring-Systeme 具有个性意识的学生模拟交流智能教学系统的学生模拟 2404.06762v2 -
481 05-28 SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior SafetyAnalyst: Interpretierbare, transparente und Steerable Safety Moderation für KI-Verhalten 安全分析器:AI行为行为解释性、透明性和可坚固性 2410.16665v3 -
482 05-28 Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning Beyond Completion: Ein Grundlagenmodell für allgemeine Wissensgraphen-Reasoning 完成后完成:一般知识图理据基础模型 2505.21926v1 -
483 05-28 Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy Modellierung und Optimierung von Benutzereinstellungen in AI-Copiloten: Eine umfassende Umfrage und Taxonomie AI中模拟和优化用户首选模式:全面调查和分类 2505.21907v1 -
484 05-28 ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models ALPS: Aufmerksamkeit Lokalisierung und Pruning-Strategie zur effizienten Ausrichtung großer Sprachmodelle ALPS: 高效统一大语言模式的注意地方化和审慎战略 2505.18799v2 -
485 05-28 Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development Co-Saving: Ressourcenschonende Multi-Agenten-Kollaboration für Software-Entwicklung 共同节省:为开发软件进行有意识的资源、多机构协作 2505.21898v1 -
486 05-28 Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v2 -
487 05-28 Language-Specific Latent Process Hinders Cross-Lingual Performance Sprachspezifische latente Prozessverhinderer Cross-Lingual Performance 语言特定边端进程 2505.13141v2 -
488 05-28 Self-Taught Agentic Long Context Understanding Selbstlernendes Agentisches Langes Kontext-Verständnis 自我教学 自我研究 长期背景了解 2502.15920v2 -
489 05-28 Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline Pfade, die nicht genommen werden: Verstehen und Mending the Multilingual Factual Recall Pipeline 未选择的路径:理解和终止多语种事实回回回回管道 2505.20546v2 -
490 05-28 Large Vocabulary Size Improves Large Language Models Große Vokabelgröße verbessert große Sprachmodelle 大型词汇量改进大语言模式 2406.16508v2 -
491 05-28 Text Generation Beyond Discrete Token Sampling Textgenerierung jenseits diskreter Token-Probenahme 文本生成超出分解调制当量抽样 2505.14827v2 -
492 05-28 Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation Einschließlich LLMs für großräumige Urban Complex Mobility Simulation 大型城市综合流动模拟项目LLMs 2505.21880v1 -
493 05-28 Evaluating the Retrieval Robustness of Large Language Models Bewertung der Retrieval Robustheit großer Sprachmodelle 评估大语言模型的检索能力 2505.21870v1 -
494 05-28 Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering Behebung von Problemen mit der verlorenen Retrieval-Frage bei der Retrieval Augmented Multi-Hop-Fragebeantwortung 减轻在检索增加的多层次问题解答中丢失的在追索中的问题 2502.14245v2 -
495 05-28 RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph RSCF: Relation-Semantik Konsequenter Filter für Entity-Einbettung von Wissensgrafik RSCF: 用于实体嵌入知识图的 关系-语义一致性过滤器 2505.20813v2 -
496 05-28 Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs Abstand zwischen relevanten Informationsstücken verursacht Bias im Langtext LLMs 有关信息片件在长文本LLM中造成偏见的距离 2410.14641v3 -
497 05-28 Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries Prinzipierte Inhaltsauswahl zur Generierung unterschiedlicher und personalisierter Multi-Document-Zusammenfassungen 用于生成多样化和个性化多文档摘要的原则性内容选择 2505.21859v1 -
498 05-28 Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures Mini-Batch Coresets für speichereffiziente Sprachmodellschulungen auf Datenmischungen 记忆效率语言数据混合模型培训微型批量核心数据集 2407.19580v4 -
499 05-28 CULEMO: Cultural Lenses on Emotion – Benchmarking LLMs for Cross-Cultural Emotion Understanding CULEMO: Kulturelle Objektive zur Emotion – Benchmarking LLMs für Cross-Cultural Emotion Understanding CULEMO:情感文化引文 – – 衡量跨文化情感理解LMLL 2503.10688v3 -
500 05-28 Natural Language Reinforcement Learning Natürliche Sprache Stärkung Lernen 自然语言强化学习 2411.14251v3 -
501 05-27 (2) Constrained Discrete Diffusion Beschränkte diskrete Diffusion 限制的分解扩散 2503.09790v2 -
502 05-27 From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization Von EduVisBench zu EduVisAgent: Ein Benchmark- und Multi-Agent-Framework für schlussfolgerungsgetriebene pädagogische Visualisierung 从EduVisBench到EduVisAgent:面向推理驱动的教学可视化的基准和多智能体框架 2505.16832v2 -
503 05-27 Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones Lassen Sie mich nachdenken! Eine lange Gedankenkette kann exponentiell viele kurze wert sein 让我想想!一条长思维链可能抵得上指数级数量的短思维链 2505.21825v1 -
504 05-27 Understanding Synthetic Context Extension via Retrieval Heads Synthetische Kontexterweiterung über Rücklaufköpfe verstehen 通过回收头目获取理解合成背景扩展 2410.22316v4 -
505 05-27 Representative Language Generation Repräsentative Sprachgenerierung 代表性语言生成 2505.21819v1 -
506 05-27 Revisiting Common Assumptions about Arabic Dialects in NLP Häufige Annahmen über arabische Dialekte in NLP erneut besuchen 重新审视全国语言规划中阿拉伯语方言的通用假设 2505.21816v1 -
507 05-27 Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking Scientific Paper Retrieval mit LLM-geführtem semantisch-basierendem Ranking 具有LLM-Guided语义学排名的科学论文检索 2505.21815v1 -
508 05-27 ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails ThinkGuard: Besonnenes langsames Denken führt zu vorsichtigen Leitplanken ThinkGuard:审慎的慢思考带来谨慎的护栏 2502.13458v2 -
509 05-27 From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs Von Richtungen zu Kegeln: Erforschung multidimensionaler Darstellungen von propositionalen Fakten in LLMs 从方向到锥体:探索LLMs中命题事实的多维表示 2505.21800v1 -
510 05-27 Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task? Dissecting the Ullman Variations with a SCALPEL: Warum scheitern LLMs bei Trivial Alterations to the False Belief Task? 用SCALPEL剖析乌尔曼变体:为什么LLMs在错误信念任务的细微改动上失败? 2406.14737v2 -
511 05-27 Controllable Context Sensitivity and the Knob Behind It Kontrollierbarer Kontext Empfindlichkeit und der Knob dahinter 控制环境的感应度及其背后的Knob 2411.07404v3 -
512 05-27 Wanda++: Pruning Large Language Models via Regional Gradients Wanda++: Beschneiden großer Sprachmodelle über regionale Gradienten Wanda++:通过区域梯度对大语言模型进行剪枝 2503.04992v3 -
513 05-27 VeriTrail: Closed-Domain Hallucination Detection with Traceability VeriTrail: Closed-Domain Halluzination Erkennung mit Rückverfolgbarkeit VeriTrail: 带可追踪性闭路致幻觉探测 2505.21786v1 -
514 05-27 Born a Transformer – Always a Transformer? Geboren ein Transformer - immer ein Transformer? 天生的变形人 - - 总是变形人? 2505.21785v1 -
515 05-27 Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation Auf dem Weg zur Sicherheitsveranlagung in LLMs: KI-agentische Beratung für politisch eingebettete CoT-Datenerstellung 走向LLM女士中的安全理由:为制定政策的COT数据编制进行AI-Agentic 考虑 2505.21784v1 -
516 05-27 Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models Wasserzeichen im Sand: Unmöglichkeit der starken Wasserzeichen für generative Modelle 沙沙中的水印:在生成模型中使用强水标志的可能性 2311.04378v5 -
517 05-27 Layers at Similar Depths Generate Similar Activations Across LLM Architectures Ebenen in ähnlichen Tiefen erzeugen ähnliche Aktivierungen über LLM-Architekturen 类似深度的图层在LLM 结构中生成类似活动 2504.08775v2 -
518 05-27 GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task GMU-Systeme für die IWSLT 2025 Sprachübersetzung mit geringer Ressource geteilte Aufgabe GMU 2025年IWSLT 低资源语音翻译共享任务 2505.21781v1 -
519 05-27 When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction Wann geben LLMs ihre Fehler zu? Sie verstehen die Rolle des Modellglaubens bei der Retraktion LLM女士何时承认其错误? 2505.16170v2 -
520 05-27 Calibrating LLM Confidence by Probing Perturbed Representation Stability Kalibrierung des LLM-Vertrauens durch Probing Perturbed Repräsentationsstabilität 通过在有干扰的代表权方面确保稳定,以验证LLM信任度 2505.21772v1 -
521 05-27 BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum VerhaltenSFT: Behavioral Token Conditioning für klinische Wirkstoffe über das Proaktivitätsspektrum hinweg 行为SFT:横跨主动性频谱的临床药剂行为定性 2505.21757v1 -
522 05-27 FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering FRAMES-VQA: Benchmarking Fine-Tuning Robustheit über Multi-Modal Shifts in der visuellen Fragestellung FRAMES-VQA:确定视觉问题解答中多模式变化的精确调整强度基准 2505.21755v1 -
523 05-27 From prosthetic memory to prosthetic denial: Auditing whether large language models are prone to mass atrocity denialism Vom prothetischen Gedächtnis zur prothetischen Leugnung: Prüfung, ob große Sprachmodelle anfällig für Massenverleugnung sind 从假肢记忆到否认假肢:审计大型语言模式是否容易发生大规模暴行否认行为 2505.21753v1 -
524 05-27 Revisiting Bi-Linear State Transitions in Recurrent Neural Networks Bi-Lineare State Transitions in recurrenten neuralen Netzwerken erneut besuchen 在经常性神经网络中重新审查双利那尔州过渡 2505.21749v1 -
525 05-27 General-Reasoner: Advancing LLM Reasoning Across All Domains General-Reasoner: Bessere LLM-Reasonierung über alle Domains hinweg 通用Reasoner:在所有领域推推推LLM 2505.14652v4 -
526 05-27 Counterfactual Simulatability of LLM Explanations for Generation Tasks Counterfactual Simulatability von LLM-Erläuterungen für Generierungsaufgaben 世代任务LLM解释的反事实模拟性 2505.21740v1 -
527 05-27 Non-Markovian Discrete Diffusion with Causal Language Models Nicht-Markovianische Diskrepanz mit kausalen Sprachmodellen 非马尔科维语非马尔科维语分辨语言模式的传播 2502.09767v2 -
528 05-27 Assessing and Refining ChatGPT’s Performance in Identifying Targeting and Inappropriate Language: A Comparative Study Bewertung und Verfeinerung der Leistung von ChatGPT bei der Identifizierung von Targeting und unangemessener Sprache: Eine vergleichende Studie 评估和完善聊天部在确定针对性和不适当语言方面的绩效:比较研究 2505.21710v1 -
529 05-27 Do We Know What LLMs Don’t Know? A Study of Consistency in Knowledge Probing Wissen wir, was LLMs nicht wissen? Eine Studie der Konsistenz in der Wissensprobe 我们知道什么是不知道的LLLM不知道的吗?关于知识检验的一致性的研究。 2505.21701v1 -
530 05-27 MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs MAKIEval: Ein multilingualer, automatischer WiKidata-basierter Rahmen für die Bewertung des kulturellen Bewusstseins für LLMs MAKIEval:以多种语言自动维基数据为基础的LLMs文化认识评价框架 2505.21693v1 -
531 05-27 LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model LLMPR: Ein neuartiges LLM-getriebenes Transfer-Learning-basiertes Petitions-Ranking-Modell LLMPR:基于请愿排级的新式LLM-驱动转移学习模式 2505.21689v1 -
532 05-27 Empirical analysis of binding precedent efficiency in Brazilian Supreme Court via case classification Empirische Analyse der verbindlichen Präzedenzeffizienz im brasilianischen Obersten Gerichtshof über die Fallklassifizierung 通过案件分类对巴西最高法院具有约束力的先例效率进行经验分析 2407.07004v3 -
533 05-27 Probabilistic Reasoning with LLMs for k-anonymity Estimation Probabilistische Begründung mit LLMs für k-Anonymitätsschätzung K-匿名性估计法LLMs的概率推理 2503.09674v3 -
534 05-27 Language Model Alignment in Multilingual Trolley Problems Sprachmodellausrichtung in Mehrsprachigen Trolley-Problemen 多语言小龙卷风问题语言模型对齐 2407.02273v6 -
535 05-27 Rethinking the Outlier Distribution in Large Language Models: An In-depth Study Die Outlier-Distribution in großen Sprachmodellen neu denken: Eine vertiefte Studie 重新思考大语言模型的外部分布:深入研究 2505.21670v1 -
536 05-27 R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning R1-Code-Interpreter: LLMs mit Code über überwachtes und verstärktes Lernen zur Vernunft trainieren R1-Code-Code-解释:通过监督和强化学习,将培训的 “ 理性通识规范 “ 课程 2505.21668v1 -
537 05-27 Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v1 -
538 05-27 Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts Iterative Corpus-Verfeinerung für Material-Eigenschaftsvorhersage auf der Grundlage wissenschaftlicher Texte 以科学文本为基础的材料财产预测材料性迭代公司精炼 2505.21646v1 -
539 05-27 WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation WISE: Eine weltweite wissensbasierte semantische Evaluation für die Text-zu-Bild-Generierung WISE:为产生文字到图像制作而进行的世界知识化的语义评价 2503.07265v2 -
540 05-27 How does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective Wie verbessert Alignment die Mehrsprachigkeitsfähigkeiten von LLMs? 协调如何增强LLMM的多种语言能力? 2505.21505v1 -
541 05-27 Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making Schweigen ist kein Konsens: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making 沉默不是共识:通过用于临床决策的Catfish代理商在多方代理LLMs中破坏协议的偏见 2505.21503v1 -
542 05-27 ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models ViewSpatial-Bench: Bewertung multi-perspektivischer räumlicher Lokalisierung in Vision-Sprachen-Modellen 视野空间-空间区:在视觉-语言模型中评价多视角空间空间定位 2505.21500v1 -
543 05-27 Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper2Poster: Auf dem Weg zur multimodalen Plakatautomatisierung aus wissenschaftlichen Papieren Paper2Poster:从科学论文中走向多式海报自动化 2505.21497v1 -
544 05-27 UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents UI-Genie: Ein selbstverbesserender Ansatz zur iterativen Steigerung von MLLM-basierten mobilen GUI-Agenten UI-Genie: 一种自我改进的方法,用于在刺激下促进基于MLLLM的移动图形界面工具 2505.21496v1 -
545 05-27 How does Misinformation Affect Large Language Model Behaviors and Preferences? Wie wirkt sich Misinformation auf das Verhalten und die Präferenzen von großen Sprachmodellen aus? 错误信息如何影响大语言模式行为和偏好? 2505.21608v1 -
546 05-27 Reinforcing General Reasoning without Verifiers Verstärkung der allgemeinen Vernunft ohne Prüfer 加强一般理由说明,无验证人 2505.21493v1 -
547 05-27 Hardware-Efficient Attention for Fast Decoding Hardware-Effiziente Aufmerksamkeit für schnelle Dekodierung 快速下标记的硬件高效关注 2505.21487v1 -
548 05-27 Are Language Models Consequentialist or Deontological Moral Reasoners? Sind Sprachmodelle konsequentistische oder deontologische Moralverursacher? 语言模式是代名词还是代名词道德理由? 2505.21479v1 -
549 05-27 Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration Halluzination in großen Vision-Sprachen durch adaptive Aufmerksamkeitskalibrierung abmildern 通过适应性关注校准减轻大型视觉语言模型中的幻觉 2505.21472v1 -
550 05-27 Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration Skalierung externer Wissenseingaben über Kontext hinaus Windows von LLMs über Multi-Agent Collaboration 通过多机构协作,在LLMM LMLM的 “ 背景视窗 “ 之外扩大外部知识投入 2505.21471v1 -
551 05-27 Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Jenseits von ‘Aha!’: Auf dem Weg zu systematischen Meta-Fähigkeiten Ausrichtung in großen vernünftigen Modellen 超越“Aha! ” : 在大理由模型中实现系统化的元能力协调 2505.10554v2 -
552 05-27 Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion Beschleunigung der Diffusions-Sprachmodell-Inferenz durch effizientes KV-Caching und geführte Diffusion 通过高效的 KV 抓取和引导传播加速传播语言模式模型推导 2505.21467v1 -
553 05-27 Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions Erinnerung an KI neu denken: Taxonomie, Operationen, Themen und Zukunftsrichtungen AI:分类、操作、专题和未来方向 2505.00675v2 -
554 05-27 GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization GeLLMO: Verallgemeinern von großen Sprachmodellen für Multi-Property-Molekül-Optimierung GELLMO:通用多财产分子优化大语言模型 2502.13398v2 -
555 05-27 ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models ID-Align: RoPE-Conscious Position Remapping für dynamische High-Resolution-Anpassung in Vision-Language-Modellen 愿景语言模型中动态高分辨率适应的重新绘图 2505.21465v1 -
556 05-27 Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance Müssen LLMs in einer Sprache denken? Korrelation zwischen latenter Sprache und Aufgabenleistung LLM女士需要用一种语言思考吗? 2505.21458v1 -
557 05-27 Thinking beyond the anthropomorphic paradigm benefits LLM research Über das anthropomorphe Paradigma hinaus denken Vorteile der LLM-Forschung 超越人类形态范式的思考 2502.09192v2 -
558 05-27 Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication Worte wie Messer: Rückseitig-Personalisierte Modellierung und Erkennung von gewalttätiger Kommunikation 象Knives这样的词:后台化个人化和暴力通信建模和侦查 2505.21451v1 -
559 05-27 One-shot Entropy Minimization Ein Schuss Entropie Minimierung 单向最小化 Entropy 最小化 2505.20282v2 -
560 05-27 When Two LLMs Debate, Both Think They’ll Win Wenn zwei LLMs diskutieren, denken beide, dass sie gewinnen werden 当两个LLM 辩论, 双方都认为他们会赢 2505.19184v2 -
561 05-27 Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs Die Hitze aufdrehen: Min-p-Sampling für kreative und kohärente LLM-Ausgaben 翻开热热:创意和一致的LLM产出的最小抽样 2407.01082v6 -
562 05-27 ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition ANCHOLIK-NER: Ein Benchmark-Datensatz für Bangla Regional Named Entity Recognition ANCHOLIK-NER:孟加拉地区命名实体识别基准数据集 2502.11198v3 -
563 05-27 Towards Better Instruction Following Retrieval Models Auf dem Weg zu einer besseren Instruktion nach den Modellen des Wiedereintritts 在检索模型后改进教学 2505.21439v1 -
564 05-27 Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge Agentisches medizinisches Wissen Grafiken verbessern medizinische Frageantworten: Die Lücke zwischen LLMs und sich entwickelndem medizinischem Wissen überbrücken 药用知识图加强医疗问题的回答:缩小LLMM与不断发展的医学知识之间的差距 2502.13010v2 -
565 05-27 Transparent and Coherent Procedural Mistake Detection Transparente und kohärente Verfahrensfehlererkennung 透明和一致的程序错误侦测 2412.11927v2 -
566 05-27 R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing R2R: Effizientes Navigieren unterschiedlicher Vernunftpfade mit klein-großen Model Token Routing R2R: 以小型模型调速器有效导航差异性理性路径 2505.21600v1 -
567 05-27 Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives Datenmixtur für große Sprachmodelle neu denken: Eine umfassende Umfrage und neue Perspektiven 重新思考大语言模型的数据组合:全面调查和新视角 2505.21598v1 -
568 05-27 A Lightweight Method to Disrupt Memorized Sequences in LLM Eine leichte Methode zum Disruptieren von gemerkten Sequenzen in LLM LLM 中破坏记忆序列的轻量方法 2502.05159v2 -
569 05-27 Can Large Language Models Understand Symbolic Graphics Programs? Können große Sprachmodelle symbolische Grafikprogramme verstehen? 大语言模型能理解符号图形程序吗? 2408.08313v4 -
570 05-27 Efficiently Scaling LLM Reasoning with Certaindex Effiziente Skalierung der LLM-Vernunft mit Certaindex 使用 Certaindex 高效扩展 LLM 推理 2412.20993v2 -
571 05-27 RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation RefTool: Modellverbesserung mit referenzgeführter Werkzeugerstellung RefTool:在创建参考指导工具时加强示范理由 2505.21413v1 -
572 05-27 How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation Wie man sich vor 5G Strahlung schützt? LLM-Antworten auf Implizite Fehlinformationen untersuchen 如何保护自己免受5G辐射? 调查隐蔽的错误信息的LLM反应 2503.09598v2 -
573 05-27 RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models RelationalFactQA: Ein Benchmark für die Bewertung tabellarischer Fakten aus großen Sprachmodellen 关系事实QA:从大语言模型中评估列表事实检索的基准 2505.21409v1 -
574 05-27 Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling Factual Self-Awareness in Sprachmodellen: Repräsentation, Robustheit und Skalierung 语言模式中的事实自觉意识:代表性、强力和比例 2505.21399v1 -
575 05-27 DecisionFlow: Advancing Large Language Model as Principled Decision Maker DecisionFlow: Großes Sprachmodell als prinzipieller Entscheidungsträger voranbringen 决定Flow:作为有原则的决策人推进大语言模式 2505.21397v1 -
576 05-27 Leveraging Large Language Models for Active Merchant Non-player Characters Nutzung großer Sprachmodelle für aktive Händler Nicht-Spieler-Charaktere 利用大型语言模型为活跃的商机非玩家字符发挥杠杆作用 2412.11189v3 -
577 05-27 Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science Verbesserung der Forschungsideenerzeugung durch Daten: Eine empirische Untersuchung in der Sozialwissenschaft 《通过数据改进研究概念的产生:社会科学经验调查》 2505.21396v1 -
578 05-27 Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback Align-SLM: Textlose gesprochene Sprachmodelle mit Verstärkungslernen aus KI-Feedback Align-SLM:利用AI反馈强化学习的无文本口语语言模型 2411.01834v2 -
579 05-27 AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs AutoJudger: Ein agentengestütztes Framework für effizientes Benchmarking von MLLMs Autojudger: MLLMs 高效基准设定的代理驱动框架 2505.21389v1 -
580 05-27 VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models VoxEval: Benchmarking des Wissensverständnisses Fähigkeiten von End-to-End gesprochenen Sprachmodellen VoxEval:确定端至端口语语言模式知识理解能力基准 2501.04962v4 -
581 05-27 PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense PHISH in MESH: Koreanische Adversarial Phonetische Substitution und phonetisch-semantische Feature-Integration Verteidigung PHISH in MESH:韩语对抗性语音替换与语音-语义特征融合防御 2505.21380v1 -
582 05-27 Analyzing values about gendered language reform in LLMs’ revisions Analysieren von Werten über die Reform der Geschlechtersprachen in LLM-Revisionen 在LLM的修订中分析关于性别化语言改革的价值观 2505.21378v1 -
583 05-27 Path Pooling: Training-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation Pfad-Pooling: Training-freie Struktur-Verbesserung für effizientes Wissen Graph Retrieval-Augmented Generation 集路道路:为高效知识图检索-启动型一代加强培训-免费结构 2503.05203v2 -
584 05-27 Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History LLM-Anpassung an soziodemographische Faktoren bewerten: Benutzerprofil vs. Dialoggeschichte 评价LLM适应社会人口因素:用户概况与对话历史 2505.21362v1 -
585 05-27 Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning Select2Reason: Effiziente Instruction-Tuning-Datenauswahl für Long-CoT-Reasoning 选择2Reason: 用于长期成本计算理由的高效指令导出数据选择 2505.17266v2 -
586 05-27 Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers Häufigkeitsfragen: Modellierung unregelmäßiger morphologischer Muster auf Spanisch mit Transformern 频率事项:用变换器模拟西班牙文的非正常形态模式 2410.21013v4 -
587 05-27 Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung 利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题 2505.21354v1 -
588 05-27 The Multilingual Divide and Its Impact on Global AI Safety Die mehrsprachige Kluft und ihre Auswirkungen auf die globale KI-Sicherheit 多语言鸿沟及其对全球人工智能安全的影响 2505.21344v1 -
589 05-27 Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts Nutzung großer Sprachmodelle und traditioneller Machine-Learning-Ensembles zur ADHD-Erkennung aus erzählerischen Transkripten 利用大型语言模式和传统机器学习群群,从叙述性记录誊本中探测ADHD 2505.21324v1 -
590 05-27 Interlocking-free Selective Rationalization Through Genetic-based Learning Interlocking-free Selektive Rationalisierung durch gentechnisch-basiertes Lernen 通过基于遗传的学习实现互连、无互闭和无互换的选择性合理化 2412.10312v2 -
591 05-27 Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants Optimierung der fMRI-Datenerfassung für die Dekodierung von Natural Speech mit begrenzten Teilnehmern 优化FMRI数据获取,以便与有限参加者进行自然演讲 2505.21304v1 -
592 05-27 How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian Wie Menschen und LLMs konzeptionelles Wissen organisieren: Untergeordnete Kategorien auf Italienisch erforschen 人类和LLMs如何组织概念知识:探索意大利的次类 2505.21301v1 -
593 05-27 rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset rStar-Coder: Scaling Competitive Code Reasoning mit einem Large-Scale Verifizierten Datensatz rStar-Coder:扩大竞争守则,以大型核实数据集为依据 2505.21297v1 -
594 05-27 OR-Bench: An Over-Refusal Benchmark for Large Language Models OR-Bench: Ein überwiderlegbarer Benchmark für große Sprachmodelle OR-Bench:大语言模式的过度拒绝基准 2405.20947v4 -
595 05-27 Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation Auf dem Weg zur Anpassung von Open Source großen Sprachmodellen für die Erstellung klinischer Notizen auf Expertenebene 努力调整用于专家级临床笔记制作的开放源大语言模型 2405.00715v6 -
596 05-27 MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models MMUnlearner: Reformulierung des multimodalen maschinellen Verlernens im Zeitalter multimodaler großer Sprachmodelle MMUnlearner:在多模态大语言模型时代重新定义多模态机器遗忘 2502.11051v4 -
597 05-27 SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs SoftCoT: Soft Chain-of-Thought für effizientes Reasoning mit LLMs SoftCoT:用于LLM高效推理的软思维链 2502.12134v2 -
598 05-27 Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs Feintuning auf unterschiedlichen aufschlussreichen Ketten treibt die Inferenz CoT-Verfeinerung in LLMs an 对多种有理链条的精细调整 2407.03181v2 -
599 05-27 Multilingual Pretraining for Pixel Language Models Mehrsprachiges Vortraining für Pixel-Sprachenmodelle 多语种像素语言模型的多语种预培训 2505.21265v1 -
600 05-27 SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning SoftCoT++: Testzeitskalierung mit Soft Chain-of-Thought-Reasoning SoftCoT++:基于软思维链推理的测试时扩展 2505.11484v2 -
601 05-27 ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision ReSCORE: Labelfreies iteratives Retriever-Training für Multi-Hop-Fragebeantwortung mit Relevanz-Konsistenz-Überwachung ReSCORE:基于相关性-一致性监督的多跳问答无标签迭代检索器训练 2505.21250v1 -
602 05-27 Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings Bewertung von LLMs in medizinischen Textzusammenfassungen: Die Rolle der Vokabelanpassung in hohen OOV-Einstellungen 医学文本摘要:词汇适应在高OOV环境中的作用 2505.21242v1 -
603 05-27 LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners LMCD: Sprachmodelle sind Nullshot Kognitive Diagnose Lernende LMCD: 语言模型是零光认知诊断学生 2505.21239v1 -
604 05-27 RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations RASMALAI: Ressourcen für adaptive Sprachmodellierung in indischen Sprachen mit Akzenten und Intonationen RASMALAI:带口音和语调的印度语言自适应语音建模资源 2505.18609v2 -
605 05-27 Language Models Surface the Unwritten Code of Science and Society Sprachmodelle legen den ungeschriebenen Kodex von Wissenschaft und Gesellschaft offen 语言模型揭示科学与社会的不成文守则 2505.18942v2 -
606 05-27 GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding GALLa:改进源代码理解的通用大语言模型图 2409.04183v2 -
607 05-27 PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems PSRB: Ein umfassender Benchmark für die Bewertung persischer ASR-Systeme PSRB:波斯ASR系统评价综合基准 2505.21230v1 -
608 05-27 A Representation Level Analysis of NMT Model Robustness to Grammatical Errors Eine Darstellungsebenenanalyse von NMT-Modell Robustheit zu grammatischen Fehlern 对NMT模型模型对表面错误的强度代表级别分析 2505.21224v1 -
609 05-27 Pretrained LLMs Learn Multiple Types of Uncertainty Pretrained LLMs lernen mehrere Arten von Unsicherheit 事先培训的LLMs 学习多种不确定性 2505.21218v1 -
610 05-27 SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment SCIRGC: Multi-Granularitäts-Zitation Empfehlung und Zitation Sentence Preference Alignment SCIRGC: 多岛屿引文建议和引文句次调整 2505.20103v2 -
611 05-27 Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs Universal Reasoner: Ein einfacher, komponierbarer Plug-and-Play-Reasoner für gefrorene LLMs 通用理由:冻结长效LMs的单一、可合成插管和布局理由 2505.19075v2 -
612 05-27 Voting or Consensus? Decision-Making in Multi-Agent Debate Abstimmung oder Konsens? Entscheidungsfindung in Multi-Agent-Debatte 表决还是协商一致?多机构辩论中的决策 2502.19130v2 -
613 05-27 Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM’s Instruction-Following Capabilities Enthüllen von instruction-spezifischen Neuronen & Experten: Ein analytischer Rahmen für die instruction-following Fähigkeiten von LLM 具体未完成的指示性具体神经和专家:LLM教学-执行能力分析框架 2505.21191v1 -
614 05-27 Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation Lunguage: Ein Benchmark für strukturierte und sequentielle Chest-Röntgen-Interpretation Lunguage:结构化和顺序式X射线X射线口译基准 2505.21190v1 -
615 05-27 Exploring the Latent Capacity of LLMs for One-Step Text Generation Erforschung der Latent-Kapazität von LLMs für die einstufige Textgenerierung 探索单步制文本生成LLMs的原始能力 2505.21189v1 -
616 05-27 PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing PoisonSwarm: Universelle Synthese schädlicher Informationen durch Modell-Crowdsourcing PoisonSwarm:通过模型众包实现通用有害信息合成 2505.21184v1 -
617 05-27 Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning Gehen Sie, bevor Sie laufen! Concise LLM Reasoning via Verstärkung Learning 走在跑步前! 通过强化学习解密 LLM 教学 2505.21178v1 -
618 05-27 TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment TAT-R1: Terminologie-Bewusste Übersetzung mit Verstärkungslernen und Wortausrichtung TAT-R1:用强化学习和字词一致来翻译名词-软件 2505.21172v1 -
619 05-27 M-Wanda: Improving One-Shot Pruning for Multilingual LLMs M-Wanda: Bessere One-Shot Pruning für mehrsprachige LLMs M-Wanda:改进多语种LLM的单制环流 2505.21171v1 -
620 05-27 Leveraging GANs for citation intent classification and its impact on citation network analysis Nutzung von GANs für die Klassifizierung von Zitierzielen und deren Auswirkungen auf die Analyse von Zitiernetzwerken 利用GANs利用GANs进行引用意图分类及其对引用网络分析的影响 2505.21162v1 -
621 05-27 Behavioral Analysis of Information Salience in Large Language Models Verhaltensanalyse des Informationsgehalts in großen Sprachmodellen 对大语言模式信息价值的行为分析 2502.14613v2 -
622 05-27 Assessment of L2 Oral Proficiency using Speech Large Language Models Bewertung der oralen Sprachkenntnisse von L2 anhand von sprachgroßen Sprachmodellen 使用语言大语言模式评估L2口语能力 2505.21148v1 -
623 05-27 Adaptive Deep Reasoning: Triggering Deep Thinking When Needed Adaptive Deep Reasoning: Tief denken auslösen, wenn nötig 适应性深层理性:需要时触发深思考 2505.20101v2 -
624 05-27 Hallucinations are inevitable but can be made statistically negligible. The “innate” inevitability of hallucinations cannot explain practical LLM issues Halluzinationen sind unvermeidlich, können aber statistisch vernachlässigbar gemacht werden. Die “angeborene” Unvermeidbarkeit von Halluzinationen kann praktische LLM-Probleme nicht erklären 幻觉不可避免,但可以使其在统计上可忽略不计。幻觉的“内在”不可避免性无法解释实际的LLM问题 2502.12187v2 -
625 05-27 Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Leveraging LLM und selbstüberwachte Trainingsmodelle für die Spracherkennung in chinesischen Dialekten: Eine vergleichende Analyse 利用LLM和中国语语音识别自驾培训模式:比较分析 2505.21138v1 -
626 05-27 Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction Scaling und Prompting für eine verbesserte Korrektur von End-to-End-Spoken-grammatischen Fehlern 缩放和提示改进端至端口语语语法错误校正 2505.21137v1 -
627 05-27 Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v3 -
628 05-27 Creativity in LLM-based Multi-Agent Systems: A Survey Kreativität in LLM-basierten Multi-Agent-Systemen: Eine Umfrage 以LLM为基础的多种机构系统中的创造性:调查 2505.21116v1 -
629 05-27 Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Wird es morgen noch wahr sein? Mehrsprachige Evergreen-Frageklassifikation zur Verbesserung vertrauenswürdiger QA 明天还会是真的吗?用于提升可信问答的多语种常青问题分类 2505.21115v1 -
630 05-27 Does quantization affect models’ performance on long-context tasks? Beeinflusst die Quantisierung die Performance von Modellen bei langen Kontextaufgaben? 量化是否影响模型在长期任务方面的绩效? 2505.20276v2 -
631 05-27 A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction Ein leichtes Multi-Expert Generatives Sprachmodellsystem für Engineering Information and Knowledge Extraction 工程信息和知识采掘轻量多专家生成语言示范系统 2505.21109v1 -
632 05-27 Thinker: Learning to Think Fast and Slow Denker: Schnell und langsam denken lernen 思考者:学会快速和缓慢思考 2505.21097v1 -
633 05-27 BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge BLUCK: Ein Benchmark-Datensatz für Bengalische Sprachkenntnisse und kulturelles Wissen BLUCK:孟加拉语言理解和文化知识基准数据集 2505.21092v1 -
634 05-27 Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) 位置即权力:系统提示作为大语言模型(LLMs)中的偏见机制 2505.21091v1 -
635 05-27 Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch Lösende LLM-Vernunftfähigkeit durch skalierbare Fragesynthese von Scratch 从 Scratch 通过可缩放问题合成解排 LLM 解排功能性LLM 2410.18693v2 -
636 05-27 Predicting Implicit Arguments in Procedural Video Instructions Implizite Argumente in verfahrenstechnischen Video-Anweisungen voraussagen 程序性录像教学中预测隐含的论据 2505.21068v1 -
637 05-27 Plan2Align: Predictive Planning Based Test-Time Preference Alignment for Large Language Models Plan2Align: Predictive Planning Based Test-Time Preference Alignment für große Sprachmodelle 计划2对等:以预测规划为基础的大语言模型试验时间首选比对齐 2502.20795v2 -
638 05-27 Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction Visuelle Hinweise verbessern prädiktives Turn-Taking in der menschlichen Zwei-Parteien-Interaktion 视觉线索增强双人互动中的预测性话轮转换 2505.21043v1 -
639 05-27 How Private are Language Models in Abstractive Summarization? Wie privat sind Sprachmodelle in abstrakter Zusammenfassung? 私人语言模式在抽象总结中如何? 2412.12040v2 -
640 05-27 Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models Debate-to-Detect: Neuformulieren von Fehlinformationserkennung als Real-World-Debatte mit großen Sprachmodellen 辩论至检测:重拟错误信息探测作为有大语言模式的现实世界辩论 2505.18596v2 -
641 05-27 Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models Optimierung des Case-Based-Reasoning-Systems für die Generierung funktionaler Testskripte mit großen Sprachmodellen 为具有大语言模型的功能测试脚本生成优化基于个案的理由说明系统 2503.20576v3 -
642 05-27 Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation Def-DTS: Deduktives Reasoning für die Themensegmentierung von Open-Domain-Dialogen Def-DTS:面向开放域对话主题分割的演绎推理 2505.21033v1 -
643 05-27 Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers Pause Tokens erhöhen streng die Expressivität der konstant-tiefen Transformer 严格提高常数面变换器的表达性 2505.21024v1 -
644 05-27 Can Community Notes Replace Professional Fact-Checkers? Können Community Notes professionelle Fact-Checker ersetzen? 社区说明能否取代专业实况调查人? 2502.14132v2 -
645 05-27 LLMs are Frequency Pattern Learners in Natural Language Inference LLMs sind Frequency Pattern Learners in Natural Language Inferenz LLMs是自然语言推断的频率模式学习者。 2505.21011v1 -
646 05-27 Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods Kompromisse zwischen Ausrichtung und Hilfsbereitschaft in Sprachmodellen mit Lenkungsmethoden 使用指导方法的语文模式的平衡兼顾和利弊取舍 2401.16332v5 -
647 05-27 Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models? Ungewissheit unverhüllt: Kann die Exposition gegenüber mehr In-Kontext-Beispielen Ungewissheit bei großen Sprachmodellen verringern? 不确定性揭示:接触更多上下文示例能减轻大语言模型的不确定性吗? 2505.21003v1 -
648 05-27 RvLLM: LLM Runtime Verification with Domain Knowledge RvLLM: LLM Laufzeitverifizierung mit Domänenwissen RvLLM: LLM 使用域知识运行时间校验 2505.18585v2 -
649 05-27 Articulatory strategy in vowel production as a basis for speaker discrimination Artikulatorische Strategie in der Vokalproduktion als Grundlage für die Diskriminierung von Sprechern 元音制作的交替战略,作为议长歧视的基础 2505.20995v1 -
650 05-27 Who Reasons in the Large Language Models? Wer begründet in den großen Sprachmodellen? 大语言模型中谁的理由? 2505.20993v1 -
651 05-27 LLMs with Industrial Lens: Deciphering the Challenges and Prospects – A Survey LLMs mit Industrieobjektiv: Die Herausforderungen und Aussichten bestimmen – Eine Umfrage 与工业镜头的LLM:挑战与前景的解析 – – 调查 2402.14558v2 -
652 05-27 RefAV: Towards Planning-Centric Scenario Mining RefAV: Auf dem Weg zum planerisch-zentralen Szenario Bergbau RefAV: 走向规划中心情景采矿 2505.20981v1 -
653 05-27 Evaluating and Steering Modality Preferences in Multimodal Large Language Model Bewertung und Steuerung von Modalitätspräferenzen im multimodalen Large Language Model 评估和引导多模态大语言模型中的模态偏好 2505.20977v1 -
654 05-27 Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing Kontrastives Lernen auf LLM Back Generation Treebank für Cross-Domain-Konstituenz Parsing 在LLM 后一代植树库进行反向学习 2505.20976v1 -
655 05-27 Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA Reason-Align-Respond: LLM-Reasoning mit Wissensgraphen für KGQA ausrichten 合理对称:KGQA以知识图表对称LLM 2505.20971v1 -
656 05-27 Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation Personalisierte Abfrage Auto-Completion für langfristige und kurzfristige Interessen mit adaptiver Entgiftung Generation 适应性戒毒一代的长期和短期利益个人自问自动完成 2505.20966v1 -
657 05-27 HalluCounter: Reference-free LLM Hallucination Detection in the Wild! HalluCounter: Referenzfreie LLM-Halluzinationserkennung in freier Wildbahn! HalluCounter:真实场景中的无参考LLM幻觉检测! 2503.04615v2 -
658 05-27 Context-Aware Content Moderation for German Newspaper Comments Context-Aware Content Moderation für die deutsche Zeitung Kommentare 德国报纸评论的背景资料内容调控 2505.20963v1 -
659 05-27 Research Community Perspectives on “Intelligence” and Large Language Models Forschungsgemeinschaftsperspektiven zu “Intelligenz” und großen Sprachmodellen 关于“情报”和大语言模式的社区研究观点 2505.20959v1 -
660 05-27 More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives Mehr ist nicht immer besser? Viel-Shot-In-Context-Lernen mit differenzierten und neugewichtenden Zielen verbessern 越多未必越好?以差异化和重加权目标增强多样本上下文学习 2501.04070v3 -
661 05-27 QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization QwenLong-CPRS: Auf dem Weg zu $\infty$-LLMs mit dynamischer Kontextoptimierung QwenLong-CPRS:通过动态上下文优化迈向 $\infty$-LLMs 2505.18092v2 -
662 05-27 QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning QwenLong-L1: Auf dem Weg zu Long-Context Large Reasoning Models mit Verstärkungslernen QwenLong-L1:通过强化学习迈向长上下文大型推理模型 2505.17667v2 -
663 05-27 Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training Zwei Experten sind alles, was Sie zum Lenken Denken brauchen: Kognitive Bemühungen in MoE-Reasoning-Modellen ohne zusätzliches Training verstärken 两位专家是指导思考所需要的两个专家:在没有额外培训的情况下加强教育部理由说明模式中的认知努力 2505.14681v2 -
664 05-27 Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles Conversational Code Generation: eine Fallstudie zur Konzeption eines Dialogsystems zur Generierung von Fahrszenarien für die Prüfung autonomer Fahrzeuge 相互交流的代码生成:设计一个对话系统,为测试自用车辆创造驱动情景的对话系统案例研究 2410.09829v2 -
665 05-27 On VLMs for Diverse Tasks in Multimodal Meme Classification Auf VLMs für vielfältige Aufgaben in der multimodalen Meme-Klassifikation 关于多式气象分类中多种任务VLMs 2505.20937v1 -
666 05-27 EPIC: Efficient Position-Independent Caching for Serving Large Language Models EPIC: Effizientes positionsunabhängiges Caching für das Servieren großer Sprachmodelle EPIC: 高效的、独立定位的为大语言模式服务的工作 2410.15332v3 -
667 05-27 Information-Theoretic Complementary Prompts for Improved Continual Text Classification Information-Theoretische Ergänzungsprompte für eine verbesserte fortlaufende Textklassifikation 改进持续性文本分类信息理论补充提示 2505.20933v1 -
668 05-27 Multi-objective Large Language Model Alignment with Hierarchical Experts Multi-objektive großsprachige Modellausrichtung mit Hierarchischen Experten 多目标大语言多目标模式,与等级专家相配合 2505.20925v1 -
669 05-27 “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models “Oh LLM, ich frage dich, bitte gib mir einen Entscheidungsbaum”: Nullschnelle Entscheidungsbauminduktion und Einbettung mit großen Sprachmodellen “哦,LLM,我问你,请给我一棵决定树”: “零热决定树上演和嵌入大语言模型” 2409.18594v2 -
670 05-27 Automated Privacy Information Annotation in Large Language Model Interactions Automatisierte Datenschutzerklärung Annotation in Interaktionen mit großen Sprachmodellen 大语言模式互动中自动隐私信息说明 2505.20910v1 -
671 05-27 Towards Objective Fine-tuning: How LLMs’ Prior Knowledge Causes Potential Poor Calibration? Auf dem Weg zu einer objektiven Feinabstimmung: Wie verursacht LLMs’ vorheriges Wissen eine potenzielle schlechte Kalibrierung? 目标微调:LLMS的先前知识原因如何造成潜在的不协调? 2505.20903v1 -
672 05-27 A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models Eine Stereotyp-Inhaltsanalyse zu farbbezogenen sozialen Bias in großen Visions-Sprachmodellen 关于大视觉语言模式中与肤色有关的社会偏见的定型内容分析 2505.20901v1 -
673 05-27 Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing Dub-S2ST: Textlose Sprach-zu-Sprach-Übersetzung für nahtloses Synchronisieren Dub-S2ST: 无缝Dubbing无文本语音翻译 2505.20899v1 -
674 05-27 The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions Die versteckten Dimensionen der LLM-Ausrichtung: Eine mehrdimensionale Analyse der orthogonalen Sicherheitsanweisungen LLM 对齐的隐藏面:对正交安全方向的多维分析 2502.09674v4 -
675 05-27 Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use Loquacious Set: 25.000 Stunden transkribierte und vielfältige englische Spracherkennungsdaten für Forschung und kommerzielle Nutzung 便利的一套:25 000小时被分配和多样化的英语语音识别数据,供研究和商业使用 2505.21578v1 -
676 05-27 Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation Kreuz von links nach rechts Gehirn: Adaptiver Texttraum für Vision-und-Sprachen-Navigation 从左脑到右脑交叉:愿景和语言导航的适应性文本梦想者 2505.20897v1 -
677 05-27 How Do Transformers Learn Variable Binding in Symbolic Programs? Wie lernen Transformer variable Bindungen in Symbolischen Programmen? 变换者如何在符号程序中学习变数绑定 ? 2505.20896v1 -
678 05-27 EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models EasyDistill: Ein umfassendes Toolkit für effektive Wissensdestillation von großen Sprachmodellen 简易蒸馏:大语言模式有效知识蒸馏综合工具箱 2505.20888v1 -
679 05-27 ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention ComplexFormer: Disruptive Verbesserung der Transformer-Inferenzfähigkeit durch kopfspezifische komplexe Vektor-Aufmerksamkeit ComplexFormer:通过头部特定的复向量注意力颠覆性地提升Transformer推理能力 2505.10222v2 -
680 05-27 Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality Power-Law-Decay-Verlust für das Finetuning großer Sprachmodelle: Fokussierung auf Informationssparsität zur Verbesserung der Generierungsqualität 大语言模型微调的幂律衰减损失:关注信息稀疏性以提高生成质量 2505.16900v3 -
681 05-27 Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective Auf dem Weg zur Analyse und dem Verständnis der Grenzen von VAPO: Eine theoretische Perspektive 分析和理解VAPO的局限性:理论视角 2505.17997v2 -
682 05-27 Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning Vergessen im LLM-Fine-Tuning durch Low-Perplexity Token Learning verhindern 减轻LLM 微调调整通过低重复调调调学习的忘却现象 2501.14315v3 -
683 05-27 MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection MSA bei SemEval-2025 Task 3: Hochwertiges schwaches Etikettieren und LLM-Ensemble-Verifikation für Mehrsprachige Halluzinationserkennung SemEval-2025 SMAS 任务3:高品质的差错标签和多种语言幻觉探测的LLM组合核查 2505.20880v1 -
684 05-27 Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties Trans-EnV: Ein Rahmen zur Bewertung der sprachlichen Robustheit von LLMs gegen englische Sorten Trans-EnV: 反英语多样性LLMs语言能力评价框架 2505.20875v1 -
685 05-27 Can LLMs Learn to Map the World from Local Descriptions? Können LLMs lernen, die Welt aus lokalen Beschreibungen zu kartieren? LLMs能够学习用当地描述绘制世界地图吗? 2505.20874v1 -
686 05-27 Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG Divide-Then-Align: Ehrliche Ausrichtung auf Basis der Wissensgrenze der RAG 分离后对齐:基于RAG知识界限的诚实一致 2505.20871v1 -
687 05-27 AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection AmpleHate: Verstärkte Aufmerksamkeit für die Vielseitige Implizite Hate-Erkennung 全面:扩大对易变性隐含仇恨侦测的注意 2505.19528v2 -
688 05-27 Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks Strukturierte Denkfragen: Verbesserung der LLM-Verallgemeinerung bei ursächlichen Folgeaufgaben 结构思考事项:改进因果推断任务中的普遍化 2505.18034v2 -
689 05-27 SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment SAFEPATH: Verhindern schädlichen Reasonings in der Chain-of-Thought durch frühzeitige Ausrichtung SAFEPATH:通过早期对齐防止思维链中的有害推理 2505.14667v2 -
690 05-27 SEPS: A Separability Measure for Robust Unlearning in LLMs SEPS: Ein Separabilitätsmaß für robustes Verlernen in LLMs SEPS:LLM中稳健遗忘的可分离性度量 2505.14832v2 -
691 05-27 DUSK: Do Not Unlearn Shared Knowledge DUSK: Gemeinsames Wissen nicht verlernen DUSK:不要遗忘共享知识 2505.15209v2 -
692 05-27 An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks Eine LLM-as-Judge-Metrik zur Überwindung der Lücke zur menschlichen Bewertung in SE-Aufgaben 用于弥合SE任务中与人工评估差距的LLM-as-Judge指标 2505.20854v1 -
693 05-27 Concealment of Intent: A Game-Theoretic Analysis Concealment of Intent: Eine Game-Theoretische Analyse 隐藏意图:游戏理论分析 2505.20841v1 -
694 05-27 Tuning LLM Judge Design Decisions for 1/1000 of the Cost Tuning von LLM-Judge-Designentscheidungen für 1/1000 der Kosten 以千分之一的成本调优LLM评判器设计决策 2501.17178v4 -
695 05-27 The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents Die Kraft der Persönlichkeit: Eine menschliche Simulationsperspektive zur Untersuchung von Large Language Model Agents 个性力量:从人类模拟角度调查大语言示范物剂 2502.20859v2 -
696 05-27 Enhance Mobile Agents Thinking Process Via Iterative Preference Learning Mobile Agenten durch iteratives Preference-Lernen weiter denken 加强移动媒介思考流程动态动态迭代性优先学习 2505.12299v2 -
697 05-27 Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning 不要半听半听:在连续教学图示中获取关键部分信息 2403.10056v4 -
698 05-27 Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations Inklusive Systematische Bewertungen aktivieren: Einschließlich Preprint-Artikel mit großsprachigen modellgetriebenen Bewertungen 促进包容性的系统审查:将预印条款纳入大语言模式示范评价 2503.13857v3 -
699 05-27 WizardCoder: Empowering Code Large Language Models with Evol-Instruct WizardCoder: Empowering Code Große Sprachmodelle mit Evol-Instruct 巫师编码器:授权使用电动制造器的守则大语言模型 2306.08568v2 -
700 05-27 MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving MA-LoT: Modell-Kollaboration Lean-based Long Chain-of-Thought Reasoning verbessert formalen Theorem Proving MA-LOT:示范-协作:基于精液的探讨理由长期链加强正式理论证明 2503.03205v3 -
701 05-27 R-TOFU: Unlearning in Large Reasoning Models R-TOFU: Unlearning in großen Reasoning-Modellen R-TOFU:大型推理模型中的遗忘学习 2505.15214v2 -
702 05-27 AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset AdParaphrase v2.0: Attraktive Ad-Texte mit einem Präferenz-Annotierten Paraphrase-Datensatz generieren AdParaphrase v2.0:利用偏好标注的释义数据集生成有吸引力的广告文本 2505.20826v1 -
703 05-27 Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation Verstärkte Informativitätsoptimierung für die langformige Retrieval-Augmented Generation 长期回收型后期人种最佳利用强化信息 2505.20825v1 -
704 05-27 Predicting drug-gene relations via analogy tasks with word embeddings Vorhersage von Drogen-Gene-Beziehungen über Analogieaufgaben mit Worteinbettungen 通过用词嵌入词词类比任务预测毒品与基因的关系 2406.00984v5 -
705 05-27 Tracing and Reversing Rank-One Model Edits Rank-One-Modellbearbeitungen verfolgen und umkehren 追踪和校正一等一模式编辑 2505.20819v1 -
706 05-27 HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices HomeBench: Bewertung von LLMs in Smart Homes mit gültigen und ungültigen Anweisungen über einzelne und mehrere Geräte HomeBench:评估智能家居中LLMs在单设备和多设备上处理有效和无效指令的能力 2505.19628v2 -
707 05-27 Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints Semantisches Parsing für große Sprachmodelle neu denken: LLM-Performance mit semantischen Hinweisen verbessern 重新思考大语言模型的语义分解:用语义提示提高LLM性能 2409.14469v2 -
708 05-27 TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent TrojanStego: Ihr Sprachmodell kann geheim ein Steganographic Privacy Leaking Agent sein TrojanStego:您的语言模式可以秘密地隐秘地隐秘地渗漏剂。 2505.20118v2 -
709 05-27 Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective Rethinking Information Synthese in multimodalen Fragen Antwort auf eine multi-agente Perspektive 以多机构视角回答多式联运问题 重新思考信息综述 2505.20816v1 -
710 05-27 Exploring the Necessity of Reasoning in LLM-based Agent Scenarios Erforschung der Notwendigkeit der Vernunft in LLM-basierten Agent-Szenarien 探讨基于LLM代理设想情况中合理理由的必要性 2503.11074v2 -
711 05-27 CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis CulFiT: Ein feinkörniges Kulturbewusstsein LLM-Training Paradigma über Mehrsprachige Kritikdatensynthese CulFIT:通过多种语言的克里端数据综合分析进行精美的有文化意识的LLM培训模型 2505.19484v2 -
712 05-27 Improved Representation Steering for Language Models Verbesserte Repräsentationssteuerung für Sprachmodelle 改进语文模式代表性指导 2505.20809v1 -
713 05-27 Sentiment Reasoning for Healthcare Sentiment Reasoning für die Gesundheitsversorgung 保健的情感理由 2407.21054v4 -
714 05-27 A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models Eine Graphenperspektive zur Untersuchung struktureller Wissensmuster in großen Sprachmodellen 《大语言模式知识结构模式研究图示展望》 2505.19286v2 -
715 05-27 WizardLM: Empowering large pre-trained language models to follow complex instructions WizardLM: Ermächtigen von großen vortrainierten Sprachmodellen, komplexe Anweisungen zu befolgen 巫灵LM:授权大型预先培训的语文模式遵循复杂的指令 2304.12244v3 -
716 05-27 MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability MaskSearch: Ein universelles Pre-Training-Framework, um Agentische Suchfähigkeit zu verbessern 保护面具搜索:加强制剂搜索能力的普遍培训前框架 2505.20285v2 -
717 05-27 SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences SpecExtend: Ein Drop-in-Enhancement für spekulative Decoding von langen Sequenzen 外观:对长期序列的投机性代谢的减少增强 2505.20776v1 -
718 05-27 Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs Achten Sie auf Ihr Po! Messen und Abmildern von KI-Sicherheitsrisiken bei Rollenspielen Feintuning von LLMs 当心你的阿宝! 衡量并减轻AI公司在角色扮演中的安全风险 微调LLMs 2502.20968v2 -
719 05-27 ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools ChemHAS: Hierarchische Agenzien-Stacking zur Verbesserung von Chemiewerkzeugen ChemHAS:加强化学工具的等级代理人 2505.21569v1 -
720 05-27 Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v3 -
721 05-27 Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains Denken Sie leise, denken Sie schnell: Dynamische Latent-Kompression von LLM-vernünftigen Ketten 默默思考,快速思考:LLM 解释性链条的动态延迟压缩 2505.16552v3 -
722 05-27 No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models Kein LLM ist frei von Bias: Eine umfassende Studie der Bias-Bewertung in großen Sprachmodellen 没有LLM能免于偏见:大语言模型偏见评估的全面研究 2503.11985v2 -
723 05-27 Systematic Generalization in Language Models Scales with Information Entropy Systematische Generalisierung in Sprachmodellen Skalen mit Informationsentropie 语言模型中系统化的通用化( 带有信息信封的语言模型缩放) 2505.13089v2 -
724 05-27 Can Small Language Models Learn, Unlearn, and Retain Noise Patterns? Können kleine Sprachmodelle Geräuschmuster lernen, nicht lernen und erhalten? 小语言模型能够学习、不学习和保留噪音模式吗? 2407.00996v3 -
725 05-27 Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator Schalldämpfer: Von der Entdeckung zur Eindämmung von Selbst-Bias im LLM-as-Benchmark-Generator 沉默器:从发现到减少LLM-as-Bunchmark-Generator中的自我比亚 2505.20738v1 -
726 05-27 BQA: Body Language Question Answering Dataset for Video Large Language Models BQA: Körper Sprache Frage-Frage-Beantwortung Datensatz für Video Große Sprachmodelle BQA:视频大语言模型的体语言问题解答数据集 2410.13206v2 -
727 05-27 SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution SPA-RL: Verstärkung der LLM-Agenten durch schrittweise Fortschrittszuweisung SPA-RL:通过逐步推进加强LLM代理 2505.20732v1 -
728 05-27 What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals Was LLMs bei Empfehlungen übersehen: Die Lücke mit Retrieval-Augmented Collaborative Signals überbrücken LLM在推荐中遗漏了什么:用检索增强的协同信号弥合差距 2505.20730v1 -
729 05-27 S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models S1-Bench: Ein einfacher Benchmark für die Bewertung von System 1 Denkfähigkeit von Großmodellen S1-区:评估系统1思考大理由模型的能力的简单基准 2504.10368v3 -
730 05-27 Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection Effiziente und präzise Optimierung: Der Vorteil des Gedächtnisses in exemplargeführter Reflexion 高效和准确的迅速优化:外光引导反射中内存的益处 2411.07446v2 -
731 05-27 Autoregressive Speech Synthesis without Vector Quantization Autoregressive Sprachsynthese ohne Vector Quantization 无矢量量化的自动递减语音合成 2407.08551v2 -
732 05-27 ProgCo: Program Helps Self-Correction of Large Language Models ProgCo: Programm hilft bei der Selbstkorrektur großer Sprachmodelle ProgC:帮助大语言模式自我校正方案 2501.01264v2 -
733 05-27 LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models LatentExplainer: Erklären von latenten Darstellungen in tiefgenerativen Modellen mit multimodalen großen Sprachmodellen 前任Explainer:在多模式大语言模型的深创模型中解释前述表述 2406.14862v6 -
734 05-27 Analyzing Biases in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework Analyse von Biasen im politischen Dialog: Tagging US-Präsidentschaftsdebatten mit einem erweiterten DAMSL-Rahmen 分析政治对话中的偏见:美国总统辩论与扩展的DAMSL框架拖累美国总统辩论 2505.19515v2 -
735 05-27 MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding MUSEG: Verstärktes zeitliches Verständnis von Video über Timestamp-Aware Multi-Segment Erdung MUSEG:通过Timestamp-Aware多部分定位加强视频时间理解 2505.20715v1 -
736 05-27 GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement GigaSpeech 2: Ein sich entwickelnder, großformatiger und multidomänischer ASR-Korpus für ressourcenarme Sprachen mit Automatisiertem Crawling, Transkription und Verfeinerung GigaSpeech2:具有自动拖网、拖网、拖网和精炼功能的低资源语言不断演化、大型和多领域ASR公司 2406.11546v2 -
737 05-27 Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective Physik-Aufklärung in kleinen Sprachmodellen: Eine multidimensionale Analyse aus pädagogischer Perspektive 《小语言模型中的物理原因解剖:从教育角度的多层次分析》 2505.20707v1 -
738 05-27 NeUQI: Near-Optimal Uniform Quantization Parameter Initialization NeUQI: Beinahe-optimale einheitliche Quantisierung Parameter Initialisierung NeUQI: 近最佳统一量化参数初始化 2505.17595v2 -
739 05-27 Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases Zwischen Circuits und Chomsky: Pre-Pretraining auf Formal Languages Imparts Linguistic Biases 巡回巡回和乔姆斯基之间:正式语言语言语言预科培训 2502.19249v2 -
740 05-27 RaDeR: Reasoning-aware Dense Retrieval Models RaDeR: Vernünftige Dense-Retrieval-Modelle RaDER: 合理觉悟常量检索模型 2505.18405v2 -
741 05-27 Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing Erhöhung der Messlatte: Ermittlung der Werte von großen Sprachmodellen durch Generative Evolving-Tests 提高标准:通过生成式演化测试考察大语言模型的价值观 2406.14230v4 -
742 05-27 vCache: Verified Semantic Prompt Caching vCache: Verifizierter semantischer Prompt-Caching vCache: 校验语义快速缓冲 2502.03771v3 -
743 05-27 Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration Beyond Templates: Dynamische Anpassung von Reasoning-Demonstrationen durch Machbarkeits-Bewusst-Exploration 超越模板:通过可行性研究软件探索对说明理由的演示进行动态调整 2505.20700v1 -
744 05-27 Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models Token-Level Akzeptieren oder ablehnen: Ein Micro Alignment-Ansatz für große Sprachmodelle 接受或拒绝时肯级别:大语言模式微调整方法 2505.19743v2 -
745 05-27 Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages Phir Hera Fairy: Ein englisches Märchen ist ein starker Faker der fließenden Rede in Low-Resource indischen Sprachen Phir Hera Fairy:英国仙女是印度低资源语言流利流利的有力名人 2505.20693v1 -
746 05-27 Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions Können wir Debias Social Stereotype in KI-generierten Bildern? Prüfung von Text-to-Image-Ausgaben und Benutzerwahrnehmungen 我们能否在AI-光化图像中贬低社会陈规定型观念?审查文本到图像的产出和用户的看法 2505.20692v1 -
747 05-27 A Survey of LLM $\times$ DATA Eine Übersicht über LLM $\times$ DATA LLM $\times$ DATA 综述 2505.18458v2 -
748 05-27 SELF-PERCEPT: Introspection Improves Large Language Models’ Detection of Multi-Person Mental Manipulation in Conversations SELF-PERCEPT: Introspection verbessert die Erkennung von Multi-Person-Gedankenmanipulation in Gesprächen durch große Sprachmodelle SELF-PERCEPT: 调查改进大语言模型在对话中探测多人心理操纵 2505.20679v1 -
749 05-27 Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples Flow of Reasoning: Schulung von LLMs für divergente Reasoning mit minimalen Beispielen 理由流动:不同理由与最微小例子培训LLM 2406.05673v6 -
750 05-27 Pretraining Language Models to Ponder in Continuous Space Vorschulung von Sprachmodellen im kontinuierlichen Raum 连续空间Ponder语言模型培训前 2505.20674v1 -
751 05-27 Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System Viele Köpfe sind besser als eins: Verbesserte wissenschaftliche Idee-Generation durch ein LLM-basiertes Multi-Agent-System 许多领导人比一个领导人好得多:由以LLM为基础的多种机构系统改进科学思想的一代 2410.09403v4 -
752 05-27 Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders Enthüllen sprachspezifischer Funktionen in großen Sprachmodellen über Sparse Autoencoder 通过 Sparse 自动编译器在大语言模型中未解析特定语言特征 2505.05111v2 -
753 05-27 DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models DRP: Destillierte Reasoning Pruning mit skill-aware Schritt Zersetzung für effiziente große Reasoning Modelle DRP: 以技能认知方式逐步分解高效大型理由解释模型 2505.13975v2 -
754 05-27 An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment Eine eingehende Bewertung großer Sprachmodelle in der Satzvereinfachung mit fehlerbasierter Human Assessment 深入评价以基于错误的人类评估为根据的简化刑期的大语言模式 2403.04963v3 -
755 05-27 Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework Multi-Faceted-Evaluierung lernen: Ein einheitliches und robustes Framework 学习如何调整多面评价:统一和强有力的框架 2502.18874v3 -
756 05-27 Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing Subtile Fehler bei der Begründung: Präferenz-Lernen durch Error-injected Self-editing 理由解释中的字幕错误:通过错误输入自编辑学习偏好 2410.06638v4 -
757 05-27 Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond Auf dem Weg zu LLM Unlearning Resilient to Relearning Attacks: Eine scharfsinnige Minimierungsperspektive und darüber hinaus 走向LLM 学会学会学会学会重新学习攻击的不学习能力:锐化-尽量减少知识的视角及展望 2502.05374v4 -
758 05-27 Shadow-FT: Tuning Instruct via Base Shadow-FT: Tuning Instruct via Base 影子-FT:通过基地的调试指示 2505.12716v2 -
759 05-27 ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning ReMA: Meta-Denken lernen für LLMs mit Multi-Agenten-Verstärkungs-Lernen ReMA:学习多机构强化学习的LLMLM的元思维 2503.09501v3 -
760 05-27 Knowledge Boundary of Large Language Models: A Survey Wissensgrenze von großen Sprachmodellen: Eine Umfrage 大语言模式的知识范围:调查 2412.12472v2 -
761 05-27 How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines Wie können neuronale Netzwerke mit Skalierungsgesetzen hochskaliert werden? Eine Übersicht und praktische Leitlinien 如何利用缩放定律扩展神经网络?综述与实用指南 2502.12051v3 -
762 05-27 Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning Self-Route: Automatische Mode-Umschaltung über Capability-Schätzung für effizientes Reasoning 自操作: 通过能力估计法进行自动模式转换,以高效理由推理 2505.20664v1 -
763 05-27 TeroSeek: An AI-Powered Knowledge Base and Retrieval Generation Platform for Terpenoid Research TeroSeek: Eine KI-Powered Knowledge Base und Plattform zur Retrieval-Generation für Terpenoidforschung TeroSeek: AI-Prepenorids研究知识库和检索生成平台 2505.20663v1 -
764 05-27 TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization TailorKV: Hybrides Framework für lange Kontext-Inferenz durch maßgeschneiderte KV-Cache-Optimierung 定制 KV: 通过定制 KV Cache 优化实现长文本推断的混合框架 2505.19586v2 -
765 05-27 BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism BacktrackAgent: Verbesserung des GUI-Agenten mit Fehlererkennung und Backtracking-Mechanismus 后向跟踪:加强有错误探测和回溯跟踪机制的图形界面代理 2505.20660v1 -
766 05-27 DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs DynamicKV: Task-Aware Adaptive KV Cache-Kompression für LLMs mit langem Kontext DiriveKV: 长期LMS 任务- 软件适应 KV 缓存压缩 2412.14838v4 -
767 05-27 Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v1 -
768 05-27 Chinese Cyberbullying Detection: Dataset, Method, and Validation Chinesische Cyberbully-Erkennung: Datensatz, Methode und Validierung 中国网络欺凌探测:数据集、方法和校验 2505.20654v1 -
769 05-27 Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning Enthüllen der wichtigsten Faktoren für die Destillierung Kette-of-Thought-Reasoning 理据链的理据的理据 2502.18001v3 -
770 05-27 When More is Less: Understanding Chain-of-Thought Length in LLMs Wenn mehr weniger ist: Die Chain-of-Thought-Länge in LLMs verstehen 多即是少:理解LLM中的思维链长度 2502.07266v3 -
771 05-27 FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information Fintagging: Ein LLM-fähiger Benchmark für die Gewinnung und Strukturierung von Finanzinformationen 金融信息抽取和结构安排:LLM已准备就绪的金融信息提取和结构框架基准 2505.20650v1 -
772 05-27 DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization DRPruning: Effiziente großsprachige Modellprüfung durch distributiv robuste Optimierung DRP 运行:通过分布式强力优化实现高效大语言模式 2411.14055v2 -
773 05-27 STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models STEER-BENCH: Benchmark für die Bewertung der Steerability von großen Sprachmodellen STEER-BENCH:评估大语言模型可耐性的基准 2505.20645v1 -
774 05-27 Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations Prompt-basierte LLMs für Position Bias-Aware Reranking in personalisierten Empfehlungen 个人化建议中位置比亚软件重新排位的即时即时全资 2505.04948v2 -
775 05-27 A-MEM: Agentic Memory for LLM Agents A-MEM: Agentischer Speicher für LLM-Agenten A-MEM: LLM 剂的剂内存 2502.12110v8 -
776 05-27 Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation Rethinking MUSHRA: Bewältigung moderner Herausforderungen in der Text-zu-Speech-Bewertung 重新思考MUSHRA:应对文本到语音评价中的现代挑战 2411.12719v3 -
777 05-27 GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration GMoE: Stärkung des Fine-Tunings von LLMs durch MoE Graph Collaboration GMoE:通过MoE图协作增强LLM微调 2412.16216v3 -
778 05-27 STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing STEM-POM: Bewertung von Sprachmodellen Mathe-Symbol-Reasoning in Document Parsing STEM-POM: 评估文档分析中的语言模型数学类比理由 2411.00387v2 -
779 05-27 Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models 以信息引导为导向,对不利于大语言模式的自治歧视大语种模式采取因果干预 2504.12898v3 -
780 05-27 Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing Benchmarking und Pushing der Multi-Bias Elimination Boundary von LLMs über Causal Effect Schätzung-geführte Debiasing 通过因果关系估测-制导偏向性,确定和推动消除长效LLMs的多比消除边界 2505.16522v2 -
781 05-27 Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning Monocle: Hybride lokale und globale In-Context-Evaluierung für die Langtext-Generierung mit unsicherem aktivem Lernen 单项:对具有不确定和积极学习能力的长篇和不确定的代代人进行地方-全球混合文 文 评价 2505.20195v2 -
782 05-27 Test-Time Learning for Large Language Models Test-Time Learning für große Sprachmodelle 大语言模型试验时间学习 2505.20633v1 -
783 05-27 SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis SV-TrustEval-C: Bewertung von Struktur und semantischer Vernunft in großen Sprachmodellen für die Analyse von Quellencode-Anfälligkeiten SV-信任值-C:在源码脆弱性分析大语言模型中评估结构和语义理由 2505.20630v1 -
784 05-27 Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v1 -
785 05-27 POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization POLAR: Benchmark für multilinguale, multikulturelle und multi-eventuelle Online-Polarisierung POLAR: 多种语言、多文化和多种晚上在线极化的基准 2505.20624v1 -
786 05-27 Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen 努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查 2505.14874v2 -
787 05-27 SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation SeqPO-SiMT: Sequentielle Politikoptimierung für die gleichzeitige maschinelle Übersetzung SeqPO-SIMT:同步机器翻译的序列政策优化 2505.20622v1 -
788 05-27 LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers LLM-FE: Automatisiertes Feature Engineering für Tabellendaten mit LLMs als Evolutionsoptimierer LLM-FE: 制表数据的自动地貌工程,LLMM作为进化优化器 2503.14434v2 -
789 05-27 Retrospex: Language Agent Meets Offline Reinforcement Learning Critic Retrospex: Sprachagent trifft Offline-Verstärkung Lernkritik Retrospex: 语言代理 与离线强化学习中心相会 2505.11807v2 -
790 05-27 REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning 实际检索: 数学理由的回收增量精液预言 2505.20613v1 -
791 05-27 Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models Roboflow100-VL: Ein Multi-Domain-Objekterkennungs-Benchmark für Vision-Language-Modelle 机器人流100-VL:愿景-语言模型多功能物体探测基准 2505.20612v1 -
792 05-27 Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings Hierarchische Mamba trifft auf Hyperbolische Geometrie: Ein neues Paradigma für strukturierte Spracheinbettungen 等级式 Mamba 相遇超双曲几何: 结构化语言嵌入的新范式 2505.18973v2 -
793 05-27 Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients Vergleiche zwischen einer großsprachigen, auf Echtzeit-Compound-Diagnostik basierenden medizinischen KI-Schnittstelle und Ärzten für Fälle der gewöhnlichen inneren Medizin mit simulierten Patienten 使用模拟病人的大型语言模型基于实时复合诊断器实时诊断模型的医学AI 接口和使用模拟病人的普通内科病人医生对普通内科病例的比较 2505.20609v1 -
794 05-27 NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human NAP^2: Ein Benchmark für Natürlichkeit und Datenschutz-Erhaltung Text-Rewriting durch Lernen vom Menschen 国家行动纲领第2号: “ 从人类学习 “ 的自然和隐私保护文本改写基准 2406.03749v2 -
795 05-27 Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation Auf dem Weg zur Vorschulung Robustes ASR-Stiftungsmodell mit akustisch-bewusster Datenvergrößerung ASR基金会样板,配有声-声-声-声数据增强数据增强模型 2505.20606v1 -
796 05-27 TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis TCSinger 2: Anpassbare Mehrsprachige Null-Shot-Singen-Stimme-Synthese TCSinger 2:可定制的多语种零弹唱声合成 2505.14910v2 -
797 05-27 Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations Gender and Positional Biases in LLM-based Hiring Entscheidungen: Belege aus vergleichenden CV/Résumé Bewertungen 以LLM为基础的雇用决定中的性别与职位两重情况:比较 CV/摘要评价中的证据 2505.17049v2 -
798 05-26 (1) Effectiveness of Prompt Optimization in NL2SQL Systems Wirksamkeit der Prompt-Optimierung in NL2SQL-Systemen NL2SQL系统迅速优化的效能 2505.20591v1 -
799 05-26 Training a Generally Curious Agent Einen allgemein neugierigen Agenten ausbilden 训练具有普遍好奇心的智能体 2502.17543v3 -
800 05-26 Emotion Classification In-Context in Spanish Emotion Classification In-Context auf Spanisch 西班牙文《情感分类西班牙文内引文》 2505.20571v1 -
801 05-26 The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages Der NaijaVoices-Datensatz: Pflege von großformatigen, qualitativ hochwertigen, kulturell-richschen Sprachdaten für afrikanische Sprachen NaijaVoices数据集:培养非洲语言的大型、高质量、文化-Rich语音数据 2505.20564v1 -
802 05-26 Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning Jenseits von Markovian: Reflektierende Exploration über Bayes-Adaptive RL für LLM-Reasoning 马尔科维安之后:通过Bayes-Adapative RL进行反射勘探,用于LLM 理由分析 2505.20561v1 -
803 05-26 Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text Task-informierte Anti-Kurriculum durch Masken verbessert Downstream-Performance auf Text 通过遮罩改进文字下流业绩,以任务化的反文体 2502.12953v2 -
804 05-26 Predicting Through Generation: Why Generation Is Better for Prediction Vorhersagen durch Generation: Warum Generation besser für Vorhersagen ist 通过一代人预测:为什么一代人更有利于预测 2502.17817v2 -
805 05-26 MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly MMLongBench: Benchmarking von langkontexten Visions-Sprachenmodellen effektiv und gründlich MMLongBench:有效和彻底地确定长长、长、长、长、远景-语言模式的基准 2505.10610v2 -
806 05-26 From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning Von Tokens zu Gedanken: Wie LLMs und Menschen Kompression für Bedeutung traden 从Tokens到思想:LLM和人类如何用贸易压缩来达到意义 2505.17117v2 -
807 05-26 Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models Skalierung über Skalierung: Untersuchung von Test-Zeit-Skalierung Pareto in großen vernünftigen Modellen 缩放过缩放: 探索大型理由模型中的测试时间缩放派 2505.20522v1 -
808 05-26 Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting Projekt Riley: Multimodaler Multi-Agent LLM Zusammenarbeit mit emotionaler Vernunft und Abstimmung 莱利项目:与情感原因和投票合作 2505.20521v1 -
809 05-26 Aggregation Artifacts in Subjective Tasks Collapse Large Language Models’ Posteriors Aggregation Artefakte in subjektiven Aufgaben Zusammenklappen der Poster von großen Sprachmodellen 在主观任务中聚合个体行为 折叠大语言模型的别墅 2410.13776v4 -
810 05-26 Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects Multimodale Emotionserkennung in Gesprächen: Eine Übersicht über Methoden, Trends, Herausforderungen und Perspektiven 在对话中多时的情感认识:对方法、趋势、挑战和前景的调查 2505.20511v1 -
811 05-26 ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis ArVoice: Ein Multi-Sprecher-Datensatz für die arabische Sprachsynthese ArVoice:用于阿拉伯语语音合成的多发言者数据集 2505.20506v1 -
812 05-26 Large Language Models for IT Automation Tasks: Are We There Yet? Große Sprachmodelle für IT-Automatisierungsaufgaben: Sind wir schon so weit? 信息技术自动化任务大语言模型:我们做到了吗? 2505.20505v1 -
813 05-26 Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review Verkörperte KI mit Basismodellen für mobile Serviceroboter: Eine systematische Übersicht 面向移动服务机器人的基础模型具身智能:系统综述 2505.20503v1 -
814 05-26 Gatsby Without the ‘E’: Crafting Lipograms with LLMs Gatsby ohne das ‘E’: Lipogramme mit LLMs herstellen Gatsby没有“E”:用LLMs制作乳胶 2505.20501v1 -
815 05-26 Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism Beyond Keywords: Bewertung großsprachiger Modellklassifikation von Nuanced Ableism 超越关键词:评价大语言多语言可变性模式分类 2505.20500v1 -
816 05-26 Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification Erklären Sie: Evidenz-getriebene Vorhersagen für erklärbare Drogenziel-Identifikation 寻求解释:对可解释药物目标识别的由证据驱动的预测 2402.04068v4 -
817 05-26 CLEVRER-Humans: Describing Physical and Causal Events the Human Way CLEVRER-Mensch: Physikalische und kausale Ereignisse auf menschliche Weise beschreiben CLEVRER-人类:将自然和因果事件描述为人类道路 2310.03635v2 -
818 05-26 Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages Inceptive Transformers: Erweitern von kontextuellen Darstellungen durch Multi-Scale-Feature-Lernen über Domains und Sprachen hinweg 感动变异器:通过跨领域和跨语言的多阶段专题学习,加强背景代表方式 2505.20496v1 -
819 05-26 InFact: Informativeness Alignment for Improved LLM Factuality InFact: Informatives Alignment für verbesserte LLM-Faktizität 事实:改进LLM事实质量的信息协调 2505.20487v1 -
820 05-26 The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph Das Beste aus beiden Welten: Qualität und Vielfalt bei der Datenauswahl mit zweiteiligem Graphen überbrücken 《最佳世界和最佳世界:在数据选择中将质量和多样性与双部分图联系起来》 2410.12458v2 -
821 05-26 Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding Gesprächs-Kernel: Ein flexibler Mechanismus, um relevante Kontexte für Online-Konversations-Verständnis zu lernen 对话核心:学习在线对话理解相关背景的灵活机制 2505.20482v1 -
822 05-26 BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics BrainStratify: Grob-zu-Fein-Entwechslung von intrakranieller Neuraldynamik 大脑分解: 神经内神经动力学的粗向法解析 2505.20480v1 -
823 05-26 Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought Bias-Augmented Consistency Training reduziert biased Reasoning in Chain-of-Thought 避免和强化的一致培训减少在寻求的连锁努力中造成不利和 不利理由 2403.05518v2 -
824 05-26 Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models Gestern Nachrichten: Benchmarking Multi-Dimensional Out-of-Distribution Verallgemeinerung von Misinformation Detection Modelle 昨天的新闻:对错误信息探测模型的多种不同传播通用进行基准衡量 2410.18122v2 -
825 05-26 The Impact of a Chatbot’s Ephemerality-Framing on Self-Disclosure Perceptions Der Einfluss des Ephemerality-Framing eines Chatbots auf die Wahrnehmung der Selbstoffenbarung 查塔博特人的即时态度对自我披露感知的影响 2505.20464v1 -
826 05-26 Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection Skalierungsgesetze für das Vergessen beim Finetuning mit Vorschulungs-Dateninjektion 调整前数据输入时遗忘法律的扩大范围 2502.06042v2 -
827 05-26 Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries Amulett: Komplexe Multi-Turn-Gespräche mit LLM Jurys auf dem Stand Anulet: 将复杂多发多发对话与LLM Juries 挂起立 2505.20451v1 -
828 05-26 HAMburger: Accelerating LLM Inference via Token Smashing HAMburger: Beschleunigung der LLM-Inferenz durch Token Smashing HAMburger:通过Token打碎加速LLM推理 2505.20438v1 -
829 05-26 Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages? Hilft Synthetische Daten bei der Nennung der Entitätserkennung für Sprachen mit geringer Ressource? 合成数据是否有助于为低资源语言命名实体识别? 2505.16814v2 -
830 05-26 The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project The UD-NewsCrawl Treebank: Reflexionen und Herausforderungen aus einem groß angelegten Tagalog Syntactic Annotation Project UD-News-Crawcrow Treebank:大型Tagalog聚合笔记项目反思和挑战 2505.20428v1 -
831 05-26 SEMMA: A Semantic Aware Knowledge Graph Foundation Model SEMMA: Ein semantisches Wissensdiagramm-Stiftungsmodell SEMMA: 语义认知知识图基础模型 2505.20422v1 -
832 05-26 GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation GraphGen: Verbessertes Supervised Fine-Tuning für LLMs mit wissensgetriebener Synthetischer Datengenerierung 图图Gen:加强具有知识驱动合成合成数据生成的LMLMs的监管微调 2505.20416v1 -
833 05-26 Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision Verbesserung der logischen Vernunft in Sprachmodellen durch symbolisch geführte Monte-Carlo-Prozessüberwachung 通过有符号指导的蒙特卡洛进程监督,加强语文模式的逻辑理由解释 2505.20415v1 -
834 05-26 SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents SWE-Rebench: Eine automatisierte Pipeline für die Task Collection und die dekontaminierte Evaluation von Software Engineering Agents SWE-rebench:软件工程剂任务收集和除污评价自动管道 2505.20411v1 -
835 05-26 What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models Was hat sich geändert? Instruktionsgeführte Bildbearbeitungen mit multimodalen großen Sprachmodellen erkennen und bewerten 发生了什么变化?利用多模态大语言模型检测和评估指令引导的图像编辑 2505.20405v1 -
836 05-26 MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding MangaVQA und MangaLMM: Ein Benchmark und Spezialmodell für multimodales Manga-Verständnis MangaVQA和MangaLMM:多模式漫画理解基准和专门模式 2505.20298v1 -
837 05-26 DiSA: Diffusion Step Annealing in Autoregressive Image Generation DiSA: Diffusionsschritt Annealing in autoregressiver Bildgenerierung DiSA: 自动递减图像生成中的传播步骤 2505.20297v1 -
838 05-26 Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? Selbstreflektierende Unsicherheiten: Kennen LLMs ihre interne Antwortverteilung? 自我反感的不确定性:LLMs知道他们的内部答案分布吗? 2505.20295v1 -
839 05-26 Reasoning LLMs are Wandering Solution Explorers Grundlegende LLMs sind wandernde Lösungs-Explorer 理据LLMs是游荡的解决方案探索者 2505.20296v1 -
840 05-26 Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery Verbesserung der Verständlichkeit von Texterklärungen durch unüberwachte Concept Discovery 通过未受监督的概念发现提高通过不受监督的概念解释的可理解性 2505.20293v1 -
841 05-26 Visualized Text-to-Image Retrieval Visualisierung von Text-zu-Bild-Retrieval 可视化文本到图像检索 2505.20291v1 -
842 05-26 Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding Time-R1: Nach dem Training Großer Vision-Sprachenmodell für die zeitliche Videoerdung 时间-R1:培训后用于实时视频定位的大型视觉语言模型 2503.13377v2 -
843 05-26 VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction VLM-3R: Vision-Language-Modelle erweitert mit instruction-aligned 3D reconstruction VLM-3R:通过指示统一3D重建增强的愿景-语言模型 2505.20279v1 -
844 05-26 The Coverage Principle: A Framework for Understanding Compositional Generalization Das Coverage-Prinzip: Ein Rahmen für das Verständnis der kompositorischen Verallgemeinerung 覆盖范围原则:理解普遍组成框架 2505.20278v1 -
845 05-26 OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction OmniCharacter: Auf dem Weg zu immersiven Rollenspiel-Agenten mit nahtloser Sprach-Persönlichkeits-Interaktion OmniCharacter:争取用无缝无言语-语言个性交互作用来模拟角色扮演剂 2505.20277v1 -
846 05-26 Bias and Volatility: A Statistical Framework for Evaluating Large Language Model’s Stereotypes and the Associated Generation Inconsistency Bias and Volatility: Ein statistischer Rahmen für die Bewertung der Stereotypen und der damit verbundenen Inkonsistenz der Generation 偏见和不稳定:评价大语言模式定型观念和关联一代人不一致的统计框架 2402.15481v5 -
847 05-26 Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs Fann oder Flop: Ein Multigenre, Multiera Benchmark für arabische Poesie in LLMs Fann 或 Flop: 多种语言、阿拉伯语诗类理解多元基准 2505.18152v2 -
848 05-26 Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models Durch Täuschung sehen: Irreführende Schöpfer-Intent in multimodalen Nachrichten mit Vision-Sprache-Modellen entdecken 通过欺骗观察:以视觉语言模式在多模式新闻中揭开错误领导创造者意图的隐蔽 2505.15489v2 -
849 05-26 We Need to Measure Data Diversity in NLP – Better and Broader Wir müssen die Datenvielfalt in NLP messen – besser und breiter 我们需要在《国家劳工政策》中衡量数据多样性 – – 更好和更广泛 2505.20264v1 -
850 05-26 Lifelong Safety Alignment for Language Models Lebenslange Sicherheitsausrichtung für Sprachmodelle 语言模型终身安全比对 2505.20259v1 -
851 05-26 On the Compatibility of Generative AI and Generative Linguistics Über die Vereinbarkeit generativer KI und generativer Linguistik 关于 “ 创造性语言 “ 和 “ 创造性语言 “ 的兼容性 2411.10533v2 -
852 05-26 ARM: Adaptive Reasoning Model ARM: Anpassungsorientiertes Modell ARM:适应性理由说明模式 2505.20258v1 -
853 05-26 The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language Der Faetar-Benchmark: Spracherkennung in einer sehr unterbesetzten Sprache Faetar基准:以资源非常不足的语言进行语音承认 2409.08103v4 -
854 05-26 Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs Position: Mechanische Dolmetschbarkeit sollte Feature-Konsistenz in SAEs priorisieren 位置: 机械可解释性:应优先考虑高级专业环境评估中的地物一致性 2505.20254v1 -
855 05-26 Learning Extrapolative Sequence Transformations from Markov Chains Extrapolative Sequenztransformationen von Markov-Ketten lernen 来自Markov 链条的学习外推序列变换 2505.20251v1 -
856 05-26 Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models Beyond the Tip of Efficiency: Enthüllen der untergetauchten Bedrohungen von Jailbreak Attacken in kleinen Sprachmodellen 超越 “ 效率之便 “ :以小语言模式破狱袭击的潜伏威胁 2502.19883v3 -
857 05-26 WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models WXImpactBench: Ein disruptives Wetter-Impact-Verständnis Benchmark für die Bewertung großer Sprachmodelle WXImpact Bennech:评估大语言模型的干扰天气影响理解基准 2505.20249v1 -
858 05-26 KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing KnowTrace: Bootstrapping iterative Retrieval-Augmented Generation mit strukturierter Wissensverfolgung KnowTrace: 与结构化知识追踪相配套的 刺激性迭代回收- 启动型生成 2505.20245v1 -
859 05-26 On Path to Multimodal Historical Reasoning: HistBench and HistAgent Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v1 -
860 05-26 It’s High Time: A Survey of Temporal Information Retrieval and Question Answering Es ist höchste Zeit: Eine Umfrage der zeitlichen Informationen Retrieval und Fragen beantworten 《高时:时间信息检索和回答问题调查》 2505.20243v1 -
861 05-26 Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models Vielfältig, nicht kurz: Ein längengesteuerter Selbstlernrahmen zur Verbesserung der Antwortvielfalt von Sprachmodellen 多样性,不是短的:提高语文模式应对多样性的长期控制自学框架 2505.16245v2 -
862 05-26 MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation MMLU-ProX: Mehrsprachiger Benchmark für eine erweiterte Bewertung von großen Sprachmodellen MMLU-ProX:高级大语言模式评价多语种基准 2503.10497v2 -
863 05-26 RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning RAGEN: Selbst-Evolution in LLM-Agenten durch Multi-Turn-Verstärkungs-Lernen verstehen 通过多阶段强化学习了解LLM代理商的自我演变 2504.20073v2 -
864 05-26 BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving BPP-Suche: Verbesserung des Baumes der Gedanken Grund für mathematische Modellierung Problem Lösung BPP-Search:为数学建模问题解决加强思考理由树 2411.17404v4 -
865 05-26 Efficient Speech Translation through Model Compression and Knowledge Distillation Effiziente Sprachübersetzung durch Modellkompression und Wissensdestillation 通过模型压缩和知识蒸馏高效语音翻译 2505.20237v1 -
866 05-26 Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue Überbrückung der langfristigen Lücke: Eine memory-aktive Politik für den aufgabenorientierten Dialog mit mehreren Sessions 缩小长期差距:多会议着重任务的对话的记忆 - 积极政策 2505.20231v1 -
867 05-26 FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models FLAME-MoE: Eine transparente End-to-End-Forschungsplattform für Mixture-of-Experts-Sprachmodelle FLAME-MOE:混合专家语言模型透明端对端研究平台 2505.20225v1 -
868 05-26 Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction Rollen Sie die Würfel & Blick, bevor Sie springen: Gehen über die kreativen Grenzen der Next-Token-Vorhersage 跳跃前的骰子滚动和看一看:超越了次声预测的创造性极限 2504.15266v2 -
869 05-26 Dependency Parsing is More Parameter-Efficient with Normalization Abhängigkeit Parsing ist mehr Parameter-Effizient mit Normalisierung 依赖性剖析的参数比正常化的参数要高 2505.20215v1 -
870 05-26 How to Improve the Robustness of Closed-Source Models on NLI Wie man die Robustheit von Closed-Source-Modellen auf NLI verbessert 如何改进封闭源码模式在非国家借贷方面的有效性 2505.20209v1 -
871 05-26 Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking Adaptive Klassifikator-freie Führung über Dynamisches Low-Confidence-Masking 通过动态低信任面罩提供适应性分类无限制指导 2505.20199v1 -
872 05-26 CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts CodeTaxo: Erweiterung der Taxonomie mit begrenzten Beispielen über Code Language Prompts 代码塔克斯:通过代码语言提示,以有限实例加强分类法的扩展 2408.09070v2 -
873 05-26 SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs SHARP: Entsperren der interaktiven Halluzination durch Stance-Transfer in Rollenspiel-LLMs SHARP:通过在角色扮演中转移角色来解锁互动幻觉 2411.07965v4 -
874 05-26 THiNK: Can Large Language Models Think-aloud? THiNK: Können große Sprachmodelle denken? 大语言模型能思考吗? 2505.20184v1 -
875 05-26 “KAN you hear me?” Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding “KAN hörst du mich?” Kolmogorov-Arnold-Netzwerke für gesprochenes Sprachverständnis erkunden 探索科尔莫戈洛夫-阿诺尔德语言理解网络 2505.20176v1 -
876 05-26 From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data Von der Ausrichtung zur Weiterentwicklung: Bootstrapping Audio-Language Alignment mit synthetischen Daten 从对齐到推进: 用合成数据推动音频语言对齐 2505.20166v1 -
877 05-26 Visual Abstract Thinking Empowers Multimodal Reasoning Visuelles Abstraktes Denken macht multimodale Vernunft 视觉抽象思考赋予多模式理由 2505.20164v1 -
878 05-26 Exploring Generative Error Correction for Dysarthric Speech Recognition Erforschung der Generativen Fehlerkorrektur bei der Erkennung von Dysarthric Speech 探索为承认沙皇演说识别而产生错误校正的探索 2505.20163v1 -
879 05-26 Capability-Based Scaling Laws for LLM Red-Teaming Capability-Based Scaling-Gesetze für LLM Red-Teaming LLM 红色团队合作以能力为基础的增强法律 2505.20162v1 -
880 05-26 Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning Prismatische Synthese: Gradientenbasierte Datendiversifizierung steigert Generalisierung in LLM-Reasoning 理论综合:基于逐步的数据多样化促进LLM理由说明的概括化 2505.20161v1 -
881 05-26 Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up Reversal of Thought: Erweiterung von großen Sprachmodellen mit präference-guided Reverse Reasoning Warm-up 思想的逆转:加强大语言模式,以优惠、有引导的反反反向理由暖化 2410.12323v3 -
882 05-26 Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs Pangu Light: Gewichtswiederinitialisierung für das Beschneiden und Beschleunigen von LLMs Pangu Light:用于剪枝和加速LLM的权重重新初始化 2505.20155v1 -
883 05-26 UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models UORA: Einheitliche Orthogonale Reinitialisierungsanpassung im Parameter-Effizient Feintuning großer Modelle UORA:大型模型参数-有效精美设计中统一的正正正重新初始化适应 2505.20154v1 -
884 05-26 Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities Gedachte politische Optimierung: Überwindung externer Leitlinien und interner Fähigkeiten 优化政策:将外部指导和内部能力结合起来 2505.15692v2 -
885 05-26 Polynomial, trigonometric, and tropical activations Polynomische, trigonometrische und tropische Aktivierungen 多边、三角和热带活性 2502.01247v2 -
886 05-26 Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models Hart negatives Kontrastives Lernen für feinkörniges geometrisches Verständnis in großen multimodalen Modellen 大型多模式模型中精细几何理解的硬反向硬学习 2505.20152v1 -
887 05-26 RESTOR: Knowledge Recovery in Machine Unlearning RESTOR: Wissensrückgewinnung beim maschinellen Verlernen RESTOR:机器遗忘中的知识恢复 2411.00204v3 -
888 05-26 SeMe: Training-Free Language Model Merging via Semantic Alignment SeMe: Training-freies Sprachmodell Zusammenführen über semantische Ausrichtung SeME:通过语义一致合并的无培训语言模式 2505.20144v1 -
889 05-26 GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models GUARD: Rollenspiel zur Generierung von Jailbreakings in natürlicher Sprache zur Prüfung der Einhaltung der Leitlinie für große Sprachmodelle GUARD: 利用《大语言模式遵守试验准则准则》创造以自然语言破门破门 2402.03299v5 -
890 05-26 StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs StructEval: Benchmarking der Kapazitäten von LLM zur Erzeugung struktureller Outputs StructEval:将LLMs的能力与产生结构性产出挂钩 2505.20139v1 -
891 05-26 P$^2$ Law: Scaling Law for Post-Training After Model Pruning P$^2$ Gesetz: Skalierungsgesetz für Post-Training nach Modell-Pruning P$^2$ 定律:模型剪枝后再训练的缩放定律 2411.10272v3 -
892 05-26 AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings AweDist: Aufmerksamkeitsbewusste Einbettung Destillation für neue Eingabe-Token-Einbettungen AweDist: 新的输入式嵌入式嵌入器的注意嵌入蒸馏 2505.20133v1 -
893 05-26 Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers Iterative Selbstanreizung macht große Sprachmodelle als Agent-Sucher aus 迭代自我激励激励增强大语言模型作为代理搜索者的能力 2505.20128v1 -
894 05-26 PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks PandaGuard: Systematische Bewertung der LLM-Sicherheit gegen Jailbreaking-Angriffe PandaGuard:系统评估防止侵入监狱袭击的LLM安全性 2505.13862v3 -
895 05-26 Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models Retrieval Modelle sind nicht Tool-Savvy: Benchmarking Tool Retrieval für große Sprachmodelle 检索模型不是工具保存工具:大语言模型基准工具检索工具 2503.01763v2 -
896 05-26 Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings Crabs: Ressourcenverbrauch über Auto-Generation für LLM-DoS-Angriff unter Black-Box-Einstellungen Crabs: 在黑盒设置下通过LLM-DoS攻击的自动生成来消耗资源 2412.13879v4 -
897 05-26 Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi’s Zibaldone Genannte Entity Recognition in Historic Italian: Der Fall von Giacomo Leopardis Zibaldone 在历史上意大利文中命名实体识别:Giacomo Leopardi的Zibaldone案 2505.20113v1 -
898 05-26 ResSVD: Residual Compensated SVD for Large Language Model Compression ResSVD: Residual Compensated SVD für großsprachliche Modellkompression ResSVD: 大语言模型压缩剩余补偿SVD 2505.20112v1 -
899 05-26 Language-Agnostic Suicidal Risk Detection Using Large Language Models Sprach-agnostische Suizidrisikoerkennung mit großen Sprachmodellen 使用大语言模型进行语言不可知的自杀风险探测 2505.20109v1 -
900 05-26 Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities Große Sprachmodelle treffen auf Wissensgraphen für Fragenbeantwortung: Synthese und Chancen 大语言模式满足回答问题的知识图表:综合与机遇 2505.20099v1 -
901 05-26 S2LPP: Small-to-Large Prompt Prediction across LLMs S2LPP: Kleine bis große Vorhersagen über LLMs S2LPP: 小到大迅速预测 2505.20097v1 -
902 05-26 MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning MA-RAG: Multi-Agent Retrieval-Augmented Generation über kollaborative Chain-of-Thought-Reasoning MA-RAG:通过协作研究链解释理由实现多权获取-提款人一代 2505.20096v1 -
903 05-26 Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models Sicherheit durch Vernunft: Eine empirische Studie zu vernünftigen Guardrail-Modellen 安全理由:对护卫车模型说明理由的经验研究 2505.20087v1 -
904 05-26 Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models Enthüllen der Intrinsischen Ethischen Verletzlichkeit von ausgerichteten großen Sprachmodellen 揭示统一大语言模式内在道德脆弱性 2504.05050v3 -
905 05-26 SAEs Are Good for Steering – If You Select the Right Features SAEs sind gut für das Lenken – wenn Sie die richtigen Funktionen auswählen SAEs 有利于指导 – – 如果您选择了正确的特性 2505.20063v1 -
906 05-26 “Alexa, can you forget me?” Machine Unlearning Benchmark in Spoken Language Understanding „Alexa, kannst du mich vergessen?” Machine Unlearning Benchmark in Spoken Language Understanding “亚历克斯,你能忘记我吗?” 2505.15700v2 -
907 05-26 Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion Multimodale LLM-geführte semantische Korrektur in Text-zu-Bild-Diffusion 文字到图像传播中多式LLM-指导的语义校正 2505.20053v1 -
908 05-26 MVP: Multi-source Voice Pathology detection MVP: Multi-Source Sprachpathologie-Erkennung MVP:多源语音病理检测 2505.20050v1 -
909 05-26 Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks Grammatik der formalen Unsicherheit: Wann man LLMs bei automatisierten Aufgaben zur Begründung vertraut 正式不确定性的语法:在自动说明理由任务中何时信任LLMs 2505.20047v1 -
910 05-26 Bemba Speech Translation: Exploring a Low-Resource African Language Bemba Speech Translation: Erforschen einer ressourcenarmen afrikanischen Sprache 本巴语言翻译:探索非洲低资源语言 2505.02518v2 -
911 05-26 REARANK: Reasoning Re-ranking Agent via Reinforcement Learning REARANK: Reasoning Re-Ranking Agent über Verstärkungs-Lernen REARANK: 通过加强学习,为重新升级的代理提供理由 2505.20046v1 -
912 05-26 Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs Unsichere Aufmerksamkeitsköpfe: Effiziente Unüberwachte Unsichere Quantifizierung für LLMs 确定性 – – 警告注意头头:对LLMs进行高效率的、无监督的、不确定性的量化 2505.20045v1 -
913 05-26 The More Similar, the Better? Associations between Latent Semantic Similarity and Emotional Experiences Differ across Conversation Contexts Je ähnlicher, desto besser? Assoziationen zwischen latenter semantischer Ähnlichkeit und emotionaler Erfahrung unterscheiden sich über Gesprächskontexte 越相似越好?潜在语义相似性与情感体验之间的关联因对话情境而异 2309.12646v3 -
914 05-26 Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation Enthüllen der Macht der Quelle: Quelle-basierte Minimum Bayes Risiko-Dekodierung für neurale maschinelle Übersetzung 资料来源:基于源的神经机器翻译最低贝ys风险代号。 2406.11632v5 -
915 05-26 Multi-modal brain encoding models for multi-modal stimuli Multimodale Gehirnkodierungsmodelle für multimodale Reize 多模式刺激多模式大脑编码模型 2505.20027v1 -
916 05-26 A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? Eine Umfrage über die Sicherheitsbedrohungen von Computer-Verwendern: JARVIS oder Ultron? 调查计算机用户的安全和安保威胁:JARVIS还是ULTRON? 2505.10924v2 -
917 05-26 A Survey of LLM-based Agents in Medicine: How far are we from Baymax? Eine Umfrage von LLM-basierten Medikamenten in der Medizin: Wie weit sind wir von Baymax entfernt? 对医学中以LLM为主的药剂的调查:我们离Baymax有多远? 2502.11211v2 -
918 05-26 Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking Training von LLM-basierten Agenten mit synthetischen selbstreflektierten Trajektorien und partieller Maske 具有合成自我反射轨迹和部分遮罩的以LLM为基础的代理人员培训 2505.20023v1 -
919 05-26 TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation TTPA: Token-Level Tool-use Preference Alignment Training Framework mit feinkörniger Bewertung TTPA: 采用精细评价法的全方位工具使用优先调整培训框架 2505.20016v1 -
920 05-26 On the class of coding optimality of human languages and the origins of Zipf’s law Über die Klasse der Kodierung der optimalen menschlichen Sprachen und die Ursprünge des Zipfschen Gesetzes 在人类语言最优化的编码和齐普夫法律的起源方面 2505.20015v1 -
921 05-26 Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation Ist Rationale Qualität Materie? Verbesserung der psychischen Störung Detektion durch selektive Begründung Destillation 理由质量是否重要? 通过选择性理由蒸馏加强精神失常检测 2505.20014v1 -
922 05-26 WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback WebCoT: Web-Agenten verbessern Begründung durch Rekonstruieren Kette-von-Gedanken in Reflexion, Verzweigung und Rollback WebCoT:通过在反射、分流和回滚中重新构建研究链,加强网络代理理由 2505.20013v1 -
923 05-26 ProcessBench: Identifying Process Errors in Mathematical Reasoning ProcessBench: Identifizierung von Prozessfehlern in mathematischer Reasoning 进程快节: 识别数学原因中的进程错误 2412.06559v4 -
924 05-26 Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition Mischung von LoRA-Experten für die automatische Spracherkennung mit geringem Ressourcenbedarf LoRA 低资源多中心自动语音识别专家混合 2505.20006v1 -
925 05-26 Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents Unvollkommenheit: Simulieren von Studenten mit unterschiedlichen kognitiven Ebenen mit LLM-basierten Agenten 普及缺陷:利用基于LLM的代理物模拟具有不同认知水平的学生 2505.19997v1 -
926 05-26 How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation Wie gut übersetzen große Begründungsmodelle? Eine umfassende Bewertung für Multi-Domain maschinelle Übersetzung 大理由模型如何翻译?多功能机器翻译的全面评价 2505.19987v1 -
927 05-26 What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs Was bedeutet Neuro für Cardio? Untersuchung der Rolle klinischer Spezialdaten in medizinischen LLMs ” 神经中度 “ 与 “ 心脏病 “ 有何关系? 调查临床特殊数据在医疗长效管中的作用 2505.10113v2 -
928 05-26 DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset DeepDialogue: Ein multi-Turn emotional-Rich gesprochener Dialog Datensatz 深对话:多发情感- Rich 口语对话框数据集 2505.19978v1 -
929 05-26 Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language Conversational Lexicography: Abfrage Lexicographic Data on Knowledge Graphs mit SPARQL durch natürliche Sprache 通过自然语言查询与SPARQL通过自然语言的 SPARQL 知识图的文献资料 2505.19971v1 -
930 05-26 CP-Router: An Uncertainty-Aware Router Between LLM and LRM CP-Router: Ein unsicherer Router zwischen LLM und LRM CP-Router:LLM和LRM之间的不确定软件路由器 2505.19970v1 -
931 05-26 The Limits of Preference Data for Post-Training Die Grenzen der Präferenzdaten für das Post-Training 培训后优先数据限值 2505.19964v1 -
932 05-26 Explanatory Summarization with Discourse-Driven Planning Erklärende Zusammenfassung mit diskursgetriebener Planung 与 “ 分流规划 “ 结合的解释性总结 2504.19339v3 -
933 05-26 MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen MiniLongBench:大语言模式低成本长方背景理解基准 2505.19959v1 -
934 05-26 DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph DCG-SQL: Verbesserung des In-Context-Lernens für Text-zu-SQL mit Deep Contextual Schema Link Graph DCG-SQL:加强内文学习,以便用深背景图示链接图进行文字到SQL的内文学习 2505.19956v1 -
935 05-26 MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research MLR-Bench: Bewertung von KI-Agenten auf Open-Ended Machine Learning Research MLR-Bench:评估AI公司在开放式机械学习研究方面的代理机构 2505.19955v1 -
936 05-26 An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning Ein erklärbares Diagnose-Framework für neurodegenerative Dementias durch Verstärkungsoptimierte LLM-Reasoning 通过强化-优化LLM解释性理疗理由的神经医学性痴呆症可解释的诊断框架 2505.19954v1 -
937 05-26 Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation Weniger für mehr: Verbesserte feedbackorientierte gemischte LLMs für die Erzeugung von Molekülen und eine feinkörnige NLI-Bewertung 减少更多:加强用于分子制导和精细国家低排放指数评价的反馈-调整混合混合LLM(MMLM) 2405.13984v3 -
938 05-26 Can Visual Encoder Learn to See Arrows? Kann Visual Encoder lernen, Pfeile zu sehen? 视觉编码器能学会看到箭头吗 ? 2505.19944v1 -
939 05-26 Constructing a BPE Tokenization DFA Aufbau einer BPE Tokenization DFA 正在构建 BPE 磁盘化 DFA 2405.07671v2 -
940 05-26 ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐 2505.19937v1 -
941 05-26 MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning MELoRA: Mini-Ensemble Low-Rank-Adapter für ein parametereffizientes Feintuning MELORA: 用于准计有效微调的小型组合式低射速适应器 2402.17263v3 -
942 05-26 GeoEdit: Geometric Knowledge Editing for Large Language Models GeoEdit: Geometrische Wissensbearbeitung für große Sprachmodelle GeoEdit:大语言模型的几何知识编辑 2502.19953v2 -
943 05-26 A Cognitive Writing Perspective for Constrained Long-Form Text Generation Eine Kognitive Schreibperspektive für die eingeschränkte Langform-Textgenerierung 受约束的长期形式制长式制式文本生成的认知式写作视角 2502.12568v3 -
944 05-26 JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs JailbreakRadar: Umfassende Bewertung von Jailbreak Attacken gegen LLMs JailbreakRadar:全面评估对LLMs的越狱袭击 2402.05668v3 -
945 05-26 Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Enigmata: Scaling Logical Reasoning in großen Sprachmodellen mit synthetischen überprüfbaren Puzzles 英格玛塔:在使用合成可核实拼图的大型语言模型中扩大逻辑理由 2505.19914v1 -
946 05-26 APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization APE: Ein datenzentrischer Benchmark für effiziente LLM-Anpassung in der Textzusammenfassung APE: 文本摘要中高效LLM适应数据中心基准 2505.19912v1 -
947 05-26 Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models Lineare Kontrolle des Testbewusstseins zeigt unterschiedliche Compliance in vernünftigen Modellen 对试验认知值的线性控制 2505.14617v2 -
948 05-26 ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows ScienceBoard: Bewertung multimodaler autonomer Agenzien in realistischen wissenschaftlichen Workflows 科学理事会:评估现实科学工作流程中的多式联运自治机构 2505.19897v1 -
949 05-26 Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program Große Sprachmodelle als autonome Raumfahrzeugbetreiber im Kerbal-Raumprogramm 作为Kerbal空间方案自主航天器运营商的大型语言模型 2505.19896v1 -
950 05-26 MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System MoC: Mischungen von Text Chunking Learners für retrieval-Augmented Generation System MoC: 用于检索增强型生成系统的 文本冲击学习者混合体 2503.09600v2 -
951 05-26 ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining ESLM: Risiko-Averse Selective Language Modeling für effizientes Vortraining ESLM: 有效培训前风险-反风险选择语言建模 2505.19893v1 -
952 05-26 Phare: A Safety Probe for Large Language Models Phare: Eine Sicherheitssonde für große Sprachmodelle 法尔:大语言模型的安全检测 2505.11365v4 -
953 05-26 APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs APB: Beschleunigen des verteilten Long-Context-Schlussfolgerungens durch Übergeben von komprimierten Kontextblöcken über GPUs APB: 通过通过横跨 GPU 传递压缩的上下文区块加速分布式长文字推文 2502.12085v2 -
954 05-26 Explaining the role of Intrinsic Dimensionality in Adversarial Training Erklärung der Rolle der Intrinsischen Dimensionalität im Adversarial Training 解释内在多面性在相互培训中的作用 2405.17130v2 -
955 05-26 HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation HS-STAR: Hierarchische Probenahme für selbstlernende Vernunfter über Schwierigkeitsschätzung und Budget-Umverteilung HS-STAR:通过难以估计和预算重新定位为自学理性者进行等级抽样 2505.19866v1 -
956 05-26 REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models REA-RL: Reflection-Aware Online-Verstärkungs-Lernen für effiziente große Vernunftmodelle REA-RL:为高效大型理由模型进行反思-软件在线强化学习 2505.19862v1 -
957 05-26 Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages Über die Spezialisierung hinaus: Benchmarking LLMs für die Transliteration indischer Sprachen 超越专业:为印度语言转写确定基准的LLMs 2505.19851v1 -
958 05-26 Improving Multilingual Math Reasoning for African Languages Mehrsprachige mathematische Grundlagen für afrikanische Sprachen verbessern 改进非洲语文多语种计算法 2505.19848v1 -
959 05-26 FoodTaxo: Generating Food Taxonomies with Large Language Models FoodTaxo: Generierung von Lebensmittel-Taxonomien mit großen Sprachmodellen FoodTaxo: 产生具有大语言模式的食品分类学 2505.19838v1 -
960 05-26 FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow FullFront: Benchmarking von MLLMs über den Full Front-End Engineering Workflow FullFront:在全前端工程工作流程中确定MLLMs基准 2505.17399v2 -
961 05-26 DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer’s Disease DECT: LLM-unterstütztes feinkörniges Sprachwissen und etikettierte und etikettierte Datengenerierung zur Diagnose der Alzheimer-Krankheit DECT:利用LLM协助的LLM协助的精精细语言知识以及用于诊断阿尔茨海默氏病的标签和标签保密数据生成 2502.04394v2 -
962 05-26 Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents Hierarchische Retrieval mit Evidenz-Kuration für Open-Domain-Finanzfrage-Antworten auf standardisierte Dokumente 标准化文件开放域财务问题证据说明的梯级检索 2505.20368v1 -
963 05-26 Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric Modell Utility Law: Bewertung von LLMs jenseits der Leistung durch Mechanism Interpretable Metric 示范效用法:通过解释计量机制评价业绩以外的有限利妇女 2504.07440v3 -
964 05-26 Generalizable Prompt Learning of CLIP: A Brief Overview Generalisierbares Prompt Lernen von CLIP: Ein kurzer Überblick CLIP:简要概述 2503.01263v5 -
965 05-26 Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation Registrierung von Quellen-Token zu Zielspracheräumen in mehrsprachiger neuraler maschineller Übersetzung 多种语言神经机翻译中目标语言空间 2501.02979v3 -
966 05-26 Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective Entschlüsselung bahngestützter LLM-Reasoning: Eine Optimierungsperspektive 解码轨迹辅助LLM 理由说明:优化视角 2505.19815v1 -
967 05-26 Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks Erforschung des Bewusstseins in LLMs: Eine systematische Untersuchung von Theorien, Implementierungen und Grenzrisiken 探索LLMs中的觉悟:对理论、实施和前沿风险的系统调查 2505.19806v1 -
968 05-26 Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation Compliance-to-Code: Verbesserung der finanziellen Compliance-Prüfung durch Codegenerierung 遵守到守则:通过代码生成加强金融合规检查 2505.19804v1 -
969 05-26 QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language QueryAttack: Jailbreaking Aligned Large Language Models Verwendung strukturierter, nicht-natürlicher Abfragesprache 查询:使用结构化非自然查询语言的监狱破碎的大型语言统一模式 2502.09723v3 -
970 05-26 MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs MOLE: Metadatenextraktion und -validierung in wissenschaftlichen Papieren mit LLMs MOLE: 利用LLMs在科学文件中提取和验证元数据 2505.19800v1 -
971 05-26 R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning R1-T1: Volle Förderung der Übersetzungsfähigkeit in LLMs über das Reasoning Learning R1-T1:通过解释学习充分激励LLMs翻译能力 2502.19735v3 -
972 05-26 O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering O$^2$-Sucher: Ein Such-basiertes Agentenmodell für Open-Domain Open-Ended Question Answering O$^2$-Searcher:基于搜索的开放域开放式问题解答代理模式 2505.16582v2 -
973 05-26 Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification Analyse politischer Bias in LLMs über zielorientierte Sentiment-Klassifikation 通过定向感知分类分析LLMs中的政治偏见 2505.19776v1 -
974 05-26 What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs Was spielt bei vielen scharfen Angriffen wirklich eine Rolle? Eine empirische Studie über langanhaltende Schwachstellen in LLMs 许多热攻击的真正问题是什么? 2505.19773v1 -
975 05-26 Query Performance Prediction using Relevance Judgments Generated by Large Language Models Abfrage der Leistungsvorhersage anhand von Relevanzurteilen, die von großen Sprachmodellen erzeugt werden 使用大语言模型产生的相关性判断的查询性绩效预测 2404.01012v3 -
976 05-26 Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO Verständnis der Leistungslücke im Preference Learning: Eine Dichotomie von RLHF und DPO 了解优先学习方面的绩效差距:RLHF和DPO的二分切开术 2505.19770v1 -
977 05-26 T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search T^2Agent Ein Tool-augmented Multimodale Fehlinformation Detection Agent mit Monte Carlo Baumsuche T^2Agent 采用蒙特卡洛树搜索的工具增强型多模态错误信息检测代理 2505.19768v1 -
978 05-26 SGM: A Framework for Building Specification-Guided Moderation Filters SGM: Ein Rahmen für gebäudespezifikationsgeführte Moderationsfilter SGM: 构建规格引导调控过滤器的框架 2505.19766v1 -
979 05-26 In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement In-Context-Demonstrationsfragen: Zur Prompt-Optimierung für Pseudo-Supervision-Verfeinerung 内文示范事项:关于Pseudo-监督改进的迅速优化 2410.03124v2 -
980 05-26 CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement CIDRe: Ein referenzfreies Multi-Aspekt-Kriterium für die Qualitätsmessung von Code Comment CIDRe: 守则评论质量衡量的无参考性、无参考性、多特征的多标准标准 2505.19757v1 -
981 05-26 Efficient Reasoning via Chain of Unconscious Thought Effiziente Vernunft durch Kette des unbewussten Denkens 通过无意识思维链进行高效率的思考 2505.19756v1 -
982 05-26 NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering NeuSym-RAG: Hybrides neurales Symbolisches Retrieval mit Multiview-Strukturierung für PDF-Fragebeantwortung NeuSym-RAG: PDF 问题解答混合神经符号回收与多视图结构结构 2505.19754v1 -
983 05-26 Discrete Markov Bridge Diskretierte Markov-Brücke 分立马尔科夫桥 2505.19752v1 -
984 05-26 Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents Mobile-Bench-v2: Ein realistischerer und umfassenderer Benchmark für VLM-basierte mobile Agenten 移动-Bench-v2:基于VLM的移动剂更加现实和全面的基准 2505.11891v2 -
985 05-26 Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models Erforschung der Auswirkungen von Corpus Diversity auf vorschulische Sprachmodelle 探讨公司多样性对财务方面缺乏培训语言模式的影响 2310.13312v2 -
986 05-26 Stuffed Mamba: Oversized States Lead to the Inability to Forget Gefüllte Mamba: Übergroße Staaten führen zu der Unfähigkeit zu vergessen 马姆巴:国家规模过大,导致无法忘却 2410.07145v2 -
987 05-26 Distilling Closed-Source LLM’s Knowledge for Locally Stable and Economic Biomedical Entity Linking Brennen von geschlossener Quelle LLMs Wissen für lokal stabile und wirtschaftliche biomedizinische Entitätsverknüpfung 保留秘密来源LLM的当地稳定和经济生物医学实体联系知识 2505.19722v1 -
988 05-26 Graceful Forgetting in Generative Language Models Anmutiges Vergessen in generativen Sprachmodellen 在创用语言模型中优雅地忘却 2505.19715v1 -
989 05-26 MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning MT$^{3}$: Skalierung von MLLM-basierten Textbildmaschinenübersetzungen über Multi-Task-Verstärkungslernen MT$^{3}$:通过多任务强化学习,扩大基于MLLM的文本图像机翻译 2505.19714v1 -
990 05-26 FamilyTool: A Multi-hop Personalized Tool Use Benchmark FamilyTool: Ein Multi-Hop Personalisiertes Tool Benchmark FamilyTool:多希望个性化工具使用基准 2504.06766v2 -
991 05-26 Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision Fehler-Typierung für intelligentere Belohnungen: Verbesserung der Prozess-Reward-Modelle mit Fehler-Aware Hierarchische Überwachung 为智能奖赏打字出错: 改进有错误- 软件等级监督的流程评分模型 2505.19706v1 -
992 05-26 Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models Nutzung von wichtigen Stichproben zur Abgleichung von Alignment-Modulen aus großen Sprachmodellen 从大语言模型中利用重要性取样到分离对齐模块 2505.19700v1 -
993 05-26 Large Language Models for Planning: A Comprehensive and Systematic Survey Große Sprachmodelle für die Planung: Eine umfassende und systematische Erhebung 规划大语言模式:全面和系统调查 2505.19683v1 -
994 05-26 Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings Aufwärmen, bevor Sie trainieren: Entsperren der allgemeinen Vernunft in ressourcenbeschränkten Einstellungen 在您之前暖暖的列车 : 在受资源限制的设置中解锁一般理由 2505.13718v2 -
995 05-26 Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors Ihr Sprachmodell kann geheim wie Menschen schreiben: Kontrastive Paraphrasenangriffe auf LLM-generierte Textdetektoren 您的语言模式可以像人类一样秘密写作:对LLM-Generated 文本探测器的矛盾性插词攻击 2505.15337v2 -
996 05-26 Detecting LLM-Generated Korean Text through Linguistic Feature Analysis LLM-generierter koreanischer Text durch Linguistik-Feature-Analyse erkennen 通过语言特征分析探测LLM-发光韩文文本 2503.00032v3 -
997 05-26 UniICL: An Efficient Unified Framework Unifying Compression, Selection, and Generation UniICL: Ein effizientes einheitliches Framework, das Komprimierung, Auswahl und Generierung vereint UNIICL: 统一压缩、甄选和生成的有效统一框架 2405.17062v3 -
998 05-26 KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Low-Resource Speech Translation Systems des KIT für IWSLT2025: Systemverbesserung mit synthetischen Daten und Modellregularisierung KIT的IWSLT2025低资源语音翻译系统:利用合成数据和模型规范化加强系统 2505.19679v1 -
999 05-26 Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs Erdungssprache mit Vision: Eine bedingte Gegenseitige Informationskalibrierte Dekodierungsstrategie zur Reduktion von Halluzinationen in LVLMs 具有远见的地面语言:减少低地低地飘移中幻觉的有条件相互信息校准标记战略 2505.19678v1 -
1000 05-26 Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement Kalibrierung vortrainierter Sprachklassifikatoren auf LLM-generierten Noisy-Labels über iterative Veredelung 通过迭代精炼校准LLM产生的噪音标签上的训练前语言分类校准 2505.19675v1 -
1001 05-26 Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models Trennen Sie das Weizen vom Chaff: Ein Post-Hoc-Ansatz für die Wiederausrichtung der Sicherheit für feingetönte Sprachmodelle 将小麦与Chaff区分开来:对精美语言模式的安全调整后方法 2412.11041v3 -
1002 05-26 A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit Ein voll generativer Motivationsgespräch Berater Chatbot für den Umzug Raucher auf dem Weg zu der Entscheidung zu beenden 全面创造动机的访谈参赞Chatbot 移动吸烟者争取决定退出 2505.17362v2 -
1003 05-26 Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models Reformieren von Repräsentationsräumen, um Sicherheit und Überrejektion in großen Audio-Sprachenmodellen auszugleichen 重塑代表空间以平衡大型音频语言模型中的安全和过度拒绝 2505.19670v1 -
1004 05-26 GTR: Graph-Table-RAG for Cross-Table Question Answering GTR: Graph-Table-RAG für Cross-Table-Frageantworten GTR:用于跨表问题解答的图表表-RAG 2504.01346v3 -
1005 05-26 LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation LeCoDe: Ein Benchmark-Datensatz für interaktive Rechtsberatungs-Dialog-Evaluierung LeCoDe:交互式法律协商对话评价的基准数据集 2505.19667v1 -
1006 05-26 Conditioning LLMs to Generate Code-Switched Text LLMs konditionieren, um codegeschalteten Text zu erzeugen 将LLM 限定为生成代码开关文本 2502.12924v2 -
1007 05-26 Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen 通过差异学习发现多语种轻视认知缺陷的单形多语种描述 2505.17067v2 -
1008 05-26 GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models GenKI: Verbesserung der Open-Domain-Fragebeantwortung mit Wissensintegration und kontrollierbarer Generierung in großen Sprachmodellen GenKI:加强以大语言模式在知识整合和可控生成方面答案的开放性问题 2505.19660v1 -
1009 05-26 A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language? Eine Geschichte von zwei Strukturen: Erfassen LLMs die Fraktalkomplexität der Sprache? 两种结构的故事:LLMs是否捕捉语言的分形复杂性? 2502.14924v2 -
1010 05-26 Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation Auswählen, Lesen und Schreiben: Ein multi-agenter Rahmen volltextbasierter verwandter Arbeit Generation 选择、读取和写入:全文本相关工作生成的多机构代理框架 2505.19647v1 -
1011 05-26 Interleaved Reasoning for Large Language Models via Reinforcement Learning Interleaved Reasoning für große Sprachmodelle durch Verstärkungslernen 通过强化学习促进大语言模式 2505.19640v1 -
1012 05-26 Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models Segment First or Comprehend First? Erforschen Sie die Grenzen der unüberwachten Wortsegmentierung mit großen Sprachmodellen 首段或首段理解 ? 探索以大语言模式进行不受监督的单词分割的限制 。 2505.19631v1 -
1013 05-26 DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog DoctorAgent-RL:多轮临床对话多机构合作强化学习系统 2505.19630v1 -
1014 05-26 Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models Denken Sie noch einmal! Die Wirkung von Test-Time Compute auf Präferenzen, Meinungen und Überzeugungen von großen Sprachmodellen 再想想!测试时间计算对大语言模式的优惠、意见和信仰的影响 2505.19621v1 -
1015 05-26 Lens: Rethinking Multilingual Enhancement for Large Language Models Lens: Mehrsprachige Erweiterung für große Sprachmodelle neu denken Lens:重新思考大语言模型的多语种增强 2410.04407v2 -
1016 05-26 Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization Erforschung der Verallgemeinerbarkeit von Factual Halluzination Mitigation durch die Verbesserung präziser Wissensnutzung 探索通过增强利用精确的知识来减轻事实幻觉的普及性 2502.19127v2 -
1017 05-26 Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs Verbergen die Hidden States etwas? Testen der Grenzen der Faktizitäts-Encoding-Fähigkeiten in LLMs 隐藏状态是否隐藏着什么?测试LLM中事实性编码能力的限度 2505.16520v2 -
1018 05-26 Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically Sprachen in mehrsprachigen Sprachstiftungsmodellen richten sowohl phonetisch als auch semantisch 多语种语言语言基金会 2505.19606v1 -
1019 05-26 Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis Machine Translation Models für Englisch-Hindi Sprachpaare bewerten: Eine vergleichende Analyse 英语-印地语语言对的机器翻译模型评估:比较分析 2505.19604v1 -
1020 05-26 Preference Optimization by Estimating the Ratio of the Data Distribution Präferenzoptimierung durch Schätzung des Verhältnisses der Datenverteilung 通过估计数据分配比率实现最佳优化 2505.19601v1 -
1021 05-26 Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar Inkonsistente Tokenisierungen führen dazu, dass Sprachmodelle von japanischer Grammatik verblüfft werden. 前后不一致的招数导致语言模式被日语语法所混淆 2505.19599v1 -
1022 05-26 Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v1 -
1023 05-26 Multi-Agent Collaboration via Evolving Orchestration Multi-Agenten-Zusammenarbeit über Evolving Orchestration 通过不断演变的管弦化多机构协作 2505.19591v1 -
1024 05-26 SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation SepALM: Audio Sprachmodelle sind Fehlerkorrekturen für robuste Sprachtrennung SepALM: 音频语言模型是强力语音分离错误纠正器 2505.03273v2 -
1025 05-26 Learning to Reason without External Rewards Vernunft lernen ohne externe Belohnungen 学习没有外部奖励的理性 2505.19590v1 -
1026 05-26 Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing Beschleunigung der Vorfüllung für Langkontext-LLMs über Sparse Pattern Sharing 通过 Sparse 模式共享加速预填长文本 LLMs 2505.19578v1 -
1027 05-26 Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch Cheems: Eine praktische Anleitung für das Bauen und Evaluieren chinesischer Belohnungsmodelle von Scratch Cheems:从Scratch建立和评估中国奖励模型实用指南 2502.17173v3 -
1028 05-26 DocMEdit: Towards Document-Level Model Editing DocMEdit: Auf dem Weg zur Dokumenten-Level-Modellbearbeitung DocMEdit:走向文件级别示范编辑 2505.19572v1 -
1029 05-26 Rethinking Text-based Protein Understanding: Retrieval or LLM? Rethinking Text-basierte Protein-Verständnis: Retrieval oder LLM? 重新思考基于文本的蛋白质理解:检索还是LLM? 2505.20354v1 -
1030 05-26 Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights Automatisierter Text-zu-Tisch für reasoning-intensive Tabelle QA: Pipeline-Design und Benchmarking-Insights QA:管道设计和基准透视 2505.19563v1 -
1031 05-26 On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation On-Policy-Selbstjustierung mit feinkörnigem Wissen Feedback zur Halluzination Mitigation 政策上与精精精细知识的自我协调以缓解幻觉的反馈 2406.12221v6 -
1032 05-26 Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent Benchmarking multimodaler Retrieval Augmented Generation mit dynamischem VQA-Datensatz und selbstadaptivem Planungs-Agent 具有动态VQA数据集和自适应规划剂的多式回收增强型 2411.02937v5 -
1033 05-26 Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents Auf dem Weg zu Multi-Granularität Memory Association und Auswahl für langfristige Conversational Agents 走向多群体记忆协会和选择长期对话代理人 2505.19549v1 -
1034 05-26 How Syntax Specialization Emerges in Language Models Wie Syntax Spezialisierung in Sprachmodelle auftaucht 语言模式中的语法专门化如何出现 2505.19548v1 -
1035 05-26 Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements Fraud-R1 : Multi-Round Benchmark für die Bewertung der Robustheit von LLM gegen Augmented Betrug und Phishing Inducings Fraud-R1:评估LLM抵御增强型欺诈和钓鱼诱骗的稳健性的多轮基准 2502.12904v2 -
1036 05-26 R3: Robust Rubric-Agnostic Reward Models R3: Robuste Rubric-Agnostische Belohnungsmodelle R3:坚固的Rubric-不可知奖赏模型 2505.13388v2 -
1037 05-26 Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Amulet: Neuausrichtung während der Testzeit für Personalisierte Präferenzanpassung von LLMs Amulet:在测试时重新对齐,以适应LLM的个性化偏好 2502.19148v2 -
1038 05-26 DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients DoctorRAG: Medizinische RAG Durch Textabstufungen Wissen mit Patient Analogie fusionieren 医生RAG:通过文字梯度将医学RAG知识与病人分析知识与病人分析相融合 2505.19538v1 -
1039 05-26 Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation Können große Sprachmodelle ein guter emotionaler Unterstützer sein? Preference Bias auf Emotional Support Conversation abmildern 大语言模式能否成为情感支持的良好支持者? 2402.13211v3 -
1040 05-26 FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models FlowCut: Redundanz über Informationsfluss für effiziente Vision-Sprachenmodelle neu denken 流程:通过信息流动重新思考通过信息流动实现高效愿景-语言模型的冗余 2505.19536v1 -
1041 05-26 SLOT: Sample-specific Language Model Optimization at Test-time SLOT: Beispielspezifische Sprachmodelloptimierung zur Testzeit SLOT: 测试时特定抽样语文示范模式优化 2505.12392v2 -
1042 05-26 SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback SIPDO: Closed-Loop Prompt Optimierung über Synthetic Data Feedback SIPDO:通过合成数据反馈,通过闭闭电话快速优化 2505.19514v1 -
1043 05-26 Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models Kausaldestillation: Übertragen strukturierter Erklärungen von großen zu kompakten Sprachmodellen 因果蒸馏:将结构化解释从大语言模式转移到集约语言模式 2505.19511v1 -
1044 05-26 StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization StepSearch: LLMs entzünden Suche Fähigkeit über Schritt-Wise Proximal Policy Optimization 切换搜索:通过 “ 一步步Wise “ 方案最佳政策优化化,将LLMs搜索能力化 2505.15107v2 -
1045 05-26 DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation DOGe: Defensive Output Generation für LLM-Schutz vor Wissensdestillation DOGe: 防知识蒸馏保护LLM的防御性产出产生 2505.19504v1 -
1046 05-26 QAEncoder: Towards Aligned Representation Learning in Question Answering System QAEncoder: Auf dem Weg zu einem abgestimmten Repräsentationslernen im Fragebeantwortungssystem QAEncoder:在问题解答系统中实现代表性统一学习 2409.20434v2 -
1047 05-26 Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents Anveshana: Ein neuer Benchmark-Datensatz für Cross-Lingual Information Retrieval über englische Abfragen und Sanskrit-Dokumente Anveshana:英语问答和梵文文件跨语言信息检索新基准数据集 2505.19494v1 -
1048 05-26 NExtLong: Toward Effective Long-Context Training without Long Documents NExtLong: Auf dem Weg zu effektiver Langtext-Schulung ohne lange Dokumente NExtLong:争取在无长文件的情况下进行有效长文培训 2501.12766v2 -
1049 05-26 When can isotropy help adapt LLMs’ next word prediction to numerical domains? Wann kann Isotropie helfen, die nächste Wortvorhersage von LLMs an numerische Domänen anzupassen? 何时才能帮助LLMS的下一个字词预测适应数字域? 2505.17135v2 -
1050 05-26 PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims PASS-FC: Progressives und adaptives Suchschema für die Prüfung umfassender Ansprüche PASS-FC: 全面索赔事实核实渐进和适应性搜索计划 2504.09866v2 -
1051 05-26 HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning HellaSwag-Pro: Ein großformatiger zweisprachiger Benchmark zur Bewertung der Robustheit von LLMs in Commonsense Reasoning HellaSwag-Pro:用于评价常识理由解释中LLMs是否强劲的大型双语双语基准 2502.11393v2 -
1052 05-26 MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation MTR-Bench: Umfassender Benchmark für die Bewertung von Multi-Turn-Reasoning MTR-Bench:多轮推理评价综合基准 2505.17123v2 -
1053 05-26 Parrot: Multilingual Visual Instruction Tuning Parrot: Mehrsprachiges visuelles Instruction Tuning Parrot: 多语言视觉指令微调 2406.02539v3 -
1054 05-26 ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search ARise: Auf dem Weg zu einer wissensbasierten Vernunft durch Risiko-Adaptive Search ARise:通过风险-减轻风险的搜索寻求知识推理 2504.10893v2 -
1055 05-26 Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin Auf dem Weg zum Ende der Ausbildung zur automatischen Spracherkennung für nigerianische Pidgin 走向尼日利亚皮吉纳自动语音识别的端至端培训 2010.11123v2 -
1056 05-26 FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models FastCuRL: Curriculum-Verstärkungs-Lernen mit Stage-Wise-Kontext-Skalierung für effizientes Training R1-ähnliche Reasoning-Modelle FastCuRL: 课程强化学习,分阶段为高效率培训提供R1类理由模型 2503.17287v4 -
1057 05-26 BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs BizFinBench: Ein geschäftsgetriebener Real-World-Finanz-Benchmark für die Bewertung von LLMs BizFinBench:商业驱动的现实世界评价LLM的金融基准 2505.19457v1 -
1058 05-26 HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation HopRAG: Multi-Hop-Reasoning für die Logic-Aware Retrieval-Augmented Generation HopRAG: 面向逻辑感知检索增强生成的多跳推理 2502.12442v2 -
1059 05-26 Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Pixel Reasoner: Anreize für Pixel-Space-Reasoning mit kuriositätsgetriebenem Verstärkungslernen 像素理由:激励像素空间与好奇-驱动强化学习相结合的像素空间理由 2505.15966v2 -
1060 05-26 Discovering Forbidden Topics in Language Models Verbotene Themen in Sprachmodellen entdecken 发现语言模型中的禁止专题 2505.17441v2 -
1061 05-26 Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering Ausrichtung großer Sprachmodelle, um Anweisungen zu folgen und weniger Halluzinate über effektive Datenfilterung 通过有效的数据过滤使大语言模型与遵循指令和低致幻模型相匹配 2502.07340v3 -
1062 05-26 Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI Vibe Coding vs. Agentic Coding: Grundlagen und praktische Implikationen von Agentic AI Vibe 编码与 Agentic 编码:Agent AI 的基本要素和实际影响 2505.19443v1 -
1063 05-26 The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models Die Geburt des Wissens: Emergente Funktionen über Zeit, Raum und Maßstab in großen Sprachmodellen 知识的诞生:跨越时间、空间和大语言模型规模的新兴特征 2505.19440v1 -
1064 05-26 Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers Surrogate Signale aus Format und Länge: Verstärkungslernen zur Lösung mathematischer Probleme ohne Grundwahrheitsantworten 格式和长度的代用信号:为解决没有事实答案的数学问题进行强化学习 2505.19439v1 -
1065 05-26 Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents Task Memory Engine: Raumspeicher für robuste, mehrstufige LLM-Agenten 任务记忆引擎:强力多级LLM代理器的空间内存 2505.19436v1 -
1066 05-26 Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection Weg zur Vernunft: Adaptives Routing für die LLM und die Strategieauswahl 原因路线:LLM和理由选择战略的适应性分流 2505.19435v1 -
1067 05-26 One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs One-Shot reicht: Konsolidierung von Multi-Turn-Angriffen in effiziente Single-Turn-Prompts für LLMs 将多发攻击合并为LLMs的高效单发提示 2503.04856v2 -
1068 05-26 Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation Strategische Markteinblicke mit großen Sprachmodellen ableiten: Ein Benchmark für die vorausschauende kontrafaktische Generation 具有大语言模式的战略市场展望:前瞻性反实际生成基准 2505.19430v1 -
1069 05-26 Rhapsody: A Dataset for Highlight Detection in Podcasts Rhapsody: Ein Datensatz für Highlight-Erkennung in Podcasts Rhapsody: 用于播客中高亮度探测的数据集 2505.19429v1 -
1070 05-26 Frictional Agent Alignment Framework: Slow Down and Don’t Break Things Frictional Agent Alignment Framework: Langsam nach unten und nicht brechen Dinge 波动剂对齐框架:慢下来,不要打破 2505.19428v1 -
1071 05-26 MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision MAS-ZERO: Konzipieren von Multi-Agenten-Systemen mit Zero Supervision MAS-ZERO: 设计无监督的多机构系统 2505.14996v2 -
1072 05-26 The Role of Diversity in In-Context Learning for Large Language Models Die Rolle der Vielfalt im In-Context-Lernen für große Sprachmodelle 多样性在为大语言模式进行内文学习方面的作用 2505.19426v1 -
1073 05-26 Each Graph is a New Language: Graph Learning with LLMs Jeder Graph ist eine neue Sprache: Graph Learning mit LLMs 每图都是一种新语言:用LLMM学习图表 2501.11478v3 -
1074 05-26 Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers Three Minds, One Legend: Jailbreak Large Reasoning Model mit adaptiven Stacked Ciphers 三个心灵,一个传说:监狱破裂大型理性模型,有适应性堆叠加密码 2505.16241v3 -
1075 05-26 Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering Selbstreflektierende Planung mit Wissensgraphen: Verbesserung der LLM-Begründetheit bei der Beantwortung von Fragen 带有知识图的自反规划:加强LLM 问题解答的可靠性 2505.19410v1 -
1076 05-26 CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems CoTGuard: Mit Chain-of-Thought-Triggering für Urheberrechtsschutz in Multi-Agent LLM-Systemen CoTGuard: 利用思维链触发在多智能体LLM系统中保护版权 2505.19405v1 -
1077 05-26 Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs Können LLMs helfen, Erkenntnisse über LLMs zu enthüllen? Eine groß angelegte, sich entwickelnde Literaturanalyse von Frontier LLMs LLMs 帮助发现关于LLM的见识? 大型、不断发展的前沿LMS文学分析 2502.18791v3 -
1078 05-26 ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL ROUTE: Robustes Multitask Tuning und Zusammenarbeit für Text-zu-SQL ROUTE: 文本到 SQL 的强有力的多任务调试和协作 2412.10138v3 -
1079 05-26 What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context for Multi-Hop QA Welches externe Wissen wird von LLMs bevorzugt? Charakterisieren und Erforschen von Beweiskette im unvollkommenen Kontext für Multi-Hop QA LLM偏好何种外部知识?刻画并探索多跳问答不完美上下文中的证据链 2412.12632v3 -
1080 05-26 Simple and Effective Baselines for Code Summarisation Evaluation Einfache und effektive Grundlagen für die Code-Summarisation-Bewertung 用于代码摘要评价的简单有效基线 2505.19392v1 -
1081 05-26 gec-metrics: A Unified Library for Grammatical Error Correction Evaluation gec-metrics: Eine einheitliche Bibliothek für die Bewertung der grammatischen Fehlerkorrektur gec-metrics:用于语法错误纠正评价的统一库 2505.19388v1 -
1082 05-26 SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence SelfElicit: Ihr Sprachmodell weiß geheim, wo die relevanten Beweise sind SelfElicit: 您的语言模型秘密知道相关证据在哪里 2502.08767v2 -
1083 05-26 GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor GSA-TTS : Auf dem Weg zur Null-Schuss-Sprachsynthese auf Basis eines graduellen Style-Adapters GSA-TTS:在渐进式样调适器基础上实现零热话合成 2505.19384v1 -
1084 05-26 JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment JingFang: Ein sachverständiges Sprachmodell für die traditionelle chinesische Medizin Klinische Beratung und Syndromdifferenzierungsbasierte Behandlung JingFang:中国传统医学临床咨询和综合症差别治疗专家级大语言模式 2502.04345v2 -
1085 05-26 Identifying Knowledge Editing Types in Large Language Models Identifikation von Wissensbearbeitungstypen in großen Sprachmodellen 确定大语言模式中的知识编辑类型 2409.19663v3 -
1086 05-26 Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality Glaube Attribution als mentale Erklärung: Die Rolle der Genauigkeit, Informatizität und Kausalität 信仰归属作为精神解释:准确性、信息化和因果关系的作用 2505.19376v1 -
1087 05-26 MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations MOSAIC: Modellierung sozialer KI für die Verbreitung von Inhalten und Regulierung in Multi-Agent-Simulationen MOSAIC:多机构模拟中内容传播和监管模拟社会AI 2504.07830v2 -
1088 05-25 (7) ChartLens: Fine-grained Visual Attribution in Charts ChartLens: Feinkörnige visuelle Zuordnung in Charts 图表边:图表中精细的可视属性 2505.19360v1 -
1089 05-25 Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval Optimierte Text-Embedding-Modelle und Benchmarks für die Amharische Passage Retrieval 阿姆光通过通过检索的最佳文本嵌入模型和基准 2505.19356v1 -
1090 05-25 Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Social Media Engagement Schätzung des Online-Einflusses braucht kausale Modellierung! Gegenfaktische Analyse von Social Media Engagement 估计在线影响需求因果建模:反事实分析社会媒体参与 2505.19355v1 -
1091 05-25 Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning Datumsfragmente: Ein versteckter Engpass an Tokenisierung für zeitliche Vernunft 日期碎片: 用于时间原因的 托肯化的隐藏瓶头 2505.16088v2 -
1092 05-25 GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance GC-KBVQA: Ein neues Vier-Stufen-Framework zur Verbesserung der wissensbasierten visuellen Frageantwortleistung GC-KBVQA:加强基于知识的视觉回答问题业绩的四步新框架 2505.19354v1 -
1093 05-25 Optimizing Decomposition for Optimal Claim Verification Optimierung der Zersetzung für eine optimale Prüfung des Anspruchs 优化最佳索赔核实的分解 2503.15354v2 -
1094 05-25 Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation Architekturen des Irrtums: Eine philosophische Untersuchung der KI- und menschlichen Code-Generation 错误结构结构:对大赦国际和人类代码生成的哲学调查 2505.19353v1 -
1095 05-25 PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims PatentScore: Mehrdimensionale Bewertung von LLM-generierten Patentansprüchen 专利核心:对LLM-专利专利权主张的多维评价 2505.19345v1 -
1096 05-25 LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs LLM-basiertes Prompt-Ensemble für zuverlässige medizinische Entitätserkennung von EHRs 以LLM为基础,从EHRs为可靠医疗实体识别而迅速加入 2505.08704v2 -
1097 05-25 Regress, Don’t Guess – A Regression-like Loss on Number Tokens for Language Models Regress, nicht raten – Ein Rückschritt-ähnlicher Verlust an Zahlenzeichen für Sprachmodelle Regress, don’t guess - 语言模型数字调的回归式损失 2411.02083v2 -
1098 05-25 Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data? Sind Transformer durch die Verbindung getrennter Kenntnisse in Trainingsdaten in der Lage, Vernunft zu erreichen? 将培训数据方面的单独知识连接起来的变换者是否具有理性? 2501.15857v6 -
1099 05-25 Patent-CR: A Dataset for Patent Claim Revision Patent-CR: Ein Datensatz für Patentanspruchsrevision 专利专利权:专利权索赔修订数据集 2412.02549v2 -
1100 05-25 ODIN: A NL2SQL Recommender to Handle Schema Ambiguity ODIN: Ein NL2SQL-Empfänger zum Umgang mit Schema-Ambiguität ODIN: 处理 Schema 模糊性的NL2SQL建议 2505.19302v1 -
1101 05-25 SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking SituatedThinker: LLM-Grundlegung mit Real-World durch Situated Thinking SituatedThinker:通过情境思维将LLM推理扎根于现实世界 2505.19300v1 -
1102 05-25 Can Large Language Models Generate High-quality Patent Claims? Können große Sprachmodelle hochwertige Patentansprüche generieren? 大语言模型能否产生高质量的专利索赔? 2406.19465v3 -
1103 05-25 Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions Nicht-Just-Scaling-Gesetze: Auf dem Weg zu einem besseren Verständnis der Auswirkungen von Sprachmodellgestaltungsentscheidungen 《非公正衡量法律:更好地了解语言设计示范设计决定下游影响》 2503.03862v2 -
1104 05-25 A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations Ein notwendiger Schritt zur Treue: Konsistenz in Freitexterklärungen messen und verbessern 迈向信仰的必要步骤:衡量和增进自由解释中的一致性 2505.19299v1 -
1105 05-25 Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data Prompting ist nicht alles, was Sie brauchen! Bewertung LLM Agent Simulation Methoden mit Real-World Online Kunden Verhalten Daten 提示并非你所需要的全部! 使用真实世界在线客户行为数据评估LLM代理模拟方法 2503.20749v5 -
1106 05-25 Towards Reliable Large Audio Language Model Zuverlässiges großes Audio-Sprachenmodell 努力实现可靠的大型音频语言模式 2505.19294v1 -
1107 05-25 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? 100-LongBench: Sind de facto Long-Context-Benchmarks wortwörtlich die Lang-Context-Fähigkeit zu bewerten? 100-LongBench:事实上的长文本基准是否实际评价长文本能力? 2505.19293v1 -
1108 05-25 Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Ausbalancieren von Wahrhaftigkeit und Aufklärung mit unsicherer Anleitung Feintuning 平衡真实和知情与不确定性软件指示 2502.11962v2 -
1109 05-25 Next Token Prediction Is a Dead End for Creativity Nächster Token Prediction ist ein totes Ende für Kreativität 下个 Tok 预测是创造性的死胡同 2505.19277v1 -
1110 05-25 TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding TheoremExplainAgent: Auf dem Weg zu videobasierten multimodalen Erklärungen für LLM-Theorem-Verständnis 理论专家:争取为LLM理论理解提供基于视频的多式解释 2502.19400v2 -
1111 05-25 A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations Ein individueller Gesprächs-Benchmark: Auf dem Weg zur Simulation personalisierter Gespräche 个人对话基准:模拟个人对话 2505.14106v2 -
1112 05-25 Unveiling Dual Quality in Product Reviews: An NLP-Based Approach Enthüllung von Dual Quality in Produktbewertungen: Ein NLP-basierter Ansatz 产品审查中不固定的双重质量:基于NLP的方法 2505.19254v1 -
1113 05-25 Do Vision-Language Models Really Understand Visual Language? Verstehen Vision-Language-Modelle wirklich visuelle Sprache? 视觉语言模型真的理解视觉语言吗? 2410.00193v3 -
1114 05-25 Rethinking Chain-of-Thought from the Perspective of Self-Training Überdenken der Gedankenkette aus der Perspektive des Selbst-Trainings 从自我培训的角度重新思考一系列问题 2412.10827v4 -
1115 05-25 PATS: Process-Level Adaptive Thinking Mode Switching PATS: Prozess-Level-Adaptive-Denkmodus-Umschaltung PATS: 进程层面适应性思维模式转换 2505.19250v1 -
1116 05-25 ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty VergleichQA: Bewertung der Faktizität Robustheit von LLMs durch Wissensfrequenzkontrolle und Unsicherheit 比较QA:通过知识频率控制和不确定性评估LLMs的实际情况 2412.20251v2 -
1117 05-25 LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models LLLMs: Eine datengestützte Untersuchung der sich entwickelnden Forschung über Grenzen großer Sprachmodelle LLLMs:关于大语言模式限制的不断发展的研究数据驱动调查 2505.19240v1 -
1118 05-25 Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator Bewertung der Textkreativität über verschiedene Domänen: Ein Datensatz und großer Sprachmodell-Evaluator 评价跨不同域域的文本创造性:数据集和大语言模式评价员 2505.19236v1 -
1119 05-25 Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets Zuverlässige Ad-hoc-Wissenschaftliche Informationsextraktion: Eine Fallstudie zu zwei Materialdatensätzen 争取实现可靠的特设科学信息提取:关于两个材料数据集的个案研究 2406.05348v3 -
1120 05-25 VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models VerifyBench: Benchmarking Referenzbasierte Prämiensysteme für große Sprachmodelle 核查时间:大语言模式基准参考奖励制度基准 2505.15801v2 -
1121 05-25 GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling GUARDIAN: LLM-Multiagent-Kollaborationen mit zeitlicher Graphenmodellierung sichern GUARDIAN: 保护LLM 多机构协作与时间图建模 2505.19234v1 -
1122 05-25 Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More Sprachmodelle, Graph Searching und Überwachung Ehebruch: Wenn mehr Aufsicht weniger ist und wie man mehr macht 语言模式、图图搜索和监督通配:越少越少监督,如何做越多 2503.10542v3 -
1123 05-25 The Impact of LoRA Adapters for LLMs on Clinical NLP Classification Under Data Limitations Die Auswirkungen von LoRA-Adaptern für LLMs auf die klinische NLP-Klassifikation unter Datenbeschränkungen LLM的LoRA适配器对数据限制下临床NLP分类的影响 2407.19299v2 -
1124 05-25 The Overthinker’s DIET: Cutting Token Calories with DIfficulty-AwarE Training Das DIET des Überdenkers: Schneiden von Token Calories mit DIfficulty-AwarE Training 过度思考家的DIET: 利用DIfficulty-AwarE 培训来切开托肯卡路里 2505.19217v1 -
1125 05-25 When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas Wenn Ethik und Payoffs Diverge: LLM-Agenten in moralisch belasteten sozialen Dilemmas 道德与报酬:道德道德与报酬:道德界的LLM代理人员充斥社会困境 2505.19212v1 -
1126 05-25 Conformity in Large Language Models Konformität in großen Sprachmodellen 大语言模式的合规性 2410.12428v2 -
1127 05-25 Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models Segment-Level Diffusion: Ein Framework für kontrollierbare Langform-Generation mit Diffusions-Sprachmodellen 局部级传播:具有传播语言模型的可控长龄一代框架 2412.11333v2 -
1128 05-25 MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search MOOSE-Chem2: Erforschung der LLM-Grenzwerte in feinkörniger wissenschaftlicher Hypothese durch hierarchische Suche MOOSE-Chem2:通过等级搜索探索探索精密科学假设发现时的LLM限度 2505.19209v1 -
1129 05-25 SpeakStream: Streaming Text-to-Speech with Interleaved Data SpeakStream: Streaming von Text-zu-Speech mit interleaved Daten 语音Stream:用断开数据流流流文本到语音 2505.19206v1 -
1130 05-25 Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety Gutartige Proben sind wichtig! Feinabstimmung auf gutartigen Ausreißer-Proben bricht die Sicherheit stark 良性样本很重要!在离群良性样本上微调会严重破坏安全性 2505.06843v2 -
1131 05-25 SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis SimpleDeepSearcher: Deep Information Suche via Web-Powered Reasoning Trajektorie Synthesis SimpleDeepSearcher:通过网络动力理性轨迹合成寻求深度信息 2505.16834v2 -
1132 05-25 iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use iTool: Verstärkte Feinsteuerung mit dynamischer Kalibrierung bei fortgeschrittenem Werkzeugeinsatz iTool:加强先进工具使用动态缺乏度校准的精细测试 2501.09766v4 -
1133 05-25 A partition cover approach to tokenization Eine Partition Abdeckung Ansatz tokenization 分区覆盖模式 2501.06246v2 -
1134 05-25 Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection Irreführung durch Inkonsistenz: Ein Maßstab für politische Inkonsistenzen 以不一致性导致的错误领导:政治不一致调查基准 2505.19191v1 -
1135 05-25 LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling LIMOPro: Verfeinerung für effizientes und effektives Skalieren von Testzeiten LIMOPro: 为高效率和高成效测试时间的缩放而改进理由 2505.19187v1 -
1136 05-25 Talk to Your Slides: Language-Driven Agents for Efficient Slide Editing Sprechen Sie mit Ihren Folien: Sprachgetriebene Agenten für effizientes Dia-Editing 访问您的幻灯片: 用于高效幻灯片编辑的语言驱动器 2505.11604v3 -
1137 05-25 DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation DiTAR: Diffusion Transformer Autoregressive Modellierung für Sprachgenerierung DiTAR: 发声的传播变异器自动递减模型 2502.03930v3 -
1138 05-25 Position: Enough of Scaling LLMs! Lets Focus on Downscaling Position: Genug von Scaling LLMs! Konzentriert sich auf Downscaling 位置: 缩放 LLM 已经足够! 让我们集中关注缩放缩放 2505.00985v3 -
1139 05-25 Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models Scaling Reasoning, Losing Control: Bewertung von Instruktionen in großen Reasoning-Modellen 扩大理由、减少控制:根据大理由模型评价指示 2505.14810v2 -
1140 05-25 Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge Assistant-Guided Milderung von Lehrerpräferenz Bias in LLM-as-a-Judge 助理引导减轻LLM-as-a-Judge中的教师偏好偏差 2505.19176v1 -
1141 05-25 Mixture of Lookup Experts Mischung von Lookup-Experten 查找专家混合 2503.15798v2 -
1142 05-25 SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs SpokenNativQA: Mehrsprachige Alltagsfragen für LLMs SpokenNativQA: 每天多语种为LLM 询问 2505.19163v1 -
1143 05-25 CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter CORAL: Lerne konsistente Repräsentationen über mehrstufiges Training mit leichterem spekulativen Entwurfer CORAL: 利用轻型投机性起草者在多阶段培训中学习一致的代表性 2502.16880v3 -
1144 05-25 FISH-Tuning: Enhancing PEFT Methods with Fisher Information FISH-Tuning: Verbesserung der PEFT-Methoden mit Fisher Information FISH-Tuning:加强渔业信息PEFT方法 2504.04050v3 -
1145 05-25 Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs Sparse-to-Dense: Ein kostenloses Mittagessen für verlustfreies Beschleunigen des Videoverständnisses in LLMs 简单到感:免费午餐,促进无损失地加速视频理解,LLMM 2505.19155v1 -
1146 05-25 Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization Speech-FT: Zusammenführen vortrainierter und fein abgestimmter Sprachdarstellungsmodelle für Cross-Task-Verallgemeinerung Speech-FT: 合并的预先培训和经过精练发言代表模式,供跨任务一般化使用 2502.12672v2 -
1147 05-25 Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation Divide-Then-Aggregat: Eine effiziente Tool-Learning-Methode über parallele Tool-Invokation 分离后生成工具:通过平行工具使用使用效率高的工具学习方法 2501.12432v2 -
1148 05-25 Shifting AI Efficiency From Model-Centric to Data-Centric Compression Verlagerung der KI-Effizienz von der modell-zentralen zur daten-zentralen Komprimierung 将AI效率从示范目录转向数据中心压缩 2505.19147v1 -
1149 05-25 How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching? Wie wirkt sich eine Textvorverarbeitung auf die Ontologie aus? 文本预处理管道如何影响本体学同步匹配? 2411.03962v6 -
1150 05-25 Efficient Reasoning for LLMs through Speculative Chain-of-Thought Effiziente Begründung für LLMs durch spekulative Kette-of-Thought 通过投机性研究链的探索,提高LLMs的效率 2504.19095v2 -
1151 05-25 SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data SeRL: Selbstspiel-Verstärkungs-Lernen für große Sprachmodelle mit begrenzten Daten SeRL: 有限数据大语言模式自我强化学习 2505.20347v1 -
1152 05-25 Language Fusion for Parameter-Efficient Cross-lingual Transfer Sprachfusion für Parameter-Effizient Cross-lingual Transfer 参数有效跨语言转让语言融合 2501.06892v2 -
1153 05-25 Natural Language Generation from Visual Events: Challenges and Future Directions Natürliche Sprachgenerierung aus visuellen Veranstaltungen: Herausforderungen und Zukunftsrichtungen 从视觉活动中产生自然语言:挑战和未来方向 2502.13034v2 -
1154 05-25 Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks Latent-Space-Adversarial-Training mit post-aware Kalibrierung zur Verteidigung großer Sprachmodelle gegen Jailbreak-Angriffe 为防御大型语言模式以防范越狱袭击而进行后天校准的后备空间对抗性培训 2501.10639v2 -
1155 05-25 RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models Alles abrufen: Ein mehrsprachiger, benannter Entity-Erkennungs-Rahmen mit großen Sprachmodellen 检索全部:多语种实体识别框架,带有大语言模式 2505.19128v1 -
1156 05-25 MMATH: A Multilingual Benchmark for Mathematical Reasoning MMATH: Mehrsprachiger Benchmark für mathematische Vernunft MMATH: 数学理由多语种基准 2505.19126v1 -
1157 05-25 Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models Mehrsprachige Ethische Bias: Der MSQAD mit statistischen Hypothesentests für große Sprachmodelle 跳入多语言伦理比喻:高语言模型统计假设测试的MSQAD 2505.19121v1 -
1158 05-25 Controlling Language Confusion in Multilingual LLMs Sprachkonfusion in mehrsprachigen LLMs kontrollieren 多语种LMM中控制语言混杂 2505.19116v1 -
1159 05-25 Is Compression Really Linear with Code Intelligence? Ist Kompression wirklich linear mit Code Intelligence? 压缩真的有代码情报线条吗? 2505.11441v2 -
1160 05-25 Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering Selbstkritische iterative Begründung für Multi-Hop-Fragebeantwortung 多点问答问题解答自创性指导性迭代理由 2505.19112v1 -
1161 05-25 Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling Verwandeln von Müll in Schatz: Beschleunigen von Inferenzen von großen Sprachmodellen mit Token-Recycling 将垃圾垃圾变成宝库:加快使用 Tok 回收利用大语言模型的推论 2408.08696v3 -
1162 05-25 MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset MALAMUTE: Ein multilingualer, hochgranularer, musterloser, bildungsbasierter Probing-Datensatz 多种语文、高语种、无模版、以教育为基础的探测数据集 2412.10105v2 -
1163 05-25 CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models CCHall: Ein neuartiger Benchmark für gemeinsame Cross-Lingual- und Cross-Modal Halluzinationen Detection in großen Sprachmodellen CCHall:在大语言模型中联合跨语言和跨模式幻觉探测新基准 2505.19108v1 -
1164 05-25 WHISTRESS: Enriching Transcriptions with Sentence Stress Detection WHISTRESS: Anreicherung von Transkriptionen mit Satz-Stress-Erkennung WHISTRESS: 利用句子重音检测丰富转录 2505.19103v1 -
1165 05-25 ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning ASPO: Adaptive Sentence-Level-Preference-Optimierung für eine feinkörnige multimodale Begründung ASPO: 调整性判决一级优惠优化有偿多模式理由 2505.19100v1 -
1166 05-25 AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios Beschwerdesache: Ein Datensatz und Benchmark für zivilrechtliche Beschwerdeszenarien 上诉案例:民事案件上诉设想情况数据集和基准 2505.16514v2 -
1167 05-25 ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models ReadBench: Vermessen der Dichte an Text Visuelle Lesefähigkeit von Vision-Sprachen-Modellen ” 阅读 “ :衡量视觉-语言模型的阅读能力 2505.19091v1 -
1168 05-25 Towards Harmonized Uncertainty Estimation for Large Language Models Hin zu einer harmonisierten Ungewissheitsschätzung für große Sprachmodelle 争取为大语言模式统一不确定性估算 2505.19073v1 -
1169 05-25 Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors Training Turn-by-Turn Prüfer für Dialog-Tutoring-Agenten: Der seltsame Fall von LLMs als Ihre Coding Tutoren 对话教学代理培训转弯验证员培训:LLMs作为你的编码导师的好奇案例 2502.13311v3 -
1170 05-25 UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models UNCERTAINTY-LINE: Längeninvariante Schätzung der Unsicherheit für große Sprachmodelle UNCERTAINTY-LINE: 大语言模型不确定性的长度不变估计 2505.19060v1 -
1171 05-25 An Embarrassingly Simple Defense Against LLM Abliteration Attacks Eine erschreckend einfache Verteidigung gegen LLM-Abliterationsangriffe 一种令人尴尬的简单防御 对付LLM 缩写攻击 2505.19056v1 -
1172 05-25 Efficient Data Selection at Scale via Influence Distillation Effiziente Datenauswahl auf Scale durch Einflussdestillation 通过影响蒸馏在规模上高效数据选择 2505.19051v1 -
1173 05-25 SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models SliM-LLM: Salience-getriebene Mixed-Precision-Quantisierung für große Sprachmodelle SliM-LLM:大语言模型的盐度驱动混合精度量 2405.14917v2 -
1174 05-25 PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs PII-Scope: Eine umfassende Studie über Trainingsdaten PII-Extraktionsangriffe in LLMs PII-范围:关于培训数据的综合研究 2410.06704v2 -
1175 05-25 Domain Adaptation of Foundation LLMs for e-Commerce Domain-Anpassung der Stiftung LLMs für e-Commerce 用于电子商务的 “ 基础基础改造 “ 领域改造 2501.09706v3 -
1176 05-25 Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models Speech-IFEval: Bewertung von Instruktions-Following und Quantifying Katastrophic Forgetting in Speech-Aware Language Models 语言-语言语言评估:评价在语言-语言软件模型中遵守教学和量化灾难性遗忘的情况 2505.19037v1 -
1177 05-25 DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models DiffPO: Diffusion-gestylte Preference-Optimierung zur effizienten Inferenz-Zeit-Ausrichtung großer Sprachmodelle DiffPO: 大语言模式有效推论-时间协调最佳优化 2503.04240v3 -
1178 05-25 SQUiD: Synthesizing Relational Databases from Unstructured Text SQUiD: Synthese von relationalen Datenbanken aus unstrukturiertem Text SQUiD: 从无结构文本中合成关系数据库 2505.19025v1 -
1179 05-25 Fractured Chain-of-Thought Reasoning Zersplitterte Kette von nachdenklichen Gründen 断裂的思维链推理 2505.12992v2 -
1180 05-25 AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale AM-Thinking-v1: Die Grenzen der Vernunft auf 32B-Skala verbessern AM- Thinking-v1: 推进32B级的理性前沿 2505.08311v2 -
1181 05-25 Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis Ausbildung nichtlinearer Transformer für den Schlussfolgerungsketten-of-Thought: Eine theoretische Generalisierungsanalyse 培训非线性非线性变换器,用于研究链推论:理论一般分析 2410.02167v3 -
1182 05-25 CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language CrosGrpsABS: Cross-Attention über syntaktische und semantische Graphen zur aspektbasierten Sentimentanalyse in einer Sprache mit geringem Ressourcenbedarf CrosGrpsABS:对用于低源语言频谱感应分析的同步和语义图的交叉注意 2505.19018v1 -
1183 05-25 Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection 共同-DWG:多模式进攻性攻击物质探测联合加速维维维-韦兹交织和专家混合 2505.19010v1 -
1184 05-25 AAAR-1.0: Assessing AI’s Potential to Assist Research AAAR-1.0: Bewertung des Potenzials von KI zur Unterstützung der Forschung AAAR-1.0:评估人工智能协助研究的潜力 2410.22394v4 -
1185 05-25 VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization VerIPO: Pflege der langen Vernunft in Video-LLMs über die iterative Politikoptimierung von Prüfern VERIPO:通过验证和研究的迭代政策优化在视频LLMs中培养长期理由 2505.19000v1 -
1186 05-25 Visual Program Distillation with Template-Based Augmentation Visuelle Programmdestillation mit Template-basierter Augmentation 利用基于模板的增量进行视觉程序蒸馏 2412.08564v3 -
1187 05-25 FiLLM – A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM) FiLLM – Ein philippinisch optimiertes Large Language Model auf Basis von Southeast Asia Large Language Model (SEALLM) FILLM – – 基于东南亚大语言模型的菲律宾最佳大语言模型(SEALM) 2505.18995v1 -
1188 05-25 Reinforcement Learning for Reasoning in Large Language Models with One Training Example Verstärktes Lernen zur Vernunft in großen Sprachmodellen mit einem Trainingsbeispiel 采用 “ 一个培训实例 “ 采用大语言模式强化学习 2504.20571v2 -
1189 05-25 LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts LLMs kennen ihre Schwachstellen: Enthüllen Sie Sicherheitslücken durch natürliche Verteilungsverschiebungen LLM知道自身的脆弱性:通过自然分布变化揭示安全差距 2410.10700v2 -
1190 05-25 One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models One-for-All Pruning: Ein universelles Modell zur kundenspezifischen Kompression großer Sprachmodelle ” 一为普普普 “ :大语言模式定制压缩通用模式 2505.12216v2 -
1191 05-25 Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers Automatisierte Vertrauenswürdigkeit Oracle Generation für Machine Learning Text Klassifikatoren 机械学习文字分类的自动可信赖性甲骨文生成 2410.22663v4 -
1192 05-25 STRICT: Stress Test of Rendering Images Containing Text STRICT: Stresstest von Rendering-Bildern mit Text STRICT: 含有文字的图像的显示压力测试 2505.18985v1 -
1193 05-25 LLMScan: Causal Scan for LLM Misbehavior Detection LLMScan: Kausalscan zur Erkennung von LLM-Missverhalten LLMScan:用于LLM Misbehavavor探测的成因扫描 2410.16638v4 -
1194 05-25 AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models AI4Math: Ein nativer spanischer Benchmark für mathematische Grundlagenforschung auf Universitätsebene in großen Sprachmodellen AI4Math:关于大语言模式中大学一级数学原因的土著西班牙基准 2505.18978v1 -
1195 05-25 PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues PersuasiveToM: Ein Benchmark für die Bewertung der Maschinentheorie des Geistes in überzeugenden Dialogen PersuasiveToM:在有影响的对话中评价心理机器理论的基准 2502.21017v2 -
1196 05-25 Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE Wird architektonische Komplexität überbewertet? Wettbewerbsfähige und interpretierbare Wissensgraphenvervollständigung mit RelatE 建筑复杂程度是否被高估了? 2505.18971v1 -
1197 05-25 Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study Untersuchung der Inferenzzeitskalierung für die Kette multimodaler Gedanken: Eine Vorstudie 多式联运思维链调查推理-时间尺度:初步研究 2502.11514v2 -
1198 05-25 MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models MoLAE: Mischung aus latenten Experten für Parameter-Effiziente Sprachmodelle MoLAE:参数有效语言模型原始专家混合 2503.23100v2 -
1199 05-25 BriLLM: Brain-inspired Large Language Model BriLLM: Gehirninspiriertes Large Language Model BrILLM: 脑启发型大语言模式 2503.11299v4 -
1200 05-25 Learning to Explain: Prototype-Based Surrogate Models for LLM Classification Erklären lernen: Prototypenbasierte Surrogate-Modelle für die LLM-Klassifikation 学习解释:LLM分类原型代用模型 2505.18970v1 -
1201 05-25 Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning Nicht alle Gedanken werden gleich erzeugt: Effizientes LLM-Reasoning durch Multi-Turn-Verstärkung-Lernen 并非所有思想都产生平等:通过多发强化学习提高学习水平的效率LLM 2505.11827v2 -
1202 05-25 SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms SynapticRAG: Verbesserung des Temporalen Gedächtnisses in großen Sprachmodellen durch synaptische Mechanismen 辛亚蒂克拉戈:通过辛亚机制加强大语言模型中的时间内存检索 2410.13553v2 -
1203 05-25 Expansion Span: Combining Fading Memory and Retrieval in Hybrid State Space Models Expansion Span: Kombinieren von Fading Memory und Retrieval in Hybrid State Space Models 扩展空间:在混合国家空间模型中将平缓内存和检索合并 2412.13328v2 -
1204 05-25 GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples GraphemeAug: Ein systematischer Ansatz zur Synthese von schwer negativen Keyword-Spotting-Beispielen GraphemeAug:以系统方法合成硬负负关键词 2505.14814v2 -
1205 05-25 Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk? KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig? 评估用于金融的AI:AI在评估投资风险方面是否可信? 2505.18953v1 -
1206 05-25 BnMMLU: Measuring Massive Multitask Language Understanding in Bengali BnMMLU: Maßgebendes Multitasking-Sprachverständnis in Bengalen messen BnMMLU:用孟加拉语衡量大规模多任务语言理解 2505.18951v1 -
1207 05-25 The Price of Format: Diversity Collapse in LLMs Der Preis des Formats: Diversity Collapse in LLMs 格式价格:多样化在LLMM中崩溃 2505.18949v1 -
1208 05-25 Veracity Bias and Beyond: Uncovering LLMs’ Hidden Beliefs in Problem-Solving Reasoning Veracity Bias and Beyond: LLMs versteckten Glauben an Problemlösungen enthüllen Veracity Bias 及以后:在解决问题的理由中揭穿LLMs的隐藏的信仰 2505.16128v2 -
1209 05-25 NovelSeek: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification NovelSeek: Wenn Agent zum Wissenschaftler wird – das geschlossene Loop-System von der Hypothese zur Verifikation NovelSeek:当特工成为科学家时 – – 建立从假设到核查的闭线系统 2505.16938v2 -
1210 05-25 MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems MetaMind: Modellierung menschlicher sozialer Gedanken mit Metakognitiven Multi-Agenten-Systemen MetaMind:模拟人类社会思想与代认知多机构系统 2505.18943v1 -
1211 05-25 Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages Denken Sie außerhalb der Daten: Koloniale Biasen und systemische Probleme in automatisierten Moderationspipelines für Low-Resource-Sprachen 《在数据之外思考:低资源语言自动调控管道中的殖民二进制和系统问题》 2501.13836v2 -
1212 05-25 AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments AgentClinic: ein Multimodal Agent Benchmark zur Bewertung von KI in simulierten klinischen Umgebungen AgentClinic:在模拟临床环境中评价AI的多式联运代理商基准 2405.07960v5 -
1213 05-25 Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding? Fließendes, aber kulturell Fernes: Kann regionales Training kulturelles Verständnis lehren? 流利但文化疏远:区域培训能够教授文化理解吗? 2505.21548v1 -
1214 05-25 REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing REACT: Darstellungsextraktion und kontrollierbares Tuning zur Überwindung der Überlastung in LLM-Wissensbearbeitung REACT: 在LLM知识编辑中,通过代表提取和控制可控的提款以克服超额配置 2505.18933v1 -
1215 05-25 Can Large Language Models Infer Causal Relationships from Real-World Text? Können große Sprachmodelle Kausalbeziehungen aus Real-World Text ableiten? 大语言模型能否从真实世界文本中推断出因果关系? 2505.18931v1 -
1216 05-25 Meta-aware Learning in text-to-SQL Large Language Model Meta-aware Lernen im Text-zu-SQL-Großsprache-Modell 以文本到SQL大语言模式进行多读学习 2505.18929v1 -
1217 05-25 iAgent: LLM Agent as a Shield between User and Recommender Systems iAgent: LLM Agent als Shield zwischen Anwender- und Recommender-Systemen iAgent:LLM代理作为用户与建议系统之间的盾牌 2502.14662v3 -
1218 05-25 SCRum-9: Multilingual Stance Classification over Rumours on Social Media SCRum-9: Mehrsprachige Stance-Klassifizierung über Gerüchte in sozialen Medien SCRUM-9:社会媒体多语言流闻的多语言分级 2505.18916v1 -
1219 05-25 Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach Multimodale LLMs unter Verteilungsverschiebungen verstehen: Ein informationstheoretischer Ansatz 在分销变更下理解多式LLMs:信息理论方法 2502.00577v2 -
1220 05-24 (6) Federated Retrieval-Augmented Generation: A Systematic Mapping Study Federated Retrieval-Augmented Generation: Eine systematische Mapping-Studie 联邦回收回源代:系统绘图研究 2505.18906v1 -
1221 05-24 Building a Functional Machine Translation Corpus for Kpelle Aufbau eines funktionalen Übersetzungskorpus für Kpelle 为Kpelle建立功能机器翻译公司 2505.18905v1 -
1222 05-24 Algorithmic Language Models with Neurally Compiled Libraries Algorithmische Sprachmodelle mit neurally compiled Bibliotheken 具有神经编译图书馆的算法语言模型 2407.04899v2 -
1223 05-24 StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos StandUp4AI: Ein neuer multilingualer Datensatz für Humorerkennung in Stand-up Comedy Videos StandUp4AI:一套新的多语种数据集,用于在跳跳喜剧视频中探测湿度 2505.18903v1 -
1224 05-24 Do LLMs have a Gender (Entropy) Bias? Haben LLMs ein Gender (Entropie) Bias? LLMs是否有性别(Entropy)偏见? 2505.20343v1 -
1225 05-24 Vague Knowledge: Evidence from Analyst Reports Vague Knowledge: Beweise aus Analystenberichten 知识模糊:分析报告提供的证据 2505.12269v3 -
1226 05-24 Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding Warum Vision Language Models mit visueller Arithmetik kollidieren? Auf dem Weg zu einem verbesserten Chart und Geometrie-Verständnis 为什么愿景语言模型与视觉自算斗争? 争取强化图表和几何理解 2502.11492v3 -
1227 05-24 CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions CRMArena-Pro: Ganzheitliche Bewertung von LLM-Agenten über unterschiedliche Geschäftsszenarien und Interaktionen CRMARENA-Pro: 不同业务情景和相互作用的LLM代理机构综合评估 2505.18878v1 -
1228 05-24 Evaluating Step-by-step Reasoning Traces: A Survey Bewertung Schritt-für-Schritt-Reasoning-Traces: Eine Umfrage 评价逐步说明理由的追踪:调查 2502.12289v2 -
1229 05-24 Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing Sci-LoRA: Mischung aus wissenschaftlichen LoRAs für Cross-Domain Lay Paraphrasing Sci-LORA:将科学LORA混合起来,用于跨域地谱图谱绘制 2505.18867v1 -
1230 05-24 Token Sampling Uncertainty Does Not Explain Homogeneity Bias in Large Language Models Token Sampling Uncertainty erklärt Homogenität Bias nicht in großen Sprachmodellen 在大语言模型中抽样抽样的不确定性不能解释同性比重 2501.19337v2 -
1231 05-24 Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework Audio Jailbreak Attacks: Aufdecken von Schwachstellen in SpeechGPT in einem White-Box-Framework 音频破室袭击:在白箱框架内揭露语音中的弱点GPPT 2505.18864v1 -
1232 05-24 Writing Like the Best: Exemplar-Based Expository Text Generation Schreiben wie das Beste: exemplar-based expository text generation 写作像最佳的:基于实例的展示性文本生成 2505.18859v1 -
1233 05-24 Large Language Models based ASR Error Correction for Child Conversations Große Sprachmodelle basierende ASR-Fehlerkorrektur für Kindergespräche 基于大语言模型的ASR大语言模型 2505.16212v2 -
1234 05-24 USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations USDC: Ein Datensatz von $\underline{U}$ser $\underline{S}$tance und $\underline{D}$ogmatism in langen $\underline{C}$onversations USDC: 长 $\underline{C}$onversations 中 $\underline{U}$ser $\underline{S}$tance 和 $\underline{D}$ogmatism 的数据集 2406.16833v2 -
1235 05-24 Inference Compute-Optimal Video Vision Language Models Schlussfolgerung Compute-Optimal Video Vision Language Models 计算视频视觉语言模型 2505.18855v1 -
1236 05-24 Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation Smoothie: Glättende Diffusion auf Token-Embeddings für Textgenerierung 滑滑: 平滑的文本生成时用 Token 嵌入嵌入嵌入器进行传播 2505.18853v1 -
1237 05-24 On the Limit of Language Models as Planning Formalizers An der Grenze von Sprachmodellen als Planungsformalisatoren 关于作为规划正规化机构的语言模式限制 2412.09879v3 -
1238 05-24 Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning Führt Reasoning Bias ein? Eine Studie über soziale Bias Evaluation und Milderung in LLM Reasoning 是否有理由引入偏见? 社会偏见评估和减轻LLM理由研究 2502.15361v2 -
1239 05-24 Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework Signal, Bild oder Symbolisch: Die beste Eingangsdarstellung für Elektrokardiogramm-Sprachenmodelle durch ein einheitliches Framework erkunden 信号、图像或符号:通过统一框架探索电动心电图语言模型的最佳输入代表 2505.18847v1 -
1240 05-24 Multi-Party Conversational Agents: A Survey Multi-Parteien-Gesprächsagenten: Eine Umfrage 多党对话代表:调查 2505.18845v1 -
1241 05-24 Don’t Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Nicht nur einmal suchen: Auf dem Weg zu multimodaler interaktiver Reasonierung mit selektiver visueller Revisitation 不要只看一次: 走向多模式互动理性, 选择性视觉再审视 2505.18842v1 -
1242 05-24 Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization Identifikation von legalen Holdings mit LLMs: Eine systematische Studie über Leistung, Maßstab und Erinnerung 利用LLM识别法律裁决要旨:系统研究业绩、规模和记忆 2505.02172v3 -
1243 05-24 On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization Auf die Wirkung des negativen Gradienten in der Gruppe Relative Tiefenverstärkung Optimierung 对群体相对深强化优化中的负梯度效应的影响 2505.18830v1 -
1244 05-24 Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance Vision trifft auf Sprache: Ein RAG-gesteigertes YOLOv8-Framework für Kaffeekrankheitsdiagnose und Farmer Assistance 视觉遇见语言:一个RAG增强的YOLOv8咖啡疾病诊断和农民援助框架 2505.21544v1 -
1245 05-24 AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting AdaCtrl: Auf dem Weg zur adaptiven und kontrollierbaren Begründung über Schwierigkeits-Bewusst-Budgeting AdaCtrl:通过困难意识预算编制实现适应和控制性合理理由 2505.18822v1 -
1246 05-24 Preference Leakage: A Contamination Problem in LLM-as-a-judge Bevorzugte Leckage: Ein Kontaminierungsproblem im LLM-as-a-Richter 优先渗漏:LLM-作为法官的LLM中的污染问题 2502.01534v2 -
1247 05-24 MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation MAPLE: Verbesserung der Review Generation mit Multi-Aspect Prompt Learning in erklärbarer Empfehlung MAPLE: 在可解释建议中以多角度迅速和迅速的分解方式加强审查的产生 2408.09865v2 -
1248 05-24 From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v1 -
1249 05-24 ReviewEval: An Evaluation Framework for AI-Generated Reviews ReviewEval: Ein Bewertungsrahmen für KI-generierte Bewertungen ReviewEval:人工智能生成评审的评估框架 2502.11736v3 -
1250 05-24 A generalised editor calculus (Short Paper) Eine generalisierte Editorrechnung (Short Paper) 通用编辑器微积分( 短纸) 2505.18778v1 -
1251 05-24 Disentangling Knowledge Representations for Large Language Model Editing Entwirren von Wissensdarstellungen für die Bearbeitung von großen Sprachmodellen 分散大语言模式编辑的知识代表 2505.18774v1 -
1252 05-24 Attacking Vision-Language Computer Agents via Pop-ups Angriff auf Vision-Sprache Computer-Agenten über Pop-ups 通过弹出式攻击视觉语言计算机代理器 2411.02391v2 -
1253 05-24 Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset Auf dem Weg zu einer emotional konsistenten textbasierten Sprachredaktion: Einführung von EmoCorrector und dem ECD-TSE-Datensatz 面向情感上一致的文本语音编辑:介绍EmoCorrictor和ECD-TSE数据集 2505.20341v1 -
1254 05-24 Towards an automatic method for generating topical vocabulary test forms for specific reading passages Auf dem Weg zu einer automatischen Methode zur Generierung aktueller Vokabular-Testformulare für bestimmte Lesepassagen 建立一个自动方法,为特定阅读段落制作专题词汇测试表 2505.18762v1 -
1255 05-24 How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark Wie wird LLM-Reasoning vom irrelevanten Kontext abgelenkt? Eine Analyse mit einem kontrollierten Benchmark LLM 为何被不相关背景所忽略? 2505.18761v1 -
1256 05-24 Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection Weniger scharfe Optimierung von Sensordaten mit großen Sprachmodellen: Eine Fallstudie zur Ermüdungserkennung 利用大语言模型对传感器数据使用高语言模型的微小最优化:关于Fatigue探测的案例研究 2505.18754v1 -
1257 05-24 Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning Vereinheitlichen von Aufmerksamkeitsköpfen und Task-Vektoren über versteckte Zustandsgeometrie im In-Context-Lernen 通过内文学习中隐藏状态几何几何,统一关注负责人和任务矢量 2505.18752v1 -
1258 05-24 An Illusion of Progress? Assessing the Current State of Web Agents Eine Illusion des Fortschritts? Bewertung des aktuellen Zustands der Web-Agenten 进展幻影? 评估网络代理目前的状况 2504.01382v3 -
1259 05-24 LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges LogicCat: Ein Chain-of-Thought-Text-to-SQL-Benchmark für Multi-Domain-Reasoning-Herausforderungen LogicCat:多领域合理性挑战的 “ 探索链 “ 文本到SQL基准 2505.18744v1 -
1260 05-24 Interpretable Company Similarity with Sparse Autoencoders Interpretierbare Firmenähnlichkeit mit Sparse Autoencodern 与Sparse Autoencolders 相似 2412.02605v3 -
1261 05-24 Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen 语言模型中强化研究链理由的特征采掘和指南 2505.15634v2 -
1262 05-24 ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search ReGUIDE: Dateneffizientes GUI Grounding über räumliche Vernunft und Suche 数据高效界面:通过空间理性和搜索进行数据高效界面定位 2505.15259v2 -
1263 05-24 Demonstration Selection for In-Context Learning via Reinforcement Learning Demonstrationsauswahl für das In-Context-Lernen mittels Verstärkungs-Lernen 通过强化学习,通过强化学习,选入内文学习的示范 2412.03966v2 -
1264 05-24 Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking Zuckerbeschichtetes Gift: Benign Generation entsperrt LLM Jailbreaking 食糖毒物:善后一代解锁 LLM 监狱破解 2504.05652v2 -
1265 05-24 Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson’s Disease Classifiers Bewertung der Nützlichkeit nicht-diagnostischer Sprachdaten für die Entwicklung von Parkinson-Krankheitsklassifikatoren 评价发展帕金森病分级器的非诊断性语音数据的用处 2505.18722v1 -
1266 05-24 Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization Optimales Transport-basiertes Token-Gewichtungssystem für verbesserte Preference-Optimierung 增强优惠优化的优化运输托肯加权计划 2505.18720v1 -
1267 05-24 Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer Neurale Parameter Suche nach schlankeren Modellen und besserer Übertragung 搜索细微精制模型和更好传输的神经参数 2505.18713v1 -
1268 05-24 Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models Dynamische Manifold Evolutionstheorie: Modellierung und Stabilitätsanalyse latenter Repräsentationen in großen Sprachmodellen 动态操纵动态进化理论:大语言模型中前代代表的建模和稳定分析 2505.20340v1 -
1269 05-24 What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations Worum geht es dabei? Ein Video-zu-Text-Zusammenfassungsdatensatz für wissenschaftliche Präsentationen 这是在谈论什么?一个用于科学演示的视频到文字汇总数据集 2502.08279v4 -
1270 05-24 Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla Verbesserung der Bangla-Linguistik: Fortgeschrittene LSTM-, Bi-LSTM- und Seq2Seq-Modelle zur Übertragung von Sylheti auf moderne Bangla 改进孟加拉语言:高级LSTM、Bi-LSTM和Seq2Seqeq 将Sylheti转换为现代孟加拉语的模式 2505.18709v1 -
1271 05-24 A General Knowledge Injection Framework for ICD Coding Ein allgemeiner Wissenseinspritzrahmen für ICD Coding ICD 编码一般知识输入框架 2505.18708v1 -
1272 05-24 OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis OpenOmni: Advancing Open-Source Omnimodale große Sprachmodelle mit progressiver multimodaler Ausrichtung und Echtzeit-Self-Aware-Emotional Speech-Synthese OpenOmni:推进开放源码全现代大语言模式,采用渐进式多模式调整和实时自觉情感言语合成 2501.04561v5 -
1273 05-24 Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task Auf dem Weg zur semantischen Integration von Meinungen: Einheitliche Meinungskonzepte Ontologie und Extraktionsaufgabe 争取在语义上综合各种意见:统一意见概念的本体学和采掘业任务 2505.18703v1 -
1274 05-24 Assessing the Capability of LLMs in Solving POSCOMP Questions Bewertung der Fähigkeit von LLM bei der Lösung von POSCOMP-Fragen 评估LLLMs在解决POSCOMP问题方面的能力 2505.20338v1 -
1275 05-24 Benchmarking and Rethinking Knowledge Editing for Large Language Models Benchmarking und Rethinking Knowledge Editing für große Sprachmodelle 大语言模式知识编辑基准制定和重新思考 2505.18690v1 -
1276 05-24 A statistically consistent measure of semantic uncertainty using Language Models Ein statistisch konsistentes Maß semantischer Unsicherheit mittels Sprachmodellen 使用语言模式统计一致的语义不确定性计量 2502.00507v3 -
1277 05-24 Large Language Models in the Task of Automatic Validation of Text Classifier Predictions Große Sprachmodelle in der Aufgabe der automatischen Validierung von Textklassifikatoren Vorhersagen 文本分类自动验证任务中的大语言模型 2505.18688v1 -
1278 05-24 From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation Von der Generation zur Erkennung: Ein multimodaler Multi-Task-Datensatz zum Benchmarking von Gesundheitsmissinformationen 从产生到检测:用于确定健康错误信息基准的多式联运多任务数据集 2505.18685v1 -
1279 05-24 TULUN: Transparent and Adaptable Low-resource Machine Translation TULUN: Transparente und anpassungsfähige Maschinelle Übersetzung mit geringer Ressource TULUN: 透明和可调适的低资源机器翻译 2505.18683v1 -
1280 05-24 $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models $PD^3F$: Ein steckbares und dynamisches DoS-Defense-Framework gegen Angriffe auf den Ressourcenverbrauch $PD3F$:针对大语言模式的针对资源消费攻击的可渗透和动态的多斯防御框架 2505.18680v1 -
1281 05-24 Safety in Large Reasoning Models: A Survey Sicherheit in großen vernünftigen Modellen: Eine Umfrage 大理由模型中的安全性:调查 2504.17704v3 -
1282 05-24 Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts Sozialgut oder wissenschaftliche Neugier? Entdeckung der Forschung hinter NLP-Artefakten 社会良好还是科学好奇? 发现NLP艺术作品背后的研究阵形 2505.18677v1 -
1283 05-24 IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery IRIS: Interaktives Forschungs-Ideierungssystem zur Beschleunigung der wissenschaftlichen Entdeckung IRIS:加速科学发现交互式研究标志系统 2504.16728v2 -
1284 05-24 Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps Kann MLLMs mich nach Hause führen? Eine Benchmark-Studie zur feinkörnigen visuellen Vernunft von Transit Maps MLLMM MLLM 指导我回家吗? 关于过境地图的精美视觉依据基准研究 2505.18675v1 -
1285 05-24 Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models Cross-Lingual Pitfalls: Automatisches Probieren von Cross-Lingual-Schwächen bei mehrsprachigen großen Sprachmodellen 跨语言空洞:多种语言大语言模式的自动试探跨语言弱点 2505.18673v1 -
1286 05-24 MOSLIM:Align with diverse preferences in prompts through reward classification MOSLIM: Mit verschiedenen Präferenzen in Aufforderungen durch Prämienklassifizierung ausrichten MOSLIM:通过奖励分类与各种偏好保持一致 2505.20336v1 -
1287 05-24 Language Model Distillation: A Temporal Difference Imitation Learning Perspective Sprachmodell Destillation: Ein zeitlicher Unterschied Imitation Lernperspektive 语言模型蒸馏:时间差异差异模拟学习视角 2505.20335v1 -
1288 05-24 Little Data, Big Impact: Privacy-Aware Visual Language Models via Minimal Tuning Little Data, Big Impact: Datenschutzerklärung Visual Language Models via Minimal Tuning Little Data, Big impact: 通过最小图案生成的隐私-软件视觉语言模型 2405.17423v3 -
1289 05-24 ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation ChartGalaxy: Ein Datensatz für Infografik Chart Verstehen und Generieren 图表银河:用于了解和生成信息图表的数据集 2505.18668v1 -
1290 05-24 Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics Robustheit in großen Sprachmodellen: Eine Umfrage zu Mitigationsstrategien und Evaluationsmetrics 大语言模式的强强力:减轻战略调查和评价 2505.18658v1 -
1291 05-24 Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change Climate-Eval: Ein umfassender Benchmark für NLP-Aufgaben im Zusammenhang mit dem Klimawandel Climate-Eval:与气候变化相关的NLP任务综合基准 2505.18653v1 -
1292 05-24 On the Emergence of Linear Analogies in Word Embeddings Zur Entstehung linearer Analogien in Word-Embeddings 单线模拟在文字嵌入中的出现 2505.18651v1 -
1293 05-24 Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study Kann das Prompting von LLMs die Erkennung von Hassrede über Sprachen hinweg ermöglichen? Eine Zero-Shot- und Few-Shot-Studie 提示LLM能否实现跨语言仇恨言论检测?零样本与少样本研究 2505.06149v3 -
1294 05-24 Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data Dateneffiziente Hate Speech-Erkennung durch Cross-Lingual Nearchbor Retrieval mit limitierten beschrifteten Daten 通过带有有限标签数据的跨近近邻检索检索数据有效仇恨言论检测 2505.14272v2 -
1295 05-24 SEW: Self-Evolving Agentic Workflows for Automated Code Generation SEW: Selbst-evolvierende Agentische Workflows für die automatisierte Codegenerierung SEW:自动代码生成的自演动态制剂工作流程 2505.18646v1 -
1296 05-24 Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving Verbesserung der Verallgemeinerung von sprachgroßen Sprachmodellen mit Multi-Task Behavior Imitation und Speech-Text Interleaving 加强具有多任务行为模拟和语音文本互换功能的语音大语言模式的通用化 2505.18644v1 -
1297 05-24 Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster Skip-Thinking: Chain-of-Thought-Destillation ermöglicht kleinere Sprachmodelle besser und schneller zu begründen 跳过思考: 切入式深思熟虑的蒸馏链让更小的语言模型更好、更快地使用 2505.18642v1 -
1298 05-24 Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees Multi-Step Alignment als Markov Games: Ein optimaler Online-Gradient-Abstieg mit Konvergenzgarantien 作为Markov运动会的多步对齐:带有一致保障的乐观的在线逐渐递增人种方法 2502.12678v2 -
1299 05-24 Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query Lookahead Q-Cache: Konsistentere KV-Cache-Eviktion durch Pseudo-Abfrage LOSAhead Q-Cache : 通过 Pseudo 查询实现 KV 更一致的 CAche 切除 2505.20334v1 -
1300 05-24 DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation DDO: Dual-Decision-Optimierung durch Multi-Agent-Kollaboration für LLM-basierte medizinische Beratung DDO:通过多方机构协作,优化基于LLM的医疗咨询的双重决定 2505.18630v1 -
1301 05-24 Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models Multi-Scale Manifold Alignment: Ein einheitliches Framework zur besseren Erklärbarkeit großer Sprachmodelle 多规模工作人员配置对齐:提高大语言模式解释性的统一框架 2505.20333v1 -
1302 05-24 HARP: Hesitation-Aware Reframing in Transformer Inference Pass HARP: Hezitation-Aware Reframing in Transformer Inferenz Pass HARP: 变压器推断通过中的偏移-软件重新配置 2412.07282v2 -
1303 05-24 Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models Empirische Bewertung der Wissensdestillation von Transformern zu subquadratischen Sprachmodellen 从变异器到次赤道语言模式的知识提炼经验评估 2504.14366v2 -
1304 05-24 Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? Können LLM-Wasserzeichen die unautorisierte Destillation von Wissen wirksam verhindern? LLM Watermarks能否强有力地防止未经授权的知识蒸馏? 2502.11598v2 -
1305 05-24 Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation LLMs能否协助其模糊性? 量化评估关于 “ Word Sense Disanderation “ 的各种大语言模型。 2411.18337v3 -
1306 05-24 MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation MAVL: Ein mehrsprachiger Audio-Video-Text Datensatz für animierte Song-Übersetzung MAVL: 动动歌曲翻译多语种视听歌词数据集 2505.18614v1 -
1307 05-24 PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs PM-KVQ: Progressive Mixed-Precision KV Cache Quantization für Long-CoT LLMs PM-KVQ: 长 CoT LLMs 的渐进混合精度 KV 缓存量 2505.18610v1 -
1308 05-24 Flex-Judge: Think Once, Judge Anywhere Flex-Richter: Denken Sie einmal, Richter überall 灵活法官:想一想,法官 2505.18601v1 -
1309 05-24 SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals SMI: Ein informationstheoretisches Metric zur Vorhersage von Modellwissen ausschließlich aus Vorschulungssignalen SMI:从培训前信号中单独预测模型知识的信息理论计量方法 2502.04066v3 -
1310 05-24 Safety Alignment via Constrained Knowledge Unlearning Sicherheitsausrichtung durch eingeschränktes Wissen Unlernen 通过受限制的知识实现安全协调 2505.18588v1 -
1311 05-24 Model Extrapolation Expedites Alignment Modell Extrapolation Expeditionen Ausrichtung 模型外推快速调整 2404.16792v4 -
1312 05-24 Removal of Hallucination on Hallucination: Debate-Augmented RAG Aufhebung der Halluzination auf Halluzination: Debatte-erweiterte RAG 在幻觉中去除幻觉:辩论增强的RAG 2505.18581v1 -
1313 05-24 Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs Steigerung der Effizienz und Exploration bei der Stärkung des Lernens für LLMs 提高LLMM 强化学习的效率和探索 2505.18573v1 -
1314 05-24 ReflectDiffu:Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework ReflectDiffu: Reflect zwischen emotional-intent Ansteckung und Mimicry für Empathetic Response Generation über ein RL-Diffusion Framework 反省:通过RL-扩散框架,对情感-情感内聚变和Mmimimicry之间的反射,以便产生同情性反应 2409.10289v3 -
1315 05-24 From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test Von Wort zu Welt: Bewertung und Mitigate Kultur Bias via Word Association Test 从Word到世界:通过Word协会试验评价和消化文化偏见 2505.18562v1 -
1316 05-24 TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation TAG-INSTRUCT: Controlled Instruction Complexity Enhancement durch strukturbasierte Augmentation TAG-INSTRSUCT:通过基于结构的增强增强控制性教学复杂度 2505.18557v1 -
1317 05-24 Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation Erforschung der Vulnerabilität der Content Moderation Guardrail in großen Sprachmodellen durch Intent Manipulation 通过意向操纵探索大语言模型中内容调节保护栏的脆弱性 2505.18556v1 -
1318 05-24 Unraveling Misinformation Propagation in LLM Reasoning Entwirrung der Ausbreitung von Fehlinformationen im LLM-Reasoning 揭示LLM推理中的错误信息传播 2505.18555v1 -
1319 05-24 MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors MSA bei BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning für die multidimensionale Bewertung von LLMs als Math Tutoren BEA 2025年BEA管理事务管理事务协议 共同任务:对作为数学导师的LLMs进行多种不同类型评价的 2505.18549v1 -
1320 05-24 Composable Cross-prompt Essay Scoring by Merging Models Composable Cross-prompt Essay Scoring by Merging Models 通过合并模型进行可合成的跨速化 ESS Scay Scorporing 2505.18548v1 -
1321 05-24 B-score: Detecting biases in large language models using response history B-Score: Voreingenommenheit in großen Sprachmodellen anhand der Antworthistorie erkennen B-序号:利用回应历史在大型语言模型中发现偏见 2505.18545v1 -
1322 05-24 Unearthing Large Scale Domain-Specific Knowledge from Public Corpora Großes Domain-Spezifisches Wissen aus der öffentlichen Corpora entschlüsseln 从公共企业中挖掘出大型大型域域特定知识 2401.14624v4 -
1323 05-24 Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Verbesserung des Verständnisses auf Zeichenebene in LLMs durch Token Internal Structure Learning 通过Token内部结构学习增强LLM的字符级理解 2411.17679v4 -
1324 05-24 NoveltyBench: Evaluating Language Models for Humanlike Diversity NoveltyBench: Sprachmodelle für die menschliche Vielfalt bewerten 新闻:评价促进人类多样性的语言模式 2504.05228v3 -
1325 05-24 Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models Verstärkte Feinsteuerungskräfte, die die Fähigkeit multimodaler großer Sprachmodelle begründen 多种多式大语言模式能力的理由 2505.18536v1 -
1326 05-24 InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models InftyThink: Die Längengrenzen der Langkontext-Reasoning in großen Sprachmodellen durchbrechen 思考:在大语言模式中打破长句理由的长度限制 2503.06692v3 -
1327 05-24 SMART: Self-Aware Agent for Tool Overuse Mitigation SMART: Self-Aware Agent für Tool Overuse Mitigation SMART: 减少工具过度使用自智能剂 2502.11435v2 -
1328 05-24 metaTextGrad: Automatically optimizing language model optimizers metaTextGrad: Sprachmodell-Optimierer automatisch optimieren setudeTextGrad: 自动优化语言模型优化器 2505.18524v1 -
1329 05-24 How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation Wie beeinflusst Sequence-Modellierung Architektur Basisfähigkeiten von vortrainierten Sprachmodellen? Erforschen von Schlüsselarchitektur-Design-Prinzipien zur Vermeidung von Basisfähigkeiten Degradation 如何按序列模拟结构模型模拟培训前语言模型的建筑影响基础能力? 探索重要建筑设计原则,以避免基础能力退化 2505.18522v1 -
1330 05-24 AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking AcuRank: Ungewissheits-Bewusst-Adaptive-Computation für Listwise-Reranking AcuRank: 列表排序的不确定性- 软件适应性计算 2505.18512v1 -
1331 05-24 EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents EscapeBench: Auf dem Weg zu mehr kreativer Intelligenz von Sprachmodell-Agenten 逃避:努力推进语言示范代理的创意智能 2412.13549v2 -
1332 05-24 Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection Gruppenadaptive Schwellenoptimierung für robuste KI-generierte Texterkennung 强力AI-发光的文本探测的集团-适应性阈值优化 2502.04528v4 -
1333 05-24 Knowledge Grafting of Large Language Models Wissen Graften von großen Sprachmodellen 大语言模式知识转让 2505.18502v1 -
1334 05-24 UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models UGPhysics: Umfassender Benchmark für Undergraduate Physics Reasoning mit großen Sprachmodellen 动脉物理学:具有大语言模型的本科物理原因综合基准 2502.00334v3 -
1335 05-24 ACECODER: Acing Coder RL via Automated Test-Case Synthesis ACECODER: Acing Coder RL über automatisierte Test-Case-Synthese 通过自动测试-案件综合合成检索编码器 RL 2502.01718v4 -
1336 05-24 The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen 机器的实用思维:追踪大语言模式中实用能力的出现 2505.18497v1 -
1337 05-24 FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers FuseGPT: Lernbare Ebenen Fusion generativer vortrainierter Transformer FuseGPT: 训练前改造器的产生型先导变异器的可学习层融合 2411.14507v2 -
1338 05-24 TextArena TextArena TextArenna 文本 2504.11442v2 -
1339 05-24 AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents AgentOccam: Eine einfache, aber starke Basis für LLM-basierte Web-Agenten AgentOccam:基于LLM的网络代理的简单而有力的基线 2410.13825v2 -
1340 05-24 ADEPT: A DEbiasing PrompT Framework ADEPT: Ein abschreckendes PrompT-Framework ADEPT: 减少偏见的促进促进框架 2211.05414v3 -
1341 05-24 Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications Synchronisieren und Anpassen von Fehlerkorrekturdaten für mobile Großsprachen-Modellanwendungen 合成和调整移动大语言模型应用错误校正数据 2505.18488v1 -
1342 05-24 AI Idea Bench 2025: AI Research Idea Generation Benchmark AI Idea Bench 2025: Benchmark für die Generierung von KI-Forschungsideen AI Idea Bench 2025:AI研究想法生成基准 2504.14191v3 -
1343 05-24 GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data? GeoGrid-Bench: Können Stiftungsmodelle multimodale gegrittete Geo-Raumdaten verstehen? GeoGrid-Bench:基础模型能够理解多式网格地球空间数据吗? 2505.10714v2 -
1344 05-24 Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark Pädagogik-R1: Pädagogisch ausgerichtetes Reasoning-Modell mit ausgewogenem Bildungs-Benchmark 教育-R1:具有平衡教育基准的教学统一理由模型 2505.18467v1 -
1345 05-24 Measuring South Asian Biases in Large Language Models Messung südasiatischer Biasen in großen Sprachmodellen 衡量大语言模式中的南亚偏见 2505.18466v1 -
1346 05-24 From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data Von Reddit zur Generativen KI: Bewertung großer Sprachmodelle für Angstunterstützung Feinabstimmung auf Social Media-Daten 从改编到创创AI:评估社会支助大语言模式,对社会媒体数据进行微调 2505.18464v1 -
1347 05-24 Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning Selbst-GIVE: assoziatives Denken aus begrenztem strukturiertem Wissen für erweiterte Großsprachenmodell-Reasoning 自用自用:从有限的结构化知识中进行联合思考,以强化大语言模式解释理由 2505.15062v2 -
1348 05-24 Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales Verbesserte multimodale Aspect-Based-Sentiment-Analyse durch LLM-generierte Rationale 由LLM-Generered Rationsales公司进行的增强型多式多式频谱感应分析 2505.14499v2 -
1349 05-24 Accelerating Large Language Model Reasoning via Speculative Search Beschleunigen des Large Language Model Reasoning durch spekulative Suche 通过投机搜索加速大语言示范理由 2505.02865v2 -
1350 05-24 TokenSkip: Controllable Chain-of-Thought Compression in LLMs TokenSkip: Steuerbare Ketten-of-Thought-Kompression in LLMs TokenSkip: LLMM 中可控制的尝试链压缩 2502.12067v2 -
1351 05-24 Anchored Diffusion Language Model Verankertes Diffusions-Sprachenmodell 原成品的传播语言模式 2505.18456v1 -
1352 05-24 Hybrid Latent Reasoning via Reinforcement Learning Hybride Latent Reasoning durch Stärkungslernen 通过强化学习找出原因 2505.18454v1 -
1353 05-24 MedScore: Factuality Evaluation of Free-Form Medical Answers MedScore: Faktizitätsbewertung von Freiform-medizinischen Antworten 医疗核心:对免费形式医疗答案的实情评估 2505.18452v1 -
1354 05-24 $μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts $μ$-MoE: Test-Time Pruning als Mikro-Grained Mixture-of-Experts 美元-MoE:作为微粒混合剂专家进行试验时休整 2505.18451v1 -
1355 05-24 BRIT: Bidirectional Retrieval over Unified Image-Text Graph BRIT: Bidirektionale Retrieval über Unified Image-Text Graph BRIT: 统一图像文字图的双向检索 2505.18450v1 -
1356 05-24 Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model Nutzung von Online-Daten zur Verbesserung des medizinischen Wissens in einem kleinen persischen Sprachmodell 在小型波斯语言模式中利用在线数据加强医疗知识 2505.16000v2 -
1357 05-24 Efficient Long CoT Reasoning in Small Language Models Effiziente Long CoT-Reasoning in kleinen Sprachmodellen 低语言模式中有效的长期计算成本理由 2505.18440v1 -
1358 05-24 Voice of a Continent: Mapping Africa’s Speech Technology Frontier Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v1
Article 0
Title@2025-05-29 (4): From Chat Logs to Collective Insights: Aggregative Question Answering
Title: From Chat Logs to Collective Insights: Aggregative Question Answering | Von Chat Logs zu Collective Insights: Aggregative Question Answering | 从聊天日志到集体透视:聚合问题解答 2505.23765v1 |
Authors: Wentao Zhang, Woojeong Kim, Yuntian Deng
Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
由大型语言模型(LLMs)驱动的交汇代理机构正在迅速成为我们日常互动的有机组成部分,产生前所未有的对话数据数量。这类数据集为社会利益、趋势话题和集体关注提供了强大的透镜。然而,现有方法通常将这些互动视为独立和缺乏从大规模对话日志的汇总和推理中可能产生的关键洞察力。在本文中,我们引入了聚合问题回答,这是一项新颖的任务,要求模型明确解释数千个用户-聊天机器人互动,以解答聚合问题,例如确定特定人口群中新出现的关切问题。为了能够进行这方面的研究,我们建立了一个基准,即WildChat-AQA,由182,330个实时聊天室对话产生的6,027个汇总问题组成。实验表明,现有的方法要么是试图有效解释,要么是产生令人望而望而却望而却步的计算成本,这突出表明需要采取新的方法,能够从大规模对话数据中获取集体见解。
Article 1
Title@2025-05-29 (4): MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Title: MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | MMSI-Bench: Ein Benchmark für Multi-Image-Spatial-Intelligence | MMSI-Bench:多图像空间智能基准 2505.23764v1 |
Authors: Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI’s o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .
空间智能对于在复杂物理世界中运行的多模态大型语言模型(MLLM)至关重要。但现有基准只探究单图像关系,因此无法评估现实世界部署所需要的多图像空间推理。我们引入了MMSI-Bench,这是一个专用于多图像空间智能的VQA基准。6名3D视觉研究人员花费300多小时,从120,000多张图像中精心设计了1,000道具有挑战性、无歧义的多项选择题,每道题都配有精心设计的干扰项和逐步推理过程。我们进行了广泛的实验,全面评估了34个开源和专有MLLM,观察到巨大的差距:最强的开源模型达到约30%的准确率,OpenAI的o3推理模型达到40%,而人类得分为97%。这些结果突显了MMSI-Bench的挑战性以及未来研究的巨大提升空间。利用标注的推理过程,我们还提供了一个自动错误分析流程,可诊断四种主要失败模式,包括:(1) 定位错误,(2) 重叠匹配和场景重建错误,(3) 情境转换推理错误,(4) 空间逻辑错误,为推进多图像空间智能提供了宝贵的见解。项目页面:https://runsenxu.com/projects/MMSI_Bench 。
Article 2
Title@2025-05-29 (4): ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Title: ZeroGUI: Automating Online GUI Learning at Zero Human Cost | ZeroGUI: Automatisieren des Online-GUI-Lernens zu null menschlichen Kosten | 零GUI: 实现零人成本在线用户界面学习自动化 2505.23762v1 |
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUIs) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
大型视觉语言模型(VLM)的快速发展推动了纯视觉GUI代理的开发,这类代理能够感知和操作图形用户界面(GUI),自主完成用户指令。然而,现有方法通常采用离线学习框架,面临两个核心限制:(1) 元素定位和动作监督严重依赖高质量的人工标注,(2) 对动态和交互式环境的适应能力有限。为解决这些限制,我们提出ZeroGUI,这是一个可扩展的在线学习框架,用于以零人力成本实现GUI代理训练的自动化。具体而言,ZeroGUI整合了(一) 基于VLM的自动任务生成,从当前环境状态产生多样的训练目标,(二) 基于VLM的自动奖励估计,在没有手工设计评估函数的情况下评估任务是否成功,以及(三) 两阶段在线强化学习,以持续与GUI环境交互并从中学习。在两个先进的GUI代理(UI-TARS和Aguvis)上的实验表明,ZeroGUI显著提升了在OSWorld和AndroidLab环境中的性能。代码见 https://github.com/OpenGVLab/ZeroGUI。
Article 3
Title@2025-05-29 (4): Differential Information: An Information-Theoretic Perspective on Preference Optimization
Title: Differential Information: An Information-Theoretic Perspective on Preference Optimization | Differentialinformation: Eine informationstheoretische Perspektive zur Preference-Optimierung | 差别信息:关于首选优化的信息理论观点 2505.23761v1 |
Authors: Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.
直接偏好优化(DPO)已成为以监督方式使语言模式与人类偏好相一致的一种标准技术。尽管它取得了经验上的成功,但其日志-鼠标奖励参数的理论理由仍然不完整。在这项工作中,我们通过使用差异信息分布(DID):在象征性序列上分配,捕捉政策更新过程中获得的信息。首先,我们表明,当偏爱标签将将参考政策转化为目标政策所需的差异信息编码成一个目标政策时,DPO的正轨偏差奖励将成为通过偏好优化学习目标政策的独特最佳形式。这自然产生一种封闭式的表达形式,用于最佳抽样分布,而不是被拒绝的答复。第二,我们发现,对差异信息进行编码的偏好与一个隐含的假设从根本上联系在一起,即对在政策更新政策更新过程中广泛使用的政策偏差分配。最后,我们通过分析数据变现的精度,我们如何学习低偏差的视角加强了政策分布,而高偏差信息则带来一种顺畅的效果,这解释了结果的正统化的理论性分析结果,同时,我们学习了我们关于数据流化数据流化的演化的理论性分析,从而验证了我们的数据。
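The log-ratio reward mentioned in the abstract above can be made concrete with a small sketch. It uses the standard DPO formulation, in which the implicit reward of a response is beta times the difference between the policy and reference log-probabilities, and the loss on a preference pair is the negative log-sigmoid of the reward margin. The function names, the default `beta`, and the use of plain sequence log-probabilities as inputs are assumptions made for illustration; the paper's Differential Information Distribution analysis is not reproduced here.

```python
import math


def log_ratio_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_policy_chosen: float,
    logp_ref_chosen: float,
    logp_policy_rejected: float,
    logp_ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log sigmoid(reward margin)."""
    margin = log_ratio_reward(logp_policy_chosen, logp_ref_chosen, beta) - log_ratio_reward(
        logp_policy_rejected, logp_ref_rejected, beta
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)), i.e. softplus(-m).
    return softplus(-margin)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities, chosen only to exercise the code.
    print(dpo_loss(-1.0, -2.0, -3.0, -2.0))
```

When the policy equals the reference, both rewards are zero and the loss is log 2; the loss falls below that value as the policy raises the chosen response relative to the rejected one.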
Article 4
Title@2025-05-29 (4): Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint
Title: Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint | Puzzlet von Puzzles: Wenn Vision-Language-Modelle keinen Hinweis aufnehmen können | 由谜题拼取的谜题: 当视觉语言模型无法使用提示时 2505.23759v1 |
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues (“head” over “heels”). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
通过图像、空间安排和符号替代将语言编码成像的Rebus 拼图、视觉拼图、视觉拼图,对当前的视觉语言模型(VLM)构成了独特的挑战。 与传统的图像字幕或问答任务不同,变复解决需要多式抽象、象征性推理以及掌握文化、语音和语言标语。 在本文中,我们调查当代VLMs通过构建一个手动生成的和附加注释的多种英语复交拼图的基准来解释和解决变现拼图的能力,从简单的图像替代到空间依赖的提示(“头”到“耳目 ” 。 我们分析了不同的VLMs是如何运作的,我们的发现表明,虽然VLMs在解码简单的视觉线索方面表现出一些惊人的能力,但是他们与需要抽象推理、横向思维和理解视觉比喻的任务进行了巨大的斗争。
Article 5
Title@2025-05-29 (4): DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
Title: DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning | DeepTheorem: Verbesserung des LLM-Reasonings für Theorembeweise durch natürliche Sprache und Reinforcement Learning | DeepTheorem:通过自然语言和强化学习推进LLM的定理证明推理 2505.23754v1 |
Authors: Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs’ strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem’s potential to fundamentally advance automated informal theorem proving and mathematical exploration.
然而,传统的自动理论验证(ATP)方法严重依赖正式验证系统,这些系统与培训前获得的非正式、自然语言知识所产生的LLM的强度不相符。在这项工作中,我们提议建立一个综合的非正式理论验证框架,利用自然语言加强LLM数学推理。Deep理论包括一个大型基准数据集,由121K高质量的海事组织高级别非正式理论和证据组成,涵盖不同的数学领域,严格说明正确性、困难性和主题类别,并配以系统构建的可核查的理论变体。我们设计了一个新的强化学习战略(RL-Zero),明确针对非正式理论验证,利用经核实的理论变体来鼓励强大的数学推理。此外,我们提出了全面的结果和流程评价指标,检查证据的正确性和推理步骤的质量。广泛的实验分析表明,DeepThere大大改进了LLM理论的准确性、难度和主题类别,同时有系统构建的可核实的理论变式变体。我们设计了一个新的强化学习战略(RL-Zero),明确针对非正式理论验证,利用经核实的变体来鼓励可靠的数学推理,并监督地展示了我们现有的数学推理学。
Article 6
Title@2025-05-29 (4): ATLAS: Learning to Optimally Memorize the Context at Test Time
Title: ATLAS: Learning to Optimally Memorize the Context at Test Time | ATLAS: Optimales Erlernen des Kontextes zur Testzeit | ATLAS: 学习在测试时最充分记住上下文 2505.23735v1 |
Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, limit their applicability to longer sequences and so have motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) the online nature of the update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To improve all three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.
变异器被确定为在序列建模方面最受欢迎的骨干,这主要是因为其在文中检索任务和规模学习能力方面的效力。但是,它们的二次记忆和时间复杂性将它们的可应用性约束在较长的序列中,因此激发了研究人员探索有效的替代结构,如现代经常性神经网络(长期的经常性记忆模块 ) 。尽管他们最近在各种下游任务中取得了成功,但他们在需要长期理解和外推更长期序列的任务中挣扎。我们注意到,这些缺陷来自设计的三个脱节方面:(1) 内存能力有限,受内存和输入的经常准确度绘图结构所约束;(2) 在线更新的性质,即优化仅对最后输入的记忆;(3) 固定记忆的不那么直观管理。为了加强所有这三个方面,我们介绍了ATLAS,一个长期记忆模块,它具有很强的能力,通过根据当前和过去的标志优化记忆,克服了长期记忆模型的在线背景,并改进了输入输入输入输入输入的内径精度的内径的内径性模型;(2) 在线更新了AAS,我们从历史结构中得出了一种普通的系统结构结构。
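The contrast the abstract draws between an online update (optimizing the memory only with respect to the last input) and an update based on the current and past tokens can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the ATLAS architecture: the memory is a plain linear map from a key vector to a scalar value, the objective is a squared error, and each step is one gradient-descent update. All names (`memory_step`, `run`, `window`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], float]  # (key vector, scalar value to memorize)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def window_loss(w: Sequence[float], pairs: Sequence[Pair]) -> float:
    """Squared recall error of the linear memory w over the given pairs."""
    return sum((dot(w, k) - v) ** 2 for k, v in pairs)


def memory_step(w: Sequence[float], pairs: Sequence[Pair], lr: float = 0.1) -> List[float]:
    """One gradient-descent step on window_loss with respect to w."""
    grad = [0.0] * len(w)
    for k, v in pairs:
        err = dot(w, k) - v
        for j, kj in enumerate(k):
            grad[j] += 2.0 * err * kj
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def run(stream: Sequence[Pair], window: int, lr: float = 0.1) -> List[float]:
    """Process a token stream; window=1 is the purely online update."""
    w = [0.0] * len(stream[0][0])
    for t in range(len(stream)):
        context = stream[max(0, t - window + 1) : t + 1]  # current and past tokens
        w = memory_step(w, context, lr)
    return w


if __name__ == "__main__":
    stream = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
    print(window_loss(run(stream, window=1), stream))  # online update
    print(window_loss(run(stream, window=2), stream))  # windowed update
```

On this two-token stream the windowed update keeps refining the first association while storing the second, so its recall error over the whole stream ends up lower than that of the online update.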
Article 7
Title@2025-05-29 (4): Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
Title: Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time | Begrenzte Rationalität für LLMs: Satisficing-Alignment zur Inferenzzeit | LLM的有限理性:推理时的满意化对齐 2505.23729v1 |
Authors: Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign’s performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
将大型语言模型与人类相匹配具有挑战性,因为偏好反馈具有内在的多面性。虽然现有办法通常将它视为一个多目标优化问题,但往往忽视人类实际决策的方式。关于约束性理性的研究表明,人类决策遵循了使主要目标达到最优化的讽刺战略,同时确保其他人达到可接受的阈值。为了缩小这一差距,落实讽刺性调整概念,我们提议SITAlign:一个推论时间框架,通过尽可能扩大一个主要目标,同时满足基于次要标准的门槛限制,解决调整的多方面性质。我们通过得出基于推断一致方法的我们讽刺性调整的亚最佳性界限,提供理论见解。我们通过在多个基准上进行广泛的实验,验证SITignal的绩效。例如,关于PKU-SafeRLHF数据集,主要目标是最大限度地发挥帮助作用,同时确保无害性的阈值,SITAlign超越了基于次级标准的多目标分解战略。我们用22.3%的比值来得出我们基于推断性调整的比值,同时坚持GPT-4的无损率。
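The satisficing rule in this abstract (maximize a primary objective subject to a threshold on a secondary one) can be sketched as a selection over candidate responses. This is a minimal illustration and not SITAlign itself: the Candidate type, the two score fields, and the fallback used when no candidate meets the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    helpfulness: float   # primary objective (hypothetical reward score)
    harmlessness: float  # secondary criterion (hypothetical reward score)


def satisficing_select(candidates, harmlessness_threshold):
    """Return the most helpful candidate among those meeting the threshold."""
    feasible = [c for c in candidates if c.harmlessness >= harmlessness_threshold]
    if feasible:
        return max(feasible, key=lambda c: c.helpfulness)
    # Assumption: if nothing meets the threshold, fall back to the safest candidate.
    return max(candidates, key=lambda c: c.harmlessness)


candidates = [
    Candidate("a", helpfulness=0.9, harmlessness=0.2),
    Candidate("b", helpfulness=0.7, harmlessness=0.8),
    Candidate("c", helpfulness=0.5, harmlessness=0.95),
]
chosen = satisficing_select(candidates, harmlessness_threshold=0.6)
```

Candidate "a" scores highest on helpfulness but is excluded by the threshold, which is the difference between satisficing and a weighted sum of the two scores.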
Article 8
Title@2025-05-29 (4): ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering
Title: ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering | ML-Agent: Verstärkung von LLM-Agenten für autonome Maschinenbautechnik | ML-代理:加强自动机械学习工程的LLM代理 2505.23723v1 |
Authors: Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen
The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.
大型语言模式(LLM)代理商的出现大大推动了自主机器学习(ML)工程的发展,然而,大多数现有方法都严重依赖人工快速工程,未能根据不同的实验经验进行适应和优化。我们第一次探索基于学习的代理ML模式,即一个LLM代理商利用在线强化学习(RL),通过互动实验学习ML任务。为了实现这一点,我们提议了一个新型的代理ML培训框架,由三个关键组成部分:(1) 探索性强化微调,使LLM代理商能够产生多种行动,加强RL探索;(2) 渐进式RL,使培训能够采取单一行动步骤,加快经验收集,提高培训效率;(3) 专门针对代理ML的奖励模块,将各种ML反馈信号整合成对RL优化的一致奖励。 利用这一框架,我们训练了ML-Agent,由7B规模的Qwen-2.5 LLM驱动,用于自主机器学习。值得注意的是,尽管仅在9个ML任务上训练,我们的7B规模ML-Agent仍优于671B规模的DeepSeek-R1代理。此外,它实现了持续的性能提升,并展现出出色的跨任务泛化能力。
Article 9
Title@2025-05-29 (4): Label-Guided In-Context Learning for Named Entity Recognition
Title: Label-Guided In-Context Learning for Named Entity Recognition | Labelgeführtes In-Context-Lernen für die benannte Entitätserkennung | 为识别命名实体进行Label-Guided InFincle 学习 2505.23722v1 |
Authors: Fan Bai, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze
In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
文本内学习(ICL) 使大型语言模型(LLMS) 能够仅使用少量演示即可执行新任务。 在命名的实体识别(NER) 中, 示范通常根据与测试实例相似的语义选择, 忽略了培训标签, 并导致不优化的性能。 我们引入了DEER, 这是一种新方法, 通过象征性的统计数据利用培训标签来提高ICL的性能。 DEER首先用一个标签引导的、 象征性的检索器加强示例选择, 以标签指导的、 象征性的检索器为实体识别信息最强的象征物。 然后它促使LLM 重新检视使用标签统计识别的易出错符号, 并进行有针对性的更正。 使用四种不同的 LLMS, DEER 持续地超越现有的ICL 方法, 并接近监管的微调的性能。 进一步的分析显示它对于被看到和看不见的实体的有效性, 及其在低资源环境中的强健性。
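One way to picture "token-level statistics from training labels" is to estimate, for each token, how often it appears inside an entity span, and to weight retrieval overlap by that rate. The sketch below is only an illustration under that assumption; the abstract does not give DEER's actual statistics or retriever, and the function names and toy data are hypothetical.

```python
from collections import Counter


def token_entity_rates(train_sentences):
    """train_sentences: list of (tokens, tags) with BIO-style tags, 'O' = outside."""
    total, in_entity = Counter(), Counter()
    for tokens, tags in train_sentences:
        for tok, tag in zip(tokens, tags):
            total[tok] += 1
            if tag != "O":
                in_entity[tok] += 1
    # Fraction of a token's occurrences that fall inside an entity span.
    return {tok: in_entity[tok] / total[tok] for tok in total}


def weighted_overlap(query_tokens, example_tokens, rates):
    """Score a demonstration by shared tokens, weighted by entity rate."""
    shared = set(query_tokens) & set(example_tokens)
    return sum(rates.get(tok, 0.0) for tok in shared)


train = [
    (["Paris", "is", "nice"], ["B-LOC", "O", "O"]),
    (["Paris", "Hilton", "is", "famous"], ["B-PER", "I-PER", "O", "O"]),
    (["Rome", "is", "old"], ["B-LOC", "O", "O"]),
]
rates = token_entity_rates(train)
query = ["Paris", "is", "big"]
scores = [weighted_overlap(query, tokens, rates) for tokens, _ in train]
```

Here the shared function word "is" contributes nothing, so demonstrations that share "Paris" with the query rank above the one that shares only "is".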
Article 10
Title@2025-05-29 (4): Length-Controlled Margin-Based Preference Optimization without Reference Model
Title: Length-Controlled Margin-Based Preference Optimization without Reference Model | Längengesteuerte Margenbasierte Preference-Optimierung ohne Referenzmodell | 无参考模型的优化 2502.14643v2 |
Authors: Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu
Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. By doing so, it mitigates probability degradation for both accepted and discarded responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at https://github.com/gengxuli/LMPO.
直接偏好优化是一种广泛采用的从人类反馈(RLHF)中学习以优惠为基础的强化优势的离线算法(DPO),目的是通过重新界定奖励功能来提高培训的简单性和稳定性,但是,DPO受到若干限制的阻碍,包括时间偏差、记忆效率低和概率退化。为了应对这些挑战,我们提议使用一个效率更高、更强的替代方法,即 “ 低偏差控制边际偏差优化(DPO) “ (DPO) 。LMOP推出一个统一的参考模型,作为DPO损失的上限,从而能够更准确地接近最初的优化目标。此外,还采用一个平均的log-概率优化战略,以尽量减少培训和推断阶段之间的差异。LMOPO的一项关键创新是其长期控制的基于边际的亏损功能,该功能在布拉德利特-Termy框架内整合。这个损失函数调节应对时间长度,同时扩大所偏好和被拒绝的产出之间的差。通过这样做,可以减少既被接受又被抛弃的对策的概率退化,从而大大限制现有方法。此外,我们评估LMOPO的偏好最佳偏好最佳选择方法。
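The abstract names three ingredients: no learned reference model, average log-probability, and a margin inside a Bradley-Terry objective. A generic loss with those ingredients can be sketched as below. This is not LMPO's loss, whose exact form the abstract does not state; the function name and the parameters beta and gamma are assumptions for illustration.

```python
import math


def length_normalized_margin_loss(logp_chosen, len_chosen,
                                  logp_rejected, len_rejected,
                                  beta=2.0, gamma=0.5):
    """Generic reference-free Bradley-Terry loss on average log-probabilities.

    logp_* are summed token log-probabilities of each response; len_* are
    token counts. gamma is a target margin between chosen and rejected.
    """
    avg_chosen = logp_chosen / len_chosen
    avg_rejected = logp_rejected / len_rejected
    margin = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Equal per-token likelihood and no target margin gives log(2).
neutral = length_normalized_margin_loss(-10.0, 10, -20.0, 20, beta=2.0, gamma=0.0)
# A clearly preferred response gives a smaller loss.
better = length_normalized_margin_loss(-5.0, 10, -20.0, 20, beta=2.0, gamma=0.0)
```

Dividing by length means a longer response is not favored or penalized merely for having more tokens, which is one common way to address length bias.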
Article 11
Title@2025-05-29 (4): Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
Title: Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models | Nehmen Sie nicht die Prämisse für gewährt: Bewertung der Premise Critique Fähigkeit von großen Sprachmodellen | 评估大语言模型的精密克里米亚能力 2505.23715v1 |
Authors: Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs’ reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs’ proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.
大型语言模型(LLMS)取得了迅速的进步,表现出了非凡的能力,但明显的脆弱性依然存在:LLMS经常不严格地接受有缺陷或相互矛盾的前提,导致低效率推理和不可靠的产出。这突出表明了LMS拥有“Premize Critique Aculity”)的重要性,LMS被定义为能够主动识别和阐明输入处错误的能力。大多数现有研究评估LLMS在理想环境中的推理能力,在面对有缺陷的前提时基本上忽视其脆弱性。因此,我们引入了“textbf{Premise Critical Cench (PCBench) } ,设计将四个错误类型纳入三个困难级别,并配有多面的评价指标。我们对15个LLMS进行了系统的系统评价。我们的调查结果表明:(1) 多数模型严重依赖明确的及时发现错误,而自主批评有限;(2) Premisecriction 能力取决于问题难度和错误类型,直接的矛盾比复杂或程序错误更容易检测;(3) 判断能力与前置的判断能力没有一贯联系;(4) 僵化的房地触发能力触发的判断能力,在反复的推理的推理基础上加强现有的推理的推理基础。
Article 12
Title@2025-05-29 (4): SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods
Title: SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods | SenWiCh: Sense-Annotation von Low-Resource-Sprachen für WiC mit Hybrid-Methoden | SenWiCH: 使用混合方法为无线电通信中心提供低资源语言的高级说明 2505.23714v1 |
Authors: Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky
This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
虽然跨语文转让是利用多语文预先培训的关键战略,将语言技术推广到研究不足和类型多样的语言,但其有效性取决于质量和适当基准。我们发布了包含多种语言、跨越不同语言家庭和文字的九种低资源语言的带感标记的新数据集。为了便利数据集的创建,本文件提出了一个明显有益的半自动说明方法。通过Wordin-C(WIC)格式化的实验展示了数据集的效用,这些实验评价了这些低资源语言的转让。结果突出表明了在低资源环境中建立和评估有针对性数据集对于有效的多语言脱钩和转移研究的重要性。发布数据集和代码的目的是支持对公平、稳健和真正多语言的NLP的进一步研究。
Article 13
Title@2025-05-29 (4): SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
Title: SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models | SocialMaze: Ein Benchmark für die Bewertung sozialer Vernunft in großen Sprachmodellen | 社会领域:用大语言模式评价社会原因的基准 2505.23713v1 |
Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen
Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model’s social reasoning ability - the capacity to interpret social contexts, infer others’ mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze
大型语言模型(LLMS)越来越多地应用于社会基础任务,如在线社区温和、媒体内容分析、社会推理游戏等。这些背景下的成功取决于模型的社会推理能力,即解释社会背景的能力、推断他人精神状态的能力,以及评估所提供信息的真实性。然而,目前还没有一个系统化的评价框架来全面评估LLMS的社会推理能力。现有的努力往往过于简化现实世界情景,包括一些过于基本的任务,无法挑战先进模型。为了缩小这一差距,我们引入了社会迷思,这是一个专门用来评价社会推理的新基准。社会迷思系统系统地包含了三个核心挑战:深刻推理、动态互动和信息不确定性。它提供了三种关键环境的六种不同任务:社会推理游戏、日常生活互动和数字社区平台。自动化和人力验证都用于确保LLMMS的社会推理能力。我们的评估揭示了几个关键洞察力:模型在处理动态互动和纳入时间变化中的信息的能力方面差异很大;具有强大思维链/思维推理的模型在需要更深的推理工作上比地更精确推理;社会推理可以大幅推理:深度推理:深度推理,在社会推理中,在现实中大幅推理中大幅地推理,我们在数据下,我们可以大幅地改进。
Article 14
Title@2025-05-29 (4): Neuro-symbolic Training for Reasoning over Spatial Language
Title: Neuro-symbolic Training for Reasoning over Spatial Language | Neuro-symbolisches Training zur Vernunft über räumliche Sprache | 以空间语言为借口的神经主义培训 2406.13828v3 |
Authors: Tanawan Premsri, Parisa Kordjamshidi
Spatial reasoning based on natural language expressions is essential for everyday human tasks. This reasoning ability is also crucial for machines to interact with their environment in a human-like manner. However, recent research shows that even state-of-the-art language models struggle with spatial reasoning over text, especially when facing nesting spatial expressions. This is attributed to not achieving the right level of abstraction required for generalizability. To alleviate this issue, we propose training language models with neuro-symbolic techniques that exploit the spatial logical rules as constraints, providing additional supervision to improve spatial reasoning and question answering. Training language models to adhere to spatial reasoning rules guides them in making more effective and general abstractions for transferring spatial knowledge to various domains. We evaluate our approach on existing spatial question-answering benchmarks. Our results indicate the effectiveness of our proposed technique in improving language models in complex multi-hop spatial reasoning over text.
以自然语言表达方式为基础的空间推理对于人类日常任务至关重要。这种推理能力对于机器以类似人类的方式与环境互动也至关重要。然而,最近的研究表明,即使是最先进的语言模型也与对文本的空间推理纠缠不休,特别是在面临嵌入空间表达方式时。这归因于没有达到普遍化所需的正确的抽象程度。为了缓解这一问题,我们提议以神经-侧翼技术培训语言模型,将空间逻辑规则作为制约,提供额外监督,以改进空间推理和问题回答。培训语言模型以遵守空间推理规则指导它们为将空间知识转移到不同领域创造更有效和一般的抽象性。我们评估了现有空间问题解答基准的方法。我们的结果表明,我们所提议的方法在改进复杂的多种空间推理文本的语言模型方面是有效的。
Article 15
Title@2025-05-29 (4): Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM’s Math Capability
Title: Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM’s Math Capability | Let’s Reason Formally: Natürlich-Formal Hybrid Reasoning verbessert LLMs Math Capability | 让我们正式解释一下: 自然-正规混合理由提高LLM的数学能力 2505.23703v1 |
Authors: Ruida Wang, Yuxin Li, Yi R. Fung, Tong Zhang
Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not presented in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce NL-FL HybridReasoning, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the HybridReasoning framework achieves 89.80% and 84.34% accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.
提高LLM的数学推理能力在数学界和计算机科学界都得到了极大的关注,最近的工作在自然语言(NL)推理和正式语言(FL)推理方面取得了长足进展,利用了基础模型中纯强化学习(RL)方法的潜力。然而,RL努力传授基础模型中未显示的新能力,强调需要有效地将FL等知识纳入NL数学推理。然而,由于NL和FL80基准之间在问题结构和推理格式上的内在差异,这种整合具有挑战性。为了应对这些挑战,我们引入了NL-FL混合校准**,这是旨在将FL专家纳入NL数学解决问题解决的端对端框架。为了弥合NL和FL格式的差距,我们提出了将NL-L问题重新纳入NL-L校准(QA)的问题作为NL. 884的标语标本。我们提供的“ML” 基线技术使得FL理性测试(NL) 和“OL”框架下的某些数字得以分别处理和“LL”推算。
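The idea of restating a question-answering problem as an existence theorem can be illustrated with a toy case in Lean 4. The problem and theorem name below are hypothetical and are not taken from the paper; the point is only that the witness supplied in the proof plays the role of the answer.

```lean
-- Hypothetical QA problem: "Which natural number x satisfies x + 2 = 5?"
-- Reformulated as an existence statement; the witness 3 is the answer.
theorem exists_answer : ∃ x : Nat, x + 2 = 5 :=
  ⟨3, rfl⟩
```

An answer-extraction step would then read the witness back out of a successful proof and report it in natural language.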
Article 16
Title@2025-05-29 (4): Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
Title: Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation | Kann LLMs abstrakt über Math Word Probleme ohne CoT? Entwirren Abstrakte Formulierung von Arithmetik Computation | 没有 CoT,LLMs 理学原理可以抽象地克服数学词问题吗? 2505.23701v1 |
Authors: Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung
Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
以最后答案为基础的衡量标准通常用于评价数学词问题的大型语言模型(LLMS),通常被作为推理能力的替代物。然而,这些衡量标准将两种截然不同的子技能混在一起:抽象配方(使用表达式获得数学关系)和算术计算(执行计算 ) 。通过对 GSM8K 和 SVAMP 进行分解的评估,我们发现,Llama-3 和 Quen2.5 (1B-32B) 的最终回答准确性在没有计算技术的情况下被绝大多数的算术步骤而不是抽象的配方步骤所制约。与共同的信念相反,我们显示COT 主要是在计算中的辅助手段,对抽象配方影响有限。从机械上看,我们表明,这两种技能甚至在单一的向前传递过程中,没有通过抽象的、当时的计算机制采取任何推理步骤,即模型首先捕捉问题抽象,然后处理计算。Causal 补法确认这些抽象的精度是存在的、可转让的、可折的、在计算之前的。这些行为和机械性结论结论结论显示,有必要进行分解的评估,以便准确评估。
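The split between abstract formulation and arithmetic computation can be sketched as two separate checks per problem. This is a simplified illustration, not the paper's evaluation code: the item fields are hypothetical, "predicted_expression" stands for the expression a model writes down, and "predicted_value" stands for the number it produces when asked to compute.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def safe_eval(expr):
    """Evaluate a +, -, *, / expression over numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def disentangled_scores(item, tol=1e-9):
    """Score formulation and computation independently for one problem."""
    try:
        # Formulation: is the expression right once evaluated exactly?
        formulation_ok = abs(safe_eval(item["predicted_expression"])
                             - item["gold_answer"]) < tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        formulation_ok = False
    # Computation: is the number the model produced itself right?
    computation_ok = abs(item["predicted_value"] - item["gold_answer"]) < tol
    return {"formulation": formulation_ok, "computation": computation_ok}


item = {"gold_answer": 18,
        "predicted_expression": "(16 - 3 - 4) * 2",
        "predicted_value": 17}
scores = disentangled_scores(item)
```

In this toy item the model set up the problem correctly but slipped on the arithmetic, a case that a final-answer metric alone would count simply as wrong.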
Article 17
Title@2025-05-29 (4): VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Title: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | VF-Eval: Bewertung multimodaler LLMs zur Erzeugung von Feedback auf AIGC-Videos | VF-Eval:评价多式LLMs,以生成对AIGC视频的反馈 2505.23693v1 |
Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
最近对MLLMS进行了广泛的研究,以解答视频问题。然而,大多数现有评估都侧重于自然视频,忽视合成视频,如AI产生的内容(AIGC ) 。与此同时,一些视频制作工作依靠MLLMs来评估所制作视频的质量,但MLLMs解释AIGC视频的能力仍然在很大程度上没有得到充分探讨。为了解决这个问题,我们提议了一个新的基准VF-Eval,即VF-Eval,它引入了四种任务一致性验证、错误意识、错误类型探测和推理评估,以全面评估AIGC 视频中MLLMs的能力。我们评估了13个VF-Eval的前沿MLLMs,发现即使是最优秀的模型GPT-4.1也难以在所有任务中取得一贯的良好业绩。这突出了我们基准的艰巨性。此外,为了调查VF-Eval在改进视频制作方面的实际应用,我们开展了一项实验,即RePrompt,以全面评价AIGC视频中MLMs的能力。我们评估了13个前沿MLLMs与人类反馈更加接近于视频制作。
Article 18
Title@2025-05-29 (4): Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Title: Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models | Kinderorientierte Sprache fördert nicht konsequent das Syntax-Lernen in Sprachmodellen | 在语言模式中促进语法学习 2505.23689v1 |
Authors: Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach similar syntactic abilities as LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
Huebner等人(2021年)的特有工作表明,语言模型(LMs)在英语儿童阅读语言(CDL)方面受过培训,能够达到与LMs在成人引导的书面文本数量大得多方面受过培训的类似综合能力,这表明CDL可以提供比通常使用的互联网浏览数据更有效的LM培训材料,然而,这些结果在语言、模式类型和评价设置方面的通用性仍然不清楚。我们通过将CDL对维基百科两个LM目标(虚假和因果)、三种语言(英文、法文、德文)和三种综合最低限度基准(英文、法文、德文)方面受过培训的模型进行比较来测试这一点。我们关于这些基准的结果显示CDL的效益不一致,在大多数情况下,这比维基百科模型的模型要差得多。我们随后找出了以前的基准中的各种缺点,并采用了一种新型的测试方法,即FIT-CLAMS,它使用频率控制的设计,以便能够在培训公司之间进行平衡的比较。我们通过最低限度的对口评价和回归分析来证明CDL培训不会产生更强的概括性,在获得同步分析和强调控制频率影响的能力的重要性。
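Minimal-pair evaluation, as used in the benchmarks mentioned above, counts how often a model assigns higher probability to the grammatical sentence of a pair than to its ungrammatical twin. The sketch below assumes a hypothetical log_prob callable that returns a sentence log-probability; the toy scores stand in for a real language model and are invented for the example.

```python
def minimal_pair_accuracy(log_prob, pairs):
    """pairs: list of (grammatical, ungrammatical) sentences.

    Returns the share of pairs where the grammatical sentence scores higher.
    """
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Toy stand-in for a language model's sentence log-probabilities.
toy_scores = {
    "The dogs bark.": -5.0,
    "The dogs barks.": -7.5,
    "The cat sleeps.": -6.2,
    "The cat sleep.": -6.0,
}
pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The cat sleeps.", "The cat sleep."),
]
accuracy = minimal_pair_accuracy(toy_scores.get, pairs)
```

Because the two sentences in a pair differ in word choice, raw word frequency in the training corpus can influence which one scores higher, which is the kind of confound a frequency-controlled design aims to remove.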
Article 19
Title@2025-05-29 (4): Automatic classification of stop realisation with wav2vec2.0
Title: Automatic classification of stop realisation with wav2vec2.0 | Automatische Klassifizierung der Stop-Umsetzung mit wav2vec2.0 | 以 wav2vec2. 0 自动分类停止实现时间 2505.23688v1 |
Authors: James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Jeff Mielke, Tyler Kendall
Modern phonetic research regularly makes use of automatic tools for the annotation of speech data; however, few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale up' the scope of phonetic research with relative ease.
现代语音研究经常利用自动工具来说明语音数据,但很少有工具可用于说明许多变异的语音现象。与此同时,经过预先训练的自我监督模型,如 wav2vec2.0 显示在语音分类任务和潜在编码细微语音信息方面表现良好。我们证明,可以对 wav2vec2.0 模型进行培训,以在英语和日语中以高度精确的方式自动分类停止爆发的存在,这种模式在精细和未准备好的语音公司之间都非常有力。停止实现的变化模式与自动说明相似,并紧随手语说明的模式。这些结果表明,经过训练的语音模型作为自动注解和处理语音材料数据的工具具有潜力,使研究人员能够相对容易地“逐步”扩大语音研究的范围。
Article 20
Title@2025-05-29 (4): GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Title: GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents | GSO: Herausfordernde Software-Optimierungsaufgaben zur Bewertung von SWE-Agenten | GSO:评估SWE-Agentics的有挑战的软件优化任务 2505.23671v1 |
Authors: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models’ capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
开发高性能软件是一项复杂的任务,需要专门知识。我们引入了GSO,这是评价语言模型开发高性能软件能力的基准。我们开发了一个自动管道,生成和执行绩效测试,以分析存储库,承诺历史查明10个代码库的102项挑战性优化任务,涵盖不同的领域和编程语言。向代理商提供了一个代码库和性能测试,作为精确的规格,并负责提高运行时间效率,以专家开发师的优化为衡量标准。我们的定量评估显示,领先的SWE-Agency 进行了巨大的斗争,取得了不到5%的成功率,即便在推论时间上也有有限的改进。我们的质量分析确定了关键的失败模式,包括使用低度语言的困难、采用懒惰性优化战略,以及在准确定位瓶颈方面存在的挑战。我们发布了基准的代码和工艺以及代理轨迹,以利今后的研究。
Article 21
Title@2025-05-29 (4): LoLA: Low-Rank Linear Attention With Sparse Caching
Title: LoLA: Low-Rank Linear Attention With Sparse Caching | LoLA: Low-Rank Lineare Aufmerksamkeit mit Sparse Caching | LoLA: 低兰克线性注意, 以粗糙的缓存 2505.23666v1 |
Authors: Luke McDermott, Robert W. Heath Jr., Rahul Parhi
Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives; however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to “memory collisions”. In this paper, we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.
以变换器为基础的大型语言模型在长序列的推论中具有二次复杂性。 线性关注方法是高效的替代方法, 但是它们无法提供精确的软体关注近似值。 此外, 线性关注方法通过将滑动窗口关注点纳入每个线性关注头, 这一差距可以因短期上下文任务而缩小。 不幸的是, 由于“ 模拟碰撞” , 这些方法无法回忆长背景下的重要信息 。 在本文中, 我们提议 LoLLAA: 低端线性关注, 且缓冲不小。 LoLA 单独存储了额外的关键值配对配对, 否则会干扰过去的关联记忆。 此外, LoLA 进一步缩小线性关注模型和变异器之间的差距, 将过去的键性值配对分配成三种记忆形式 :(i) 本地滑动窗口中的最近一对; (ii) 难以在“ 全球缓冲器” 中进行模拟的双对; (iii) 经常隐藏线性关注状态下的通用对配对。 一种只发光化策略, LoLSastA 能够让直截段段内上到8K- 直径B 的直径直径直线性操作的直径直线性操作, 直径直线性 A 直线性 A 直线性A 的直径直径直径直径直线性 A 直线性 A 直对 直径直径直径对 直对 直对 直对 直线性 A 。
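The three-way split of past key-value pairs described above can be sketched as a small routing structure. This is a toy illustration, not LoLA's implementation: the class and function names are hypothetical, no feature map or normalization is applied, and "difficult to memorize" is assumed here to mean a large squared error when the current recurrent state is asked to recall the value from its key.

```python
def outer_add(state, value, key):
    """Linear-attention style update: state[i][j] += value[i] * key[j]."""
    for i, v in enumerate(value):
        for j, k in enumerate(key):
            state[i][j] += v * k


def recall_error(state, key, value):
    """Squared error between the state's read-out for key and the true value."""
    pred = [sum(row[j] * key[j] for j in range(len(key))) for row in state]
    return sum((p - v) ** 2 for p, v in zip(pred, value))


class ThreeTierMemory:
    def __init__(self, dim_k, dim_v, window_size, cache_size):
        self.window = []  # (i) most recent pairs, kept exactly
        self.cache = []   # (ii) hard-to-memorize pairs, kept exactly
        self.state = [[0.0] * dim_k for _ in range(dim_v)]  # (iii) recurrent state
        self.window_size = window_size
        self.cache_size = cache_size

    def add(self, key, value):
        self.window.append((key, value))
        if len(self.window) <= self.window_size:
            return
        evicted = self.window.pop(0)
        candidates = self.cache + [evicted]
        # Assumption: rank by how badly the current state would recall each pair.
        candidates.sort(key=lambda kv: recall_error(self.state, kv[0], kv[1]),
                        reverse=True)
        self.cache = candidates[:self.cache_size]
        # Pairs that do not fit in the cache are folded into the recurrent state.
        for k, v in candidates[self.cache_size:]:
            outer_add(self.state, v, k)


mem = ThreeTierMemory(dim_k=2, dim_v=2, window_size=1, cache_size=1)
mem.add([1, 0], [1, 0])
mem.add([0, 1], [0, 2])
mem.add([1, 0], [3, 0])
```

After the three additions, the newest pair sits in the window, the pair with the larger recall error sits in the cache, and the remaining pair has been absorbed into the state.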
Article 22
Title@2025-05-29 (4): Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Title: Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models | Mehrsprachige Frage-Antworten in Low-Resource-Einstellungen: Ein Dzongkha-Englischer Benchmark für Stiftungsmodelle | 低资源环境下的多语言问题解答:基础模型的Dzongkha-英语基准 2505.18638v2 |
Authors: Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol
In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: https://github.com/kraritt/llm_dzongkha_evaluation.
在这项工作中,我们向不丹中、中学生提供DZEN,这是一套平行Dzongkha和英语测试问题的数据集。我们收集的超过5K个问题涉及各种科学课题,包括事实、应用和基于推理的问题。我们利用平行数据集测试一些大语言模型(LLMs),发现英文模型和Dzongkha模型之间存在显著的性能差异。我们还查看了不同的推动战略,发现“Cot”系统在推理问题方面起到了很好的作用,而事实问题则不那么好。我们还发现,添加英文译文提高了Dzongkha问题答复的准确性。我们的结果指出,为进一步研究提高Dzongkha和一般而言,低资源语言的LLM成绩,我们开辟了令人振奋的渠道。我们在https://github.com/kraritt/llm_dzukha_vivalation上公布了数据集:https://github. com/krant/llm_dzkha_view。
Article 23
Title@2025-05-29 (4): ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Title: ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions | ToolHaystack: Stress-Testing Tool-Augmented Language Models in realistischen Langzeit-Interaktionen | 工具 Haystack:现实长期互动中的压力测试工具增强语言模式 2505.23662v1 |
Authors: Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
大型语言模型(LLMs)在使用外部工具解决用户询问方面表现出很强的能力,然而,大多数现有评价都假定在短期内使用工具,在现实的长期互动中对模型行为有有限的洞察力。为了填补这一空白,我们引入了工具Haystack,这是测试长期互动工具使用能力的基准。工具Haystack的每个测试实例都包括多重任务执行背景和持续对话中的现实噪音,从而能够评估模型维持环境的程度和处理各种干扰。通过将这一基准应用于14个最先进的LMs,我们发现,虽然目前的模型在标准多转弯环境中表现良好,但在工具Haystack中往往挣扎很大,突出了以往工具基准没有揭示的长期稳健性方面的关键差距。
Article 24
Title@2025-05-29 (4): Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
Title: Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation | Aktives Layer-Kontrastives Decodieren reduziert Halluzination bei der Generierung von Großsprachenmodellen | 大型语言模式生成中活性多语言解层解码减少幻觉 2505.23657v1 |
Authors: Hongxiang Zhang, Hao Chen, Tianyi Zhang, Muhao Chen
Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
近期的解码方法通过改进生成过程中下一个词元(token)的选择方式来提高大型语言模型(LLMs)的事实性。这些方法通常在词元层面运作,利用内部表示来抑制表层模式。尽管如此,LLMs仍然容易产生幻觉,特别是在较长的上下文中。在本文中,我们提出主动层对比解码(ActLCD),这是一种新的解码策略,能够在生成过程中主动决定何时应用层对比。通过将解码视为一个序列决策问题,ActLCD采用由奖励感知分类器引导的强化学习策略,在词元层面之外优化事实性。我们的实验表明,ActLCD在五个基准上超越了最先进的方法,显示了其在多种生成场景中减轻幻觉的有效性。
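The layer-contrast idea in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the scoring rule (log-probability of the final layer minus that of an earlier layer) is the generic layer-contrastive form, the logits are made up, and the boolean `apply_contrast` is a hypothetical stand-in for the learned policy that decides when to contrast.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(final_logits, early_logits):
    # Score each token by log p_final(token) - log p_early(token).
    lf = log_softmax(final_logits)
    le = log_softmax(early_logits)
    return [f - e for f, e in zip(lf, le)]

def next_token(final_logits, early_logits, apply_contrast):
    # `apply_contrast` is an assumed placeholder for the policy's per-step decision.
    if apply_contrast:
        scores = contrastive_scores(final_logits, early_logits)
    else:
        scores = final_logits
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up logits over a 3-token vocabulary.
final_logits = [2.0, 1.9, 0.1]
early_logits = [2.5, 0.5, 0.1]

plain_choice = next_token(final_logits, early_logits, apply_contrast=False)
contrast_choice = next_token(final_logits, early_logits, apply_contrast=True)
print(plain_choice, contrast_choice)
```

In this toy case the early layer already strongly prefers token 0, so contrasting shifts the choice to token 1, whose probability grew most between the two layers. Practical contrastive decoders usually also restrict the candidate set to tokens the final layer finds plausible; that step is omitted here.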
Article 25
Title@2025-05-29 (4): ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs
Title: ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs | ARC: Argumentationsdarstellungs- und Coverage-Analyse für eine Null-Shot-Lang-Dokument-Zusammenfassung mit Instruktion nach LLMs | ARC: “ 零张长文件摘要 “ 的参数代表性和覆盖面分析,在 “ LLM “ 之后指示 2505.23654v1 |
Authors: Mohamed Elaraby, Diane Litman
Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns – specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.
集成结构化信息长期以来提高了抽象总结的质量,特别是在保留突出内容方面。在这项工作中,我们侧重于一种特定的结构形式:辩论作用,这对于在法律等高占用领域总结文件至关重要。我们调查了受指导的大型语言模型(LLMs)是否充分保存了这一信息。为此,我们引入了论证代表覆盖面(ARC),这是衡量LLM生成摘要捕捉突出论点的程度的框架。我们利用ARC分析了三个开放重量的LMS在两个领域产生的摘要,其中争论作用是核心的:长期法律意见和科学文章。我们的结果显示,LLMs在一定程度上覆盖突出的争论作用,但在生成的摘要中往往遗漏了关键信息,特别是在投入中分散的争论时。此外,我们利用ARC发现行为模式 – – 特别是LM背景窗口的定位偏差和特定角色偏好如何影响生成摘要中关键论点的涵盖,强调需要更多有争议的总结战略。
Article 26
Title@2025-05-29 (4): Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation
Title: Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation | Kleine Sprachmodelle: Architekturen, Techniken, Evaluation, Probleme und zukünftige Anpassung | 小型语言模式:建筑、技术、评价、问题和未来适应 2505.19529v2 |
Authors: Tanjil Hasan Sakib, Md. Tanzib Hosain, Md. Kishor Morol
Small Language Models (SLMs) have gained substantial attention due to their ability to execute diverse language tasks successfully while using fewer computing resources. These models are particularly well suited to deployment in resource-limited environments, such as mobile devices, on-device processing, and edge systems. In this study, we present a complete assessment of SLMs, focusing on their design frameworks, training approaches, and techniques for lowering model size and complexity. We offer a novel classification system to organize the optimization approaches applied to SLMs, encompassing strategies like pruning, quantization, and model compression. Furthermore, we assemble an SLM evaluation suite from existing datasets, establishing a rigorous platform for measuring SLM capabilities. Alongside this, we discuss the important difficulties that remain unresolved in this area, including trade-offs between efficiency and performance, and we suggest directions for future study. We expect this study to serve as a useful guide for researchers and practitioners who aim to construct compact, efficient, and high-performing language models.
小型语言模型(SLMs)因其在使用较少的计算机资源的同时成功完成多种语言任务的能力而得到大量关注。这些模型对于在诸如移动设备、设备处理和边缘系统等有限环境中部署特别理想。在本研究中,我们全面评估了可持续土地管理,侧重于其设计框架、培训方法和降低模型规模和复杂性的技术。我们提供了一个新的分类系统,以组织对可持续土地管理应用的优化方法,包括诸如裁剪、量化和压缩模型等战略。此外,我们用一些现有的数据集将SLM关于评价套件的研究汇编成一些现有的数据集,建立一个严格的衡量可持续土地管理能力的平台。除此之外,我们讨论了这一部门尚未解决的重要困难,包括效率和绩效之间的权衡,我们建议今后研究的方向。我们预计这项研究将成为旨在构建紧凑、高效和高绩效语言模型的研究人员和从业人员的有益指南。
Article 27
Title@2025-05-29 (4): Are Reasoning Models More Prone to Hallucination?
Title: Are Reasoning Models More Prone to Hallucination? | Sind vernünftigere Modelle eher halluzinierend? | 理性模型更能让人产生幻觉吗? 2505.23646v1 |
Authors: Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, Tat-Seng Chua
Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold-start supervised fine-tuning (SFT) and verifiable-reward RL generally show alleviated hallucination. In contrast, both distillation alone and RL training without cold-start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.
最近发展起来的大型推理模型(LRMs)显示,在解决复杂任务时,有长期思维链推理能力(CoT)推理能力(LRMs)的强大表现。由于这些LRM多数是通过正式推理任务的培训后开发的,因此,它们是否广泛运用推理能力来帮助减少寻求事实的任务中的幻觉,现在仍然不清楚和辩论。例如,DeepSeek-RS1报告提高了简单QA(一个寻求事实的基准)的性能,而OpenAI-o3则观察到了更严重的幻觉。这种差异自然引起以下研究问题:推理模型更易产生幻觉吗?本文从三个角度处理问题。(1) 我们首先对LRMMs的幻觉进行整体评价。我们的分析显示,LRMRMs在培训后的全面演练中,经过寒冷监督的微调(SFT)和可核查的奖励RLL通常会减轻其幻觉。相比之下,光学和RM的L培训后演算过程通常也会改变我们错觉的正确性结果。
Article 28
Title@2025-05-29 (4): Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives
Title: Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives | Position: Skalierung von LLM-Agenten erfordert asymptotische Analyse mit LLM-Primitiven | 立场: 扩展 LLM 代理需要使用 LLM 原语进行渐近分析 2502.04358v2 |
Authors: Elliot Meyerson, Xin Qiu
Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With large language models (LLMs) crossing critical reliability thresholds for a growing slate of capabilities, there is an increasing effort to decompose systems into sets of LLM-based agents, each of whom can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such decomposed systems, and that insights from such analysis will unlock opportunities for scaling them. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.
将棘手问题分解成次级问题往往使这些问题更容易解决,更有效率。随着大型语言模型(LLMs)跨越关键可靠性临界临界值以达到日益成熟的能力,人们正日益努力将系统分解成以LLM为基础的代理器,每个代理器都可以被授予子任务。然而,这种分解(即使在自动化的情况下)往往不自然,例如,根据一个人如何将角色分配给人类团队的成员;这些角色是如何接近于最佳的分解?本立场文件认为,需要与LLLM原始体进行无症状分析,以说明这种分解系统的效率,而这种分析的洞见将释放出机会。通过将LLMM的前身作为计算成本的原子单位处理,可以将特定LLM的(往往不透明)内部工作与一组LMs如何精心安排以解决难题的内在效率区分开来。换句话说,如果我们想将LLMS的部署范围缩小到限度,而不是将这种分解的系统的效率提高到更强大的LMsrialms,那么,就应该将LMsrialmas进行更强大的分解分析。
Article 29
Title@2025-05-29 (4): YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Title: YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering | YESciEval: Robuster LLM-as-a-Richter für die Beantwortung wissenschaftlicher Fragen | YESciEval: 科学问题回答优异的LLM-as-a法官 2505.14279v2 |
Authors: Jennifer D’Souza, Hamed Babaei Giglou, Quentin Münch
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.
大型语言模型(LLMS)推动现代搜索引擎的科学问答,但其评价力度仍未得到充分探讨。我们引入了“YesciEval ” ,这是一个开放源码框架,将精细的标本评估与强化学习相结合,以减少LLM评价员的乐观偏向。我们发布多学科科学A数据集,包括对抗变量,并获得多个LMs的评价分数。我们的方法独立于专有模型和人类反馈,能够进行可扩展的、免费的评价。通过推广可靠的LLM-as-a-judge模型,这项工作支持AI的匹配,并促进对科学调查至关重要的强大、透明的评价。
Article 30
Title@2025-05-29 (4): Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education
Title: Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education | Menschliche Empathie als Encoder: KI-Assisted Depression Assessment in Special Education | 人类同理心作为编码器:特殊教育中的人工智能辅助抑郁症评估 2505.23631v1 |
Authors: Boning Zhao
Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students’ true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers’ empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional “Empathy Vector” (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input that enhances rather than replaces human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.
评估特殊教育等敏感环境中的学生抑郁症具有挑战性。标准化问卷可能无法充分反映学生的真实情况。此外,自动化方法往往会随着学生的丰富叙述而动摇,缺乏来自教师与学生的同情性联系的关键和个性化的洞见。现有方法往往无法解决这种模糊性,或有效地融合教育者的理解。为了通过促进人类-AI的协同协作来克服这些限制,本文件介绍了人类同情作为Encoder(HEAE),这是一个以人类为中心的创新的、以人类为中心的AI框架,用于透明和对社会负责的抑郁症严重程度评估。我们的方法将学生叙述文字与教师衍生的、九维的“EVEV”(EV)及其由PHQ-9框架指导的维度,明确将默认的同情性洞见转化为结构化的AI投入,而不是取代人类判断力。严格实验优化了多式融合、文本代表制和分类结构结构,实现了7级严重性分类的82.74%的精确度。这项工作展示了通过结构性嵌入人类同情力,走向更负责任和道德影响性计算的道路。
Article 31
Title@2025-05-29 (4): GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns
Title: GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns | GeNRe: Ein französisches Gender-Neutral-Rewriting-System mit kollektiven Substantiven | GeNRe:使用集体名词的法语性别中立改写系统 2505.23630v1 |
Authors: Enzo Doyen, Amalia Todirascu
A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.
在自然语言处理(NLP)领域使用的文本数据中有很大一部分显示出性别偏见,特别是由于使用男性通用技术(本应用于男女混合群体),这可以延续和扩大陈规定型观念。性别重写是一项NLP任务,涉及以中或相反的形式(如从男性到女性)自动检测和取代性别形式(从男性到女性),可以用来减轻这些偏见。这种系统是用多种语言(英语、阿拉伯语、阿拉伯语、葡萄牙语、德语、法语)开发的,而自动使用性别中性化技术(相对于包容性或性别转换技术)仅对英语进行了研究。本文介绍了法国首个使用集体名的无性别重写系统GENRe,这是法国首个使用集体名词的无性别的重写系统。我们为法语采用了一种基于规则的系统(RBS),同时就我们RBS生成的数据培训了两种经过微调的语言模式。我们还探索了使用基于指示的模型,以提高我们其他系统的性能,并发现Claude 3 Opus与我们降低性别偏见的RBRBA结果。
Article 32
Title@2025-05-29 (4): AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora | AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora | AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图 2505.23628v1 |
Authors: Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
我们展示了AutoSchemaKG,这是一个完全自主的知识图形构建框架,它消除了对预先定义的模型的需求。我们的系统利用大型语言模型,同时从文本中提取三重知识,并直接产生全面的模型,同时对实体和事件进行建模,同时利用概念化来将事件组织成语义类别。处理超过5 000万份文件,我们建造了ATLAS(自动三连和Schema感应),这是一个知识图表系列,拥有9亿+百万节点和59亿边缘。这个方法在多霍QA任务上优于最新水平的基线,并增强了LLM事实质量。值得注意的是,我们的系统感应实现了与人造图的95-语义比对齐,零人工干预,表明10亿级知识图与动态导成的Schemas可以有效地补充大型语言模型的参数知识。
Article 33
Title@2025-05-29 (4): RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning
Title: RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning | RULEBREAKERS: Herausfordernde LLMs an der Kreuzung zwischen formaler Logik und menschlicher Vernunft | RULEBREAKERS: 在形式逻辑与类人推理的十字路口挑战LLMs 2410.16502v2 |
Authors: Jason Chan, Robert Gaizauskas, Zhixue Zhao
Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as “rulebreaker” scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models’ poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs’ general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
形式逻辑通过将句子表示为符号形式并应用规则推导结论,使计算机能够用自然语言进行推理。然而,在我们的研究称为“规则破坏者”(rulebreaker)的情形中,这种方法可能得出人类基于常识和事实知识通常不会推断或接受的结论。受认知科学研究的启发,我们创建了RULEBREAKERS,这是第一个用于严格评估大型语言模型(LLMs)以类似人类的方式识别并应对规则破坏者(相对于非规则破坏者)的能力的数据集。通过评估七个LLMs,我们发现包括GPT-4o在内的大多数模型在RULEBREAKERS上的准确率表现平平,并表现出过于僵化地应用逻辑规则的倾向,这与典型人类推理者的预期表现不同。进一步的分析表明,这一明显的失败可能与模型对其世界知识的利用不足及其注意力分布模式有关。我们的研究在揭示当前LLMs局限性的同时,也对近期越来越多主张依靠形式逻辑来提升LLMs通用推理能力的工作提供了及时的平衡,强调了这些方法可能进一步加大LLMs与类人推理之间差异的风险。
Article 34
Title@2025-05-29 (4): Characterizing the Expressivity of Transformer Language Models
Title: Characterizing the Expressivity of Transformer Language Models | Charakterisierung der Expressivität von Transformer-Sprachmodellen | 描述变换语言模式的表达性 2505.23623v1 |
Authors: Jiaoda Li, Ryan Cotterell
Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions – such as arbitrary numerical precision and hard attention – that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
以变压器为基础的语言模型(LMs)取得了广泛的实证成功,但其理论表达力仍然只有部分被理解。先前的工作往往依赖于理想化模型,其假设 – – 例如任意数字精确度和重心 – – 与现实世界变压器不同。在这项工作中,我们提供了固定精密变压器的确切特征描述,其严格的未来掩码和软关注,这种理想化更密切地反映了实际实施。我们表明这些模型的准确性与线性时间逻辑的具体片段一样明确,它只包括一个时间操作器:过去的操作器。我们进一步将这一逻辑与正规语言理论、自动数据理论和代数的既定分类联系起来,形成一个丰富和统一的理论框架来理解变压器的表达性。最后,我们提出了与我们的理论紧密一致的经验结果:在理论能力范围内接受语言培训的变压器将完全超时长,而它们却始终无法在理论之外对语言进行普遍化。
Article 35
Title@2025-05-29 (4): Table-R1: Inference-Time Scaling for Table Reasoning
Title: Table-R1: Inference-Time Scaling for Table Reasoning | Tabelle-R1: Inferenz-Zeit-Skalierung für Tabellenveranlagung | 表-R1:表格理由推理的推断时间尺度 2505.23621v1 |
Authors: Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao
In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
在这项工作中,我们提出了第一份研究,以探索表格推理任务推理任务的推理时间缩放;我们制定和评价了两项培训后战略,以便能够推理时间缩放:从前沿模型推理模型中蒸馏,用可核查的奖励(RLVR)加强学习;在蒸馏方面,我们采用了由DeepSeek-R1产生的推理痕迹的大规模数据集,我们在表-R1-SFT模型中微调LMSLMS。关于RLVR,我们提出了具体任务可核实的奖励功能,并应用GROPO算法获取表-R1-零模型。我们评估了不同表格推理任务的表-R1系列模型,包括短格式QA、事实核实和自由格式QA。特别是,表-R1-Zero模型与GPT-41和DeepSeek-R1的性能匹配或超过GPT-41-4.1和DeepSeek-R1的性能。我们只使用一个7B参数LM。我们还展示了外部数据集的有力概括化。我们还展示了外数据设置。广泛的缩缩图和定性分析,并展示了作为基本推理学基础的推理学的优点,作为基础,作为基本推理学,作为基础,在一般结构中的基本推理学结构中的推理学的利。
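A verifiable reward for short-form table QA, of the kind RLVR relies on, can be sketched as a rule-based exact-match check. This is an illustration under stated assumptions, not the reward functions proposed in the paper: the `<answer>...</answer>` output format, the normalization rules, and the function names are all hypothetical.

```python
import re
import string

def normalize(text):
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def extract_answer(response):
    # Assumed output format: the final answer sits in <answer>...</answer> tags.
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def reward(response, gold):
    # 1.0 for a well-formed response whose normalized answer equals the gold answer.
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(gold) else 0.0

r_correct = reward("<think>row 3 lists the capital</think><answer>The Paris.</answer>", "paris")
r_wrong = reward("<answer>London</answer>", "paris")
r_unformatted = reward("Paris", "paris")
print(r_correct, r_wrong, r_unformatted)
```

Because the check is deterministic, the same response always receives the same reward, which is what makes it usable as a training signal without a learned reward model. Fact verification and free-form QA would need different checks (label matching, overlap metrics), which this sketch does not cover.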
Article 36
Title@2025-05-29 (4): EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation
Title: EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation | EXIT: Context-Aware Extractive Compression zur Verbesserung der Retrieval-Augmented Generation | EXIT: 为加强回流-提款一代而实行的背景软件抽取压缩 2412.12559v3 |
Authors: Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, Jong C. Park
We introduce EXIT, an extractive context compression framework that enhances both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully utilize contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents - while preserving their contextual dependencies - enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods and even uncompressed baselines in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT provides a promising direction for developing scalable, high-quality QA solutions in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT
我们引入了ExIT, 这是一种提高检索-强化生成(RAG)相关回答(QA)的实效和效率的采掘背景压缩框架。 当前的RAG系统在检索模型未能对最相关的文件进行排位时往往会挣扎,导致以延缓性和准确性为代价纳入更多的背景,从而牺牲了延缓性和准确性。 虽然抽象的压缩方法可以大大减少象征性计数,但其逐个逐个逐个生成的过程会大大地增加端到端端的延缓度。相反,现有的采掘方法会减少延缓性,但依赖独立、非调整性的判决选择,无法充分利用背景信息。 EXIT通过对检索文件的量刑进行分类,同时保护其背景依赖性,从而克服这些局限性。 使得平行的、有背景意识的提取能够适应查询复杂性和检索质量。 我们对单手和多跳的QA的评估表明,ExIT始终超过现有的压缩方法,甚至没有压缩质量的QAA的基线,同时大幅度减少推论时间和象征性的计数。 EXIT通过提高效力和效率,为开发可扩展性、高质量的TRA/HR QA的TRA的TRA的RGIS号提供了一个有前景/HRQQQA的有希望。
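The extract-then-keep flow described above (score each retrieved sentence against the query, keep the relevant ones in their original order) can be sketched as follows. This is a toy illustration, not the EXIT implementation: the word-overlap scorer is a hypothetical stand-in for the learned classifier, it ignores the surrounding document context that the real classifier conditions on, and the threshold value is arbitrary.

```python
import re

def split_sentences(document):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_relevance(query, sentence):
    # Hypothetical scorer: fraction of query words that appear in the sentence.
    q = words(query)
    return len(q & words(sentence)) / len(q) if q else 0.0

def compress(query, document, threshold=0.7):
    # Each sentence is scored independently, so scoring could run in parallel.
    kept = [s for s in split_sentences(document) if toy_relevance(query, s) >= threshold]
    return " ".join(kept)

doc = ("Paris is the capital of France. The Seine flows through it. "
       "Berlin is the capital of Germany.")
compressed = compress("capital of France", doc)
print(compressed)
```

Since sentences are copied verbatim rather than regenerated, the compressor adds no token-by-token decoding cost, which is the latency argument the abstract makes for extractive over abstractive compression.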
Article 37
Title@2025-05-29 (4): Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
Title: Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering | Satori-SWE: Evolutionäre Test-Zeit-Skalierung für probeneffiziente Software-Engineering | Satori-SWE:样本高效软件工程的进化测试-时间尺度 2505.23604v1 |
Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan
Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
语言模型(LMS)在标准化编码基准上表现良好,但与现实世界软件工程任务,例如解决SWE-Bench的GitHub问题,特别是在模型参数低于100B的情况下,在SWE-Bench中解决GitHub问题,特别是在模型参数低于100B的情况下。虽然较小的模型在实践中因其计算成本较低而更可取,但其性能仍具有挑战性。现有方法主要依靠监督的微调(SFT),具有高质量的数据,而这种数据在规模上是昂贵的。另一个办法是测试时间缩放:生成多种产出,使用核查器进行评分,并选择最佳的参数。虽然有效,但这一战略往往需要过多的取样和昂贵的评分,并限制其实际应用。我们建议采用将新一代作为进化过程的样本的测试时间缩放(EvoSUA),通过筛选和变异的输出,EvoSWES-S-S-SB 将产出分配的模型变为自我评估。
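The selection-and-mutation loop described in the abstract can be sketched on a toy problem. This is a minimal illustration, not the EvoScale method itself: candidates are integers rather than code patches, `score` is a made-up stand-in for a verifier, `mutate` is a random perturbation rather than a language model, and the function and parameter names are hypothetical.

```python
import random

def evolve(initial, score, mutate, generations=5, keep=2, rng=None):
    rng = rng or random.Random(0)
    population = list(initial)
    history = []
    for _ in range(generations):
        # Selection: keep the highest-scoring candidates.
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: derive two new candidates from each survivor.
        children = [mutate(p, rng) for p in parents for _ in range(2)]
        # Parents stay in the pool, so the best score never decreases.
        population = parents + children
        history.append(max(score(c) for c in population))
    return max(population, key=score), history

target = 42

def score(x):
    # Toy verifier: closer to the target is better.
    return -abs(x - target)

def mutate(x, rng):
    # Toy refinement step: nudge the candidate by a small random amount.
    return x + rng.choice([-3, -1, 1, 3])

best, history = evolve([0, 10, 20, 30], score, mutate)
print(best, history)
```

Compared with plain best-of-N sampling, each round here spends its samples near the current best candidates instead of drawing independently, which is the sample-efficiency argument made above. The paper goes further by training the model to improve its own outputs so that no external scorer is needed at inference time; this sketch does not attempt that part.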
Article 38
Title@2025-05-29 (4): STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Title: STeCa: Step-level Trajectory Calibration for LLM Agent Learning | STeCa: Schritt-Level-Trajektorienkalibrierung für LLM Agent Learning | STeCa:LLM代理学习的职级轨迹校准 2502.14276v2 |
Authors: Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
大型语言模型(LLM)的代理机构通过动态地与环境互动,在应对复杂任务方面表现出了希望。现有工作主要侧重于通过专家演示或探索性轨迹抽样学习来进行行为克隆;然而,这些方法往往难以解决长期对等任务,因为亚优化行动会一步步积累,使代理机构偏离正确的任务轨迹。为此,我们强调及时校准的重要性,以及自动为培训代理机构建立校准轨迹的必要性。我们提出了逐步轨迹校准(STeCa),这是LLLM代理机构学习的新颖框架。具体地说,STeCa在探索期间通过一步级的奖励比较确定了亚优性行动。它利用LLM驱动的反射来构建校准轨迹,使代理机构能够从改进的决策进程中学习。我们最后利用这些校准轨迹和成功轨迹来强化培训。广泛的实验表明STeCa大大超越了现有方法。进一步的分析强调,及时校准使代理机构能够以更稳健的方式完成任务。我们的代码和数据在 https/Wng/H.
Article 39
Title@2025-05-29 (4): X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents
Title: X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents | X-TURING: Auf dem Weg zu einem verbesserten und effizienten Turing-Test für Langzeit-Dialogagenten | X-TURING:面向长期对话代理的增强且高效的图灵测试 2408.09853v2 |
Authors: Weiqi Wu, Hongqiu Wu, Hai Zhao
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.
图灵测试考察人工智能在自然语言对话中是否表现出类似人类的行为。传统设置限制每位参与者一次只能发送一条消息，并且需要人类持续参与。这无法反映自然的对话风格，也妨碍了在复杂而长时间的互动中评估基于大型语言模型（LLM）的对话代理。本文提出 X-Turing，通过“突发对话”（burst dialogue）模式增强原始测试，允许使用连续消息进行更动态的交流。它还通过迭代生成模拟代理与人类长期互动的对话来构成测试过程的主要部分，从而进一步减少人类的工作量。在这一“伪对话”（pseudo-dialogue）历史的基础上，代理再与真人进行一段较短的对话，并与同一主题的人与人对话配对，通过问卷进行评判。我们引入 X-Turn Pass-Rate 指标来评估 LLM 在不同对话时长下的类人程度。虽然 GPT-4 等 LLM 起初表现良好，在 3 轮和 10 轮对话中分别达到 51.9% 和 38.9% 的通过率，但随着对话的推进，其表现不断下降，这凸显了长期保持一致性的难度。
Article 40
Title@2025-05-29 (4): Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles | Jigsaw-R1: Eine Studie über regelbasiertes Visuelles Verstärkungslernen mit Puzzle-Puzzles | Jigsaw-R1:用Jigsaw谜语进行基于规则的视觉强化学习研究 2505.23590v1 |
Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. Firstly, we find that MLLMs, which initially perform near random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1
将基于规则的强化学习（RL）应用于多模态大型语言模型（MLLM）带来了独特的挑战，其结果也可能偏离纯文本领域的发现，在感知密集型任务上尤其如此。本文以拼图（jigsaw puzzles）作为结构化的实验框架，对基于规则的视觉 RL 进行了全面研究，并揭示了若干关键发现。第一，我们发现 MLLM 起初在简单拼图上的表现接近随机猜测，经过微调后可达到接近完美的准确率，并能泛化到复杂的、未见过的配置。第二，在拼图上的训练可以带来向其他视觉任务的泛化，其效果与具体的任务配置有关。第三，无论有无显式推理，MLLM 都能学习和泛化，不过开源模型往往倾向于直接作答；因此，即使经过逐步推理的训练，它们在得出最终答案时也可能忽略思考过程。第四，我们观察到复杂的推理模式似乎是预先存在的，而非涌现的，其出现频率随训练和任务难度的增加而上升。最后，我们的结果表明 RL 比监督微调（SFT）表现出更有效的泛化，而初始的 SFT 冷启动阶段可能妨碍随后的 RL 优化。尽管这些观察基于拼图任务，在其他视觉任务上可能有所不同，但本研究为集体理解基于规则的视觉 RL 及其在多模态学习中的潜力这一更大的拼图贡献了有价值的一块。代码见 https://github.com/zifuwanggg/Jigsaw-R1
Article 41
Title@2025-05-29 (4): On-Policy RL with Optimal Reward Baseline
Title: On-Policy RL with Optimal Reward Baseline | On-Policy RL mit optimaler Prämienbasis | 具有最佳回报基准的 政策性RL 2505.23585v1 |
Authors: Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and from computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
强化学习算法对于使大型语言模型与人类偏好对齐并提升其推理能力至关重要。然而，现有的强化学习算法往往因同策略（on-policy）约束松散而训练不稳定，并因使用辅助模型而计算效率低下。在这项工作中，我们提出带最优奖励基线的同策略强化学习（OPO），这是一种新颖而简化的强化学习算法，旨在应对这些挑战。OPO 强调严格同策略训练的重要性，这在经验上稳定了训练过程并增强了探索。此外，OPO 引入了在理论上使梯度方差最小的最优奖励基线。我们在数学推理基准上评估了 OPO。结果表明，在不使用额外模型或正则化项的情况下，OPO 具有更优的性能和训练稳定性。此外，OPO 的策略偏移更小、输出熵更高，从而鼓励更多样、更少重复的回答。这些结果表明，OPO 是在大型语言模型对齐和推理任务中实现稳定而有效的强化学习的一个有前景的方向。实现见 https://github.com/microsoft/LMOps/tree/main/opo
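To make the idea of a variance-minimizing reward baseline concrete, here is a minimal sketch. It relies on the classical result for score-function gradient estimators: the scalar baseline that minimizes gradient variance is a weighted mean of rewards, where each sample is weighted by the squared norm of its log-probability gradient. The abstract does not state OPO's formula, so the weights passed in (for instance, a cheap proxy such as response length) and the function names `weighted_baseline` and `advantages` are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def weighted_baseline(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum(w_i * r_i) / sum(w_i).

    With w_i equal to the squared score-function norm of sample i, this is
    the classical variance-minimizing scalar baseline. Any other choice of
    weights is a proxy (an assumption of this sketch).
    """
    if len(rewards) != len(weights) or not rewards:
        raise ValueError("rewards and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * r for w, r in zip(weights, rewards)) / total


def advantages(rewards: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Subtract the weighted baseline from every reward."""
    b = weighted_baseline(rewards, weights)
    return [r - b for r in rewards]
```

With equal weights the baseline reduces to the plain mean reward of the sampled group; unequal weights pull the baseline toward the rewards of the samples that contribute most to the gradient.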
Article 42
Title@2025-05-29 (4): Multi-Domain Explainability of Preferences
Title: Multi-Domain Explainability of Preferences | Multi-Domain-Erklärbarkeit von Präferenzen | 优惠的多功能可解释性 2505.20088v2 |
Authors: Nitay Calderon, Liat Ein-Dor, Roi Reichart
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
人类偏好、LLM-as-a-judge(LaJ)和奖赏模式等特惠机制是调整和评价大型语言模式的核心。然而,驱动这些偏好的基本概念仍然没有得到很好的理解。在这项工作中,我们提出一个完全自动化的方法,用以产生基于地方和全球概念的跨多个领域的偏好解释。我们的方法利用LLM来确定区分选择和拒绝反应的概念,并用基于概念的矢量来代表它们。为了模拟概念和偏好之间的关系,我们提议了一个白色的“高端”多域反差模型,既能捕捉到领域一般效应,又能捕捉到特定领域效应。为了评估我们的方法,我们设计了一个涵盖8个挑战性和不同领域的数据集,并解释12个机制。我们的方法取得了很强的偏好预测性,优于基线,同时也可以解释。此外,我们用两种应用驱动环境来评估解释。首先,用LAJ解释的概念来指导LM产出,得出这些法官一贯喜欢的反应。第二,用解释人类偏爱度预测的概念来推动LAJs,用新的范式来解释他们的时代预测。
Article 43
Title@2025-05-29 (4): Evaluating AI capabilities in detecting conspiracy theories on YouTube
Title: Evaluating AI capabilities in detecting conspiracy theories on YouTube | Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube | 评价大赦国际在YouTube上发现阴谋论的能力 2505.23570v1 |
Authors: Leonardo La Rocca, Francesco Corso, Francesco Pierri
As a leading online platform with a vast global audience, YouTube’s extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.
作为拥有广大全球受众的领先在线平台,YouTube的广博范围也使其易于接收有害内容,包括虚假信息和阴谋理论。本研究探索了使用开放重量大语言模型(LLMs)(LLMs)来识别在YouTube上共享的阴谋理论视频(LLMs),利用贴有标签的数千个视频数据集,我们在零射场上对各种LLMs进行评估,并将其性能与微调的RoBERTA基线进行比较。结果显示,基于文本的LMs取得了高记数但低精确度,导致虚假的正数增加。多模式模型落后于只使用文本的对应方,表明视觉数据整合的好处有限。为了评估真实世界的可应用性,我们评估了无标签数据集的最准确模型,发现RoBERTA在使用更多参数接近LMs时取得了接近LMs的表现。我们的工作突出目前基于LM的在线有害内容检测方法的长处和局限性,强调需要更精确、更强大的系统。
Article 44
Title@2025-05-29 (4): Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Title: Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | Segment Policy Optimization: Effektive Segment-Level-Kreditvergabe in RL für große Sprachmodelle | 政策优化优化:大语言模式RL中有效的分部一级信用分配 2505.23564v1 |
Authors: Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities. Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. At the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity. SPO achieves a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving 6 to 12 percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving 7 to 11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
利用强化学习（RL）有效提升大型语言模型的推理能力仍是一项关键挑战。现有方法主要采用两种截然不同的优势估计粒度：词元级方法（例如 PPO）旨在提供细粒度的优势信号，但由于难以训练出准确的评论家（critic）模型，其估计并不准确；在另一个极端，轨迹级方法（例如 GRPO）仅依赖来自最终奖励的粗粒度优势信号，导致信用分配不精确。为解决这些局限，我们提出分段策略优化（SPO），这是一个在中间粒度上进行分段级优势估计的新型 RL 框架：它比轨迹级方法提供更精确的信用分配，又比词元级方法需要更少的估计点，从而无需评论家模型即可基于蒙特卡洛（MC）进行准确的优势估计。SPO 包含三个采用新策略的组成部分：（1）灵活的分段划分；（2）准确的分段优势估计；（3）利用分段优势进行策略优化，其中包括一种新的概率掩码（probability-mask）策略。我们进一步针对两种具体场景对 SPO 进行实例化：（1）面向短思维链（CoT）的 SPO-chain，采用新的基于切分点的划分和基于链的优势估计，在 GSM8K 上的准确率比 PPO 和 GRPO 高 6 至 12 个百分点；（2）面向长 CoT 的 SPO-tree，采用新的基于树的优势估计，显著降低了 MC 估计的成本，在 2K 和 4K 上下文评估下，在 MATH500 上比 GRPO 高 7 至 11 个百分点。我们的代码公开于 https://github.com/AIFrameResearch/SPO
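The following sketch illustrates one way critic-free, segment-level advantage estimation with Monte Carlo rollouts could look. It assumes the value at each segment boundary is the mean final reward of rollouts continued from that boundary, and that a segment's advantage is the change in value across it. The abstract does not spell out the estimator, so this formulation and the names `mc_value` and `segment_advantages` are illustrative assumptions rather than the SPO implementation.

```python
from statistics import mean
from typing import List, Sequence


def mc_value(rollout_rewards: Sequence[float]) -> float:
    """Monte Carlo value estimate: mean final reward over sampled rollouts."""
    return mean(rollout_rewards)


def segment_advantages(boundary_rollouts: Sequence[Sequence[float]]) -> List[float]:
    """Estimate one advantage per segment.

    boundary_rollouts[k] holds the final rewards of rollouts continued from
    boundary k (boundary 0 is the prompt, the last boundary is the end of the
    response, where the list can simply contain the observed reward).
    The advantage of segment k is V(boundary k+1) - V(boundary k).
    """
    values = [mc_value(r) for r in boundary_rollouts]
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]
```

For example, `segment_advantages([[0.0, 1.0], [1.0, 1.0], [1.0]])` assigns all of the credit to the first segment, because the estimated value rises there and stays flat afterwards.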
Article 45
Title@2025-05-29 (4): LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Title: LEXam: Benchmarking Legal Reasoning on 340 Law Exams | LEXam: Benchmarking der rechtlichen Begründung von 340 Rechtsprüfungen | LEXam:340项法律考试的法律依据基准 2505.12864v2 |
Authors: Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/
尽管测试时扩展（test-time scaling）近来取得了进展，长篇法律推理仍是大型语言模型（LLM）面临的一项关键挑战。我们推出 LEXam，这是一个源自 340 场法律考试的新基准，涵盖不同学科和学位层次的 116 门法学院课程。数据集包含 4,886 道英语和德语法律考试题，其中有 2,841 道长篇开放式问题和 2,045 道多项选择题。除参考答案外，开放式问题还附有明确的指导，概述预期的法律推理方法，例如争点识别、规则回忆或规则适用。我们的评估表明，开放式问题和多项选择题都对当前的 LLM 构成重大挑战；尤其是在需要结构化、多步骤法律推理的开放式问题上，它们表现得尤为吃力。此外，我们的结果凸显了该数据集在区分不同能力模型方面的有效性。我们采用 LLM-as-a-Judge 范式并辅以严格的人类专家验证，展示了如何一致而准确地评估模型生成的推理步骤。我们的评估设置提供了一种可扩展的方法，可在简单的准确率指标之外评估法律推理质量。项目页面：https://lexam-benchmark.github.io/
Article 46
Title@2025-05-29 (4): Understanding Refusal in Language Models with Sparse Autoencoders
Title: Understanding Refusal in Language Models with Sparse Autoencoders | Ablehnung in Sprachmodellen mit Sparse Autoencodern verstehen | 使用 sparse 自动解析器理解语言模式中的拒绝拒绝模式 2505.23556v1 |
Authors: Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
拒绝是统一语言模式中的一个关键安全行为,然而,内部机制驱动拒绝机制仍然不透明。 在这项工作中,我们利用稀疏的自动编码器对受指导的LLMS的拒绝进行机械性研究,以找出因果调解拒绝行为的潜在特征。我们运用了两种开放源码聊天模式,并对拒绝相关特征进行干预,以评估其对生成的影响,确认其在多个有害数据集中的行为影响。这样就可以对拒绝行为在激活层面的表现进行精细细致的检查,并解决关键研究问题,例如调查上下游潜在关系,了解对抗性侵入性侵入性侵入技术的机制。我们还建立了拒绝特征,以加强线性探测的通用性,从而在分类任务中超越分布式对抗性对立样品。我们在 https://github.com/wj210/refusal_sae中打开了我们的代码。
Article 47
Title@2025-05-29 (4): Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Title: Enhancing Automated Interpretability with Output-Centric Feature Descriptions | Verbesserte Automatisierte Dolmetschbarkeit mit Output-Centric-Feature-Beschreibungen | 加强自动解释与产出中心特点描述的可解释性 2501.08319v2 |
Authors: Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary “unembedding” head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be “dead”.
自动可解释性管道为大型语言模型(LLMs)中的特征所代表的概念,如工厂或句子中的第一个词,产生自然语言描述。这些描述是用激活功能的投入产生的,可能是模型代表空间的一个维度或方向。然而,确定启动投入的成本很高,模型行为中一个特征的机械作用取决于投入如何导致一个特性的激活,以及特征激活如何影响产出。我们通过指导评价发现,当前管道提供的描述未能捕捉该特性对产出的因果关系。为了修正这一点,我们提出了自动生成特征描述的高效、以输出为中心的方法。这些方法使用在特性刺激之后加权较高的符号,或者在直接应用“集合”字典后最高重量符号。我们以产出为中心的描述更好地捕捉模型产出特征的因果关系效应,而不是以输入为中心的描述,但将两者结合起来,可以导致对输入和产出评价的最佳性能。最后,我们表明,产出中心描述可以用来查找先前认为启动特征时“死”的投入。
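One of the output-centric methods mentioned above applies the vocabulary unembedding directly to a feature and reads off the highest-weight tokens. The sketch below illustrates that projection with plain Python lists. The toy unembedding matrix, the vocabulary, and the function name `top_tokens_for_feature` are hypothetical and stand in for a real model's parameters; this is an illustration of the idea, not the authors' pipeline.

```python
from typing import List, Sequence


def top_tokens_for_feature(
    unembed: Sequence[Sequence[float]],  # one row per vocabulary token
    feature: Sequence[float],            # feature direction in representation space
    vocab: Sequence[str],
    k: int = 3,
) -> List[str]:
    """Project a feature through the unembedding and return the k top-scoring tokens."""
    if len(unembed) != len(vocab):
        raise ValueError("unembed must have one row per vocabulary entry")
    # Score of each token = dot product of its unembedding row with the feature.
    scores = [sum(w * x for w, x in zip(row, feature)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]
```

The returned tokens can then be handed to a describer (for example an LLM prompt) as evidence of what the feature promotes in the output, complementing descriptions built from activating inputs.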
Article 48
Title@2025-05-29 (4): Translation in the Wild
Title: Translation in the Wild | Übersetzung in der Wildnis | 《野生》翻译 2505.23548v1 |
Authors: Yuri Balashov
Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in “incidental bilingualism” (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs’ translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the “duality” hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.
大语言模型(LLMS)在翻译方面表现优异,在零和少见的环境下展示了许多语言配对的竞争性性能。但与专门的神经机器翻译模型不同,LMS没有受过任何与翻译有关的目标的培训。什么解释其非凡的翻译能力?这些能力在培训数据中是否基于“偶然的双语”(Briakou等人,2023年)?教学调试是否对此有所帮助?LLMS是否有能力调整和利用互联网不同角落中不可能适合单一背景窗口的语义相同或类似的单语内容?我根据最近的研究和不断增长的用户经验,对这个专题进行一些反思。我的工作假设是LMS的翻译能力来自两种不同的培训前数据,这些数据可以不同的方式被模型内部化。我讨论了测试“质量”假设的前景及其在深学习时代对重新构思翻译、人文和机器的影响。
Article 49
Title@2025-05-29 (4): Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Title: Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Wahrscheinlichkeitskonsistente Preference-Optimierung für verbesserte LLM-Reasoning | 增强 LLM 理由说明的优化 2505.23540v1 |
Authors: Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
近来在优化优惠方面取得的进步表明,在提高大型语言模型(LLMs)的数学推理能力方面具有巨大潜力。虽然目前的做法通过基于结果的标准,例如答案的正确性或一致性,利用了高质量的对等优惠数据,但从根本上忽视了应对措施的内部逻辑一致性。为了克服这一点,我们提议采用概率-兼容性-优先优化(PCPO)这一新框架,为选择优惠确定双重量化指标:(1) 表面答案的正确性,(2) 各种答复之间内在的象征性概率一致性。广泛的实验表明,我们的PCPO在各种LMs和基准方面,一贯优于现有的只注重结果的标准方法。我们的代码可在https://github.com/YunqiaoYang/PCPO上公开查阅。
Article 50
Title@2025-05-29 (4): Fast Large Language Model Collaborative Decoding via Speculation
Title: Fast Large Language Model Collaborative Decoding via Speculation | Schnelles Large Language Model Kollaboratives Decodieren über Spekulation | 通过投机进行快速大语言合作示范模式 2502.01662v2 |
Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding, typically achieving faster speed. Extensive experiments demonstrate that CoS is 1.11x to 2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
大型语言模型（LLM）协同解码技术通过在每个生成步骤合并多个模型的输出来提高输出质量，但计算成本很高。本文提出基于推测的协同解码（CoS），这是一个在不损害性能的前提下加速协同解码的新框架。推测解码由一个小型提议模型顺序生成词元，再由一个较大的目标模型并行验证；受此启发，我们的方法建立在两个关键见解之上：（1）验证分布可以是提议模型与目标模型的组合分布；（2）让各模型轮流担任提议者和验证者可以进一步提高效率。我们将该方法推广到 n 个模型之间的协作，并从理论上证明 CoS 绝不会比标准协同解码慢，通常速度更快。大量实验表明，CoS 比标准协同解码快 1.11 至 2.23 倍，且不损害生成质量。我们的代码见 https://github.com/Kamichanw/CoS/
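As a rough illustration of verifying a drafted token against a combined distribution, the sketch below uses the standard speculative-sampling acceptance rule: accept the drafted token with probability min(1, target/proposal), otherwise resample from the normalized positive part of (target - proposal). Taking the verification distribution to be an equal-weight average of the two models is an assumption of this sketch (the abstract only says "combined distribution"), and the names `combine`, `residual`, and `verify_token` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def combine(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Equal-weight mixture of two next-token distributions (assumed combination rule)."""
    return [0.5 * (a + b) for a, b in zip(p, q)]


def residual(target: Sequence[float], proposal: Sequence[float]) -> List[float]:
    """Normalized positive part of (target - proposal), used after a rejection."""
    diff = [max(t - p, 0.0) for t, p in zip(target, proposal)]
    z = sum(diff)
    return [d / z for d in diff] if z > 0 else list(target)


def verify_token(
    token: int,
    proposal: Sequence[float],
    target: Sequence[float],
    rng: random.Random,
) -> Tuple[int, bool]:
    """Accept the drafted token or resample; returns (token, accepted)."""
    # token was drawn from `proposal`, so proposal[token] > 0.
    accept_prob = min(1.0, target[token] / proposal[token])
    if rng.random() < accept_prob:
        return token, True
    res = residual(target, proposal)
    return rng.choices(range(len(res)), weights=res, k=1)[0], False
```

A typical use would draft a token from one model's distribution `p`, build `target = combine(p, q)` with the other model's distribution `q`, and call `verify_token`. Under this rule the emitted token follows `target`, which is why speculation can speed up decoding without changing the output distribution.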
Article 51
Title@2025-05-29 (4): CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification
Title: CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification | CLaC bei SemEval-2025 Task 6: Ein Multi-Architektur-Ansatz für die Verifikation von Unternehmensumweltversprechen | SemEval-2025任务6:公司环境承诺核查的多建筑方法 2505.23538v1 |
Authors: Nawar Turk, Eeham Khan, Leila Kosseim
This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
本文介绍了我们处理SemEval-2025任务~6(PromiseEval)的方法,重点是核查公司ESG(环境、社会和治理)报告中的承诺。我们探讨了三个模型结构,以解决承诺识别、支持证据评估、清晰度评估和核查时间等四个子任务。我们的第一个模型使用ESG-BERT, 配有具体任务分类负责人,而我们的第二个模型则加强这一结构,配有针对每个子任务定制的语言特征。我们的第三个模型采用一个综合子任务模型,配有基于注意的序列集合、变压器演示以及文件元数据和多目标学习。对ML-Proise数据集的英文部分的实验表明我们各模型的逐步改进,我们的综合子任务组合方法取得了0.5268的领先分,超过了所提供的0.5227的基线。我们的工作突出语言特征提取、集中和多目标学习在承诺核查任务中的有效性,尽管由于阶级不平衡和培训数据有限而带来挑战。
Article 52
Title@2025-05-29 (4): Domain-Aware Tensor Network Structure Search
Title: Domain-Aware Tensor Network Structure Search | Domain-Aware Tensor Netzwerkstruktur Suche | 域- 软件显示器网络网络结构搜索 2505.23537v1 |
Authors: Giorgos Iacovides, Wuyang Zhou, Chao Li, Qibin Zhao, Danilo Mandic
Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so-called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive because they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with far fewer function evaluations than SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.
电线网络(TNS)能够有效地反映高维数据,然而,确定最佳的TN结构,即所谓的高频网络结构搜索(TN-SS)问题,仍然是一项挑战。目前的先进(SOTA)算法在计算上成本很高,因为它们需要广泛的功能评估,而对于现实世界的应用来说,这种评估是令人望而却步的。此外,现有的方法忽视了现实世界数据所固有的宝贵域信息,而且其查明的TN结构缺乏透明度。为此,我们提议了一个新型的TN-SS框架,称为TnLLM,它包含数据域域信息并利用大型语言模型(LLMS)的推理能力直接预测适当的TN结构。拟议的框架涉及一种对域有觉的快速管道,它要求LM根据现实世界关系推导出适当的TN结构结构。我们的方法不仅能够反复优化目标功能,而且还能够为所确定的结构产生域觉悟解释。实验结果表明,TNLLM在S-SS的初始搜索功能上实现了可比较的TN-S-M-LTA的快速搜索功能,而我们使用的域域域域级搜索功能则可以少于SO-MA。
Article 53
Title@2025-05-29 (4): Joint Localization and Activation Editing for Low-Resource Fine-Tuning
Title: Joint Localization and Activation Editing for Low-Resource Fine-Tuning | Gemeinsame Lokalisierungs- und Aktivierungsbearbeitung für Low-Resource Fine-Tuning | 低资源微调联合定位和启动编辑 2502.01179v4 |
Authors: Wen Lai, Alexander Fraser, Ivan Titov
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. Due to their extremely small parameter counts, these methods show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit; (2) whether the intervention should be additive, multiplicative, or both; and (3) the intervention parameters themselves, that is, the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://github.com/wenlai-lavine/jola.
参数高效微调（PEFT）方法（如 LoRA）常用于适配 LLM。然而，在只有几百个样本的低资源场景中，标准 PEFT 方法的效果有限。可解释性研究的最新进展催生了激活编辑（或引导，steering）技术，这类技术修改特定模型组件的激活。由于参数量极小，这些方法在小数据集上显示出前景。然而，它们的性能高度依赖于找到正确的待编辑模块，并且在不同数据集之间往往缺乏稳定性。本文提出联合定位与激活编辑（JoLA），该方法联合学习：（1）编辑 Transformer 中的哪些头（head）；（2）干预应为加性、乘性还是两者兼有；（3）干预参数本身，即作为加性偏移或乘性缩放施加于头输出的向量。通过在涵盖常识推理、自然语言理解和自然语言生成的三个基准上的评估，我们证明 JoLA 始终优于现有方法。该方法的代码发布于 https://github.com/wenlai-lavine/jola
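The three learned quantities listed in the abstract (which heads, which intervention type, and the intervention vectors) can be pictured with a small gated edit of a single head's output. In the sketch below, an additive offset and a multiplicative scaling are each switched by a scalar gate, so a gate of 0 disables that intervention and closing both gates leaves the head untouched. The exact parameterization, the gate semantics, and the function name `edit_head_output` are assumptions made for illustration and are not taken from the JoLA code.

```python
from typing import List, Sequence


def edit_head_output(
    z: Sequence[float],        # original output of one attention head
    add_vec: Sequence[float],  # additive offset vector
    mul_vec: Sequence[float],  # multiplicative scaling vector
    gate_add: float,           # in [0, 1]; 0 turns the additive edit off
    gate_mul: float,           # in [0, 1]; 0 turns the multiplicative edit off
) -> List[float]:
    """Return z + gate_mul * (mul_vec * z) + gate_add * add_vec, elementwise."""
    if not (len(z) == len(add_vec) == len(mul_vec)):
        raise ValueError("all vectors must have the same length")
    return [
        zi + gate_mul * mi * zi + gate_add * ai
        for zi, ai, mi in zip(z, add_vec, mul_vec)
    ]
```

In a trainable version the gates would be learned per head, so that heads whose gates both go to zero are effectively not selected for editing, which is one way to read "jointly learns which heads to edit".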
Article 54
Title@2025-05-29 (4): Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents
Title: Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents | Auf dem Weg zu logisch klingender natürlicher Sprache mit logisch-erweiterten Sprachmodell-Agenten | 与逻辑增强语言示范代理商一道,争取实现逻辑合理自然语言合理 2408.16081v2 |
Authors: Agnieszka Mensfelt, Kostas Stathis, Vince Trencsenyi
Large language models (LLMs) are increasingly explored as general-purpose reasoners, particularly in agentic contexts. However, their outputs remain prone to mathematical and logical errors. This is especially challenging in open-ended tasks, where unstructured outputs lack explicit ground truth and may contain subtle inconsistencies. To address this issue, we propose Logic-Enhanced Language Model Agents (LELMA), a framework that integrates LLMs with formal logic to enable validation and refinement of natural language reasoning. LELMA comprises three components: an LLM-Reasoner, an LLM-Translator, and a Solver, and employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. Using game-theoretic scenarios such as the Prisoner’s Dilemma as testbeds, we highlight the limitations of both less capable (Gemini 1.0 Pro) and advanced (GPT-4o) models in generating logically sound reasoning. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement, particularly in GPT-4o. The study also highlights challenges in autoformalization accuracy and in evaluation of inherently ambiguous open-ended reasoning tasks.
大型语言模型(LLMS)作为一般用途解释器的探索越来越多,特别是在代理人的情况下。但是,其产出仍然容易发生数学和逻辑错误。在开放的任务中,这特别具有挑战性,因为没有结构的产出缺乏明确的地面真相,而且可能含有微妙的不一致之处。为了解决这一问题,我们提议了逻辑增强的语言模型(LELMA),这是一个将LLMS与正式逻辑相结合的框架,以便能够验证和完善自然语言推理。LLMA由三个部分组成:LLM-Reasoner、LLM笔译员和Solverer,以及采用自动化法化法将推理转换成逻辑表述,然后用于评估逻辑有效性。我们用游戏理论假设,例如Pater’s Dilemma作为测试台,我们强调能力较弱(Gemini 1.0 Pro)和高级(GPT-4o)两种模型在产生逻辑合理推理方面的局限性。LLMA在通过自我精细,特别是在GPT-4o中,通过自我精确度来改进来提高逻辑推理学的精确度方面,并改进推理理学的推理精确性。该研究还强调了性任务中,还突出的内在推理学上的挑战。
Article 55
Title@2025-05-29 (4): Hijacking Large Language Models via Adversarial In-Context Learning
Title: Hijacking Large Language Models via Adversarial In-Context Learning | Entführen von großen Sprachmodellen über das adversarische In-Context-Lernen | 通过对抗性内书学习劫持大语言模式 2311.09948v3 |
Authors: Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Prashant Khanduri, Dongxiao Zhu
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the preconditioned prompts. Despite its promising performance, crafted adversarial attacks pose a notable threat to the robustness of LLMs. Existing attacks are either easy to detect, require a trigger in user input, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable prompt injection attack against ICL, aiming to hijack LLMs to generate the target output or elicit harmful responses. In our threat model, the hacker acts as a model publisher who leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demos via prompt injection. We also propose effective defense strategies using a few shots of clean demos, enhancing the robustness of LLMs during ICL. Extensive experimental results across various classification and jailbreak tasks demonstrate the effectiveness of the proposed attack and defense strategies. This work highlights the significant security vulnerabilities of LLMs during ICL and underscores the need for further in-depth studies.
理论内学(ICL)已成为一种强有力的范例,通过在先决条件的提示下,利用标记的例子作为示范(演示),利用LLM执行具体的下游任务,使LLM发挥杠杆作用。尽管其表现令人充满希望,但精心策划的对抗性攻击对LLM的强健性构成了显著的威胁。现有的攻击要么容易发现,需要用户投入,或者对ICL缺乏具体性。为解决这些问题,这项工作引入了针对ICL的新型可转移的迅速注射攻击,目的是劫持LLMS,以产生目标产出或引起有害反应。在我们的威胁模式中,黑客充当了利用基于梯度的快速搜索方法学习和通过迅速注射将无法察觉的对抗性后遗症附在文本内演示中的一种示范出版商。我们还提出使用几支干净的演示镜头的有效防御战略,加强LLMs在ICL期间的强健性。各种分类和破监狱任务的广泛实验结果表明拟议的攻击和防御战略的有效性。这项工作突出了LMS在ICL期间的重大安全脆弱性,并强调需要进一步深入研究。
Article 56
Title@2025-05-29 (4): Identity resolution of software metadata using Large Language Models
Title: Identity resolution of software metadata using Large Language Models | Identitätsauflösung von Software-Metadaten mit großen Sprachmodellen | 使用大语言模式的软件元数据的识别分辨率 2505.23500v1 |
Authors: Eva Martín del Pico, Josep Lluís Gelpí, Salvador Capella-Gutiérrez
Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.
与研究数据相比,软件是研究的一个基本组成部分。然而,与研究数据相比,对软件的注意很少。最近,人们更加努力承认和强调软件在研究活动中的重要性。生物工具、生物导体和Galaxy ToolShed等平台的结构化元数据为生命科学研究软件提供了宝贵的见解。虽然该元数据最初旨在支持发现和整合,但可以重新用于大规模分析软件实践。然而,该元数据的质量和完整性在平台上各有差异,反映了各种文件做法。为了全面了解软件的开发和可持续性,有必要巩固这一元数据,但需要建立强有力的机制来解决其差异性和规模。本文章对用于软件元数据解析任务的指示调整型大语言模型进行了评价,这是构建统一研究软件库的关键一步。这种收集是OpenEbeench软件观测台的参考部分,该台是一个将元数据汇总成一个平台,用以监测生命科学研究软件的FAIR性。我们用多种模型比对有附加说明的黄金标准进行了基准,需要加以巩固,但需要建立强有力的机制来应对其模棱不全案例和规模和规模的不均匀性,同时,还引入了高额的可靠、高额的统计模型决定。
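The "agreement-based proxy for high-confidence automated decisions" mentioned in the abstract above can be pictured with a minimal sketch. The abstract does not give the actual rule, so the function name `agreement_proxy`, the `"same"`/`"different"` labels, and the default unanimity threshold are all assumptions made for illustration: a decision is accepted automatically only when enough models agree, and is otherwise left for human review.

```python
from collections import Counter


def agreement_proxy(votes, min_agree=None):
    """Return (label, confident) for one candidate pair of metadata records.

    votes     -- labels produced by different models, e.g. "same" / "different"
                 (hypothetical label names, not taken from the paper)
    min_agree -- how many models must share the top label for the decision
                 to count as high-confidence; defaults to all of them
    """
    if not votes:
        raise ValueError("at least one vote is required")
    needed = len(votes) if min_agree is None else min_agree
    # Most frequent label and how many models voted for it.
    label, count = Counter(votes).most_common(1)[0]
    return label, count >= needed
```

With three models, `agreement_proxy(["same", "same", "same"])` is accepted automatically, while a 2-to-1 split is flagged as not confident unless `min_agree=2` is passed.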
Article 57
Title@2025-05-29 (4): Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Title: Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | Diagnose und Bewältigung von Pitfalls in KG-RAG-Datensätzen: Zu zuverlässigerem Benchmarking | 分析和处理KG-RAG数据集的缺陷:争取更可靠的基准 2505.23495v1 |
Authors: Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10k-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
知识图表解答系统(KGQA)依靠高质量的基准来评价复杂的多点推理,然而,尽管这些系统得到广泛使用,但广受欢迎的数据集,如WebQSP和CWQ等,却存在关键性的质量问题,包括不准确或不完整的地面真相说明、结构不当的问题模糊、微不足道或无法回答、过时或不一致。通过对16个广受欢迎的KGQA数据集,包括WebQSP和CWQ进行人工审计,我们发现平均事实正确率仅为57 % 。为了解决这些问题,我们引入了KGQAGen,这是一个系统地解决这些陷阱的LLM-loop框架。KGAG将结构化的知识基础、LLM-指导的生成和象征性的核查结合起来,以产生具有挑战性和可核查的QA实例。我们用KQGAG建立以维基数据为基础的十倍和规模基准基准,并评价一套不同的KG-RAG模型。实验结果显示,甚至将KGG的更严格的能力定位定位定位到KGA的模型。
Article 58
Title@2025-05-29 (4): Spoken Language Modeling with Duration-Penalized Self-Supervised Units
Title: Spoken Language Modeling with Duration-Penalized Self-Supervised Units | Gesprochene Sprachmodellierung mit Dauer-Penalisierten Selbstüberwachten Einheiten | 长期惩罚性自督单位的口语模拟模式 2505.23494v1 |
Authors: Nicol Visser, Herman Kamper
Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren’t always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
口语模型(SLMs)运行于通过自我监督的语音演示独立化获得的音响单元。虽然这些单元的特征直接影响到性能,但代码大小和单位粗糙(即持续时间)之间的相互作用仍未探索。我们使用简单期限的动态编程(DPDP)方法来调查SLM的性能和单位粗糙性能。新的分析在不同语言层次上进行。在电话和文字层次上,粗糙几乎没有什么好处,只要正确选择代码手册的大小。然而,在生成整个句子时,SLMs与粗粗糙的单元相比表现更好。在词汇和合成语言模型任务中,粗糙的单元在较低的位数上也提供更高的读数。因此我们表明,粗粗的单元并非总是更好,但DPDPD是一个简单而高效的方法,在它们能够受益的任务中获取粗糙的单元。
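To make the idea of duration-penalized dynamic programming concrete, here is a minimal sketch under stated assumptions. The abstract does not spell out the objective, so the cost used here (squared distance of every frame in a segment to one shared codebook entry, plus `lam * (1 - duration)` so that longer segments are rewarded) is an assumed form, and `dpdp_segment`, `lam`, and `max_dur` are hypothetical names. A larger `lam` gives coarser units.

```python
def dpdp_segment(frames, codebook, lam=1.0, max_dur=10):
    """Segment `frames` into (code_index, duration) units.

    frames   -- list of feature vectors (lists of floats)
    codebook -- list of code vectors of the same dimension
    lam      -- weight of the assumed duration penalty (1 - duration)
    max_dur  -- longest segment considered, in frames
    """
    n, num_codes = len(frames), len(codebook)
    # dist[t][k]: squared Euclidean distance between frame t and code k.
    dist = [[sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
            for x in frames]
    best = [0.0] + [float("inf")] * n   # best[t]: minimal cost of frames[:t]
    back = [None] * (n + 1)             # back[t]: (segment start, code index)
    for t in range(1, n + 1):
        seg_cost = [0.0] * num_codes    # cost of frames[s:t] under each code
        for s in range(t - 1, max(0, t - max_dur) - 1, -1):
            for k in range(num_codes):
                seg_cost[k] += dist[s][k]
            k_best = min(range(num_codes), key=seg_cost.__getitem__)
            cost = best[s] + seg_cost[k_best] + lam * (1 - (t - s))
            if cost < best[t]:
                best[t], back[t] = cost, (s, k_best)
    # Backtrack from the end to recover the unit sequence.
    units, t = [], n
    while t > 0:
        s, k = back[t]
        units.append((k, t - s))
        t = s
    units.reverse()
    return units
```

On a toy one-dimensional input, raising `lam` merges the two clusters into a single long unit, which mirrors the codebook-size versus coarseness trade-off the abstract studies.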
Article 59
Title@2025-05-29 (4): R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
Title: R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | R2I-Bench: 基准推理-驱动生成文本到图像 2505.23493v1 |
Authors: Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io
理性是真实世界文本到图像(T2I)生成时通常需要的一种基本能力,例如,产生“被咬的苹果,在空气中留置超过一周”(Pa bitten apple)需要理解时间衰减和常识概念。虽然最近的T2I模型在制作光现实图像方面取得了令人印象深刻的进展,但其推理能力仍然不足,评价不充分。为了缩小这一差距,我们引入了R2I-Bench,这是一个全面基准,专门用来严格评估推理驱动的T2I一代。R2I-Bench由精心整理的数据实例组成,涵盖核心推理类别,包括公元、数学、逻辑、构成、数字、因果和概念混合。为了便利精细化评价,我们设计了R2IScore,这是基于实例、注重推理、注重推理的QA型评价问题,评估了三个关键方面:文字图像校正、推理准确性和图像质量。与16个具有代表性的T2I模型进行了广泛的实验,包括一个强有力的编基础框架,利用州级推理和新一代,并用更强的推理、高的推理学模型展示了新一代的建筑系统。I在持续地展示的推理学制中展示模型中展示模型中需要。
Article 60
Title@2025-05-29 (4): Learning to Poison Large Language Models for Downstream Manipulation
Title: Learning to Poison Large Language Models for Downstream Manipulation | Große Sprachmodelle für Downstream-Manipulation zu vergiften | 学习下游操作毒物大语言模式 2402.13459v3 |
Authors: Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Mohammad Amin Roshani, Prashant Khanduri, Douglas Zytko, Dongxiao Zhu
The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where the adversary inserts backdoor triggers into training data to manipulate outputs. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various language model tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs’ outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during SFT of LLMs and the necessity of safeguarding LLMs against data poisoning attacks.
大语言模型(LLMS)的出现在语言处理和推理能力方面取得了显著成就。尽管取得了进步,LLMS面临数据中毒袭击的脆弱性,因为对手将后门触发器插入培训数据以操纵产出。这项工作进一步确定了LMS的额外安全风险,为此设计了新的数据中毒袭击,专门利用监管的微调(SFT)程序。我们提出了一个新的梯度引导后门触发学习算法,以有效识别对抗性触发器,确保常规防御在保持内容完整性的同时逃避发现。通过对各种语言模型任务(包括情绪分析、域生成和回答问题)的实验性验证,我们的中毒战略在损害LMS的各种产出方面表现出高成功率。我们进一步提出了两种防范数据中毒袭击的防御战略,包括文体内学习(ICL)和持续学习(CLF),以有效纠正LMs的行为并显著降低性能下降。我们的工作突出了SFTM公司在维持内容完整性方面所面临的重大安全风险,以及保护LMS公司免受数据中毒袭击的必要性。
Article 61
Title@2025-05-29 (4): Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
Title: Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu | In-Context Machine Translation für Low-Resource-Sprachen verstehen: Eine Fallstudie zu Mandschu | 理解低资源语言的文内机翻译:关于满字的个案研究 2502.11862v2 |
Authors: Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an enciphered version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.
具有大语言模型(LLMS)的内文机翻译(MT)是低资源MT的一个很有希望的方法,因为它可以很容易地利用语法书籍和词典等语言资源,这种资源通常有选择性地纳入迅速使用,以便LLMS能够通过其内文学习能力(ICL)直接进行不经过任何具体培训的翻译。然而,每种类型的资源的相对重要性,例如字典、语法书和检索的平行实例,并不十分明确。为了填补这一空白,本研究系统地调查每一种资源及其质量如何影响翻译业绩,以满洲语言作为案例研究。为了消除LLM参数中编码的Manchu先前的任何知识,并单独列出ICL的效果,我们还试验了曼丘文本的加密版本。我们的结果表明,高质量的词典和良好的平行范例非常有用,但语法几乎无济于事。在一项后续研究中,我们展示了一种有前途的、可言的运用于文字中的MTM:平行数据扩充,作为在常规MTMT模型中加固的一种方法,通过一个有效的综合数据模型产生。
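The enciphering step described above (hiding Manchu word forms so that the LLM cannot rely on knowledge stored in its parameters) can be illustrated with a simple letter-substitution cipher. This is only a sketch: the abstract does not say which cipher the authors use, so the bijective substitution over lowercase Latin letters, the seed, and the names `make_cipher` and `encipher` are assumptions. The same table would have to be applied to the source text, the dictionary entries, and the parallel examples so that in-context lookups still match.

```python
import random
import string


def make_cipher(seed=0):
    """Build a fixed bijective substitution table over a-z (assumed scheme)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encipher(text, table):
    """Substitute every mapped letter; other characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text.lower())
```

Because the table is a bijection, inverting it (`{v: k for k, v in table.items()}`) recovers the original romanized text, so model outputs can be mapped back for evaluation.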
Article 62
Title@2025-05-29 (4): Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Title: Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions | Firm oder Fickle? Bewertung großer Sprachmodelle Konsistenz in sequenziellen Interaktionen | 公司或Fickle?评估大语言模型在序列相互作用中的一致性 2503.22353v2 |
Authors: Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.
大型语言模型(LLMS)在各种任务中表现出非凡的能力,但在高接触领域部署这些模型需要多轮互动的一致表现,本文件为评价和改进LLM反应一致性提出了全面框架,作出了三项关键贡献。首先,我们提出一个新的定位-视觉一致性评分,既反映早期稳定的重要性,又反映多方向互动中恢复模式的重要性。第二,我们提出一个仔细制定的基准数据集,涵盖不同领域和困难程度,具体设计该数据集是为了评估在各种具有挑战性的后续设想下LLM的一致性。第三,我们引入了信任-软件响应生成(CARG),这是一个通过将示范信任信号纳入生成过程而大大提高响应稳定性的框架。经验性结果显示,CARG在不牺牲准确性的情况下大大提高了反应稳定性,强调其在关键应用程序中可靠部署LM的潜力。
Article 63
Title@2025-05-29 (4): Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt
Title: Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt | Überdenken in der langen Kette des Denkens aus der Perspektive des Selbstzweifels | 从自杜卜特的视角重新思考长期思维链中的过度思考问题 2505.23480v1 |
Authors: Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Dacheng Tao
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking, performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model’s over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets on four widely used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.
大型语言模型(LLLMS)在复杂任务方面表现出了令人印象深刻的成绩,这主要是因为采用了长链思维(Long CoT)推理。然而,这些模型往往表现出过度思考 – – 即使在得出正确答案后仍采取不必要的推理步骤。先前的工作主要侧重于通过对长 CoTs 进行基于抽样的观察,对过度思考进行定性分析。相比之下,我们从自我怀疑的角度对过度思考进行定量分析,其特点是过度象征性使用,用于重新验证已经正确的答案。我们发现,自欺欺人极大地助长了过度思考。作为回应,我们引入了简单有效的快速方法,以减少模型过分依赖投入问题,从而避免自欺人。具体地说,我们首先促使模型质疑投入问题的有效性,然后根据评估结果作出简明的答复。在三个数学推理任务和四个缺少的数据集上进行的实验表明,我们的方法大大缩短了答案长度,在四个广泛使用的RLLMS中几乎所有数据集中都取得了显著的改进。进一步的分析表明,我们的方法有效地减少了自我利用的推理学。
Article 64
Title@2025-05-29 (4): Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
Title: Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons | Bewertung der Leistungsfähigkeit und Fragilität großer Sprachmodelle auf der Selbsteinschätzung für neurologische Chirurgen | 评价神经外科医生自我评估大语言模型的性能和脆弱性 2505.23477v1 |
Authors: Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann
The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models’ (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with one model that had previously passed now failing. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
神经外科医师大会神经外科医师自我评估(CNS-SANS)题库被神经外科住院医师广泛用于准备委员会笔试。近来,这些题目也被用作评估大型语言模型(LLM)神经外科知识的基准。本研究旨在评估最先进的LLM在神经外科委员会式题目上的表现,并评估其对加入干扰性陈述的鲁棒性。研究使用28个大型语言模型进行了全面评估,这些模型在源自CNS-SANS的2,904道神经外科委员会考试题目上接受了测试。此外,研究引入了一个干扰框架来评估这些模型的脆弱性。该框架加入了简单且无关的干扰性陈述,其中包含具有临床含义但用于非临床语境的多义词,以确定此类干扰在多大程度上降低模型在标准医学基准上的表现。在28个受测LLM中,有6个达到了通过委员会考试的水平,表现最好的模型得分高出及格线15.7%以上。在加入干扰后,各种模型架构的准确率均显著下降,降幅最高达20.4%,其中一个此前通过的模型未能通过。在加入干扰项后,通用型和医学专用的开源模型的性能下降幅度均大于专有模型。尽管当前的LLM在回答神经外科委员会式考试题目方面表现出令人印象深刻的能力,但其表现明显容易受到无关干扰信息的影响。这些发现凸显了开发新的缓解策略的迫切需要,以增强LLM对文本内干扰的抵御能力,尤其是为了实现安全有效的临床部署。
Article 65
Title@2025-05-29 (4): Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Title: Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | Socratic-PRMBench: Benchmarking-Prozess-Reward-Modelle mit systematischen Begründungsmustern | Socratic-PRMBench:有系统说明理由模式的基准进程奖励模式 2505.23474v1 |
Authors: Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
在现实世界的情景中,法学硕士可能运用各种推理模式(如分解)来解决问题,可能因各种推理模式中的错误而受害。因此,理赔模式需要查明在推理过程中各种推理模式下的错误。然而,现有基准主要侧重于通过核实每个中间推理步骤的正确性,对理赔模式进行分解,从而核实每个中间推理步骤的正确性,从而在复杂的推理和解决问题的任务中(例如,具有长相决策作用的法学硕士人员)至关重要。为了缩小这一差距,我们引入了专制-PRMBench,这是在六个推理模式下系统评估理学模型的新基准,这些模式包括变换、变形、变形、变形、核查和一体化,以及2995条推理路径与上述六种推理模式中的缺陷相匹配。通过我们关于作为批评模型而推出的PRM和LMs的实验,我们发现现有理赔模式中的明显缺陷。这些观察突出表明,目前的PRMS在进行关于以系统推理学方式进行PRM模式下的未来推理学模式之下,可以根据各种推理的推理学模式。
Article 66
Title@2025-05-29 (4): BenchmarkCards: Large Language Model and Risk Reporting
Title: BenchmarkCards: Large Language Model and Risk Reporting | BenchmarkCards: Großes Sprachmodell und Risikoberichterstattung | 基准目录:大语言模式和风险报告 2410.12974v2 |
Authors: Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla
Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs. Data & Code: https://github.com/SokolAnn/BenchmarkCards
大型语言模型(LLMS)是能够处理不同任务的有力工具。比较和选择适合具体任务的LLMS需要系统化的评估方法,因为模型在不同领域表现出不同的能力。然而,鉴于有许多可供选择的备选办法,很难找到适当的基准。这种复杂性不仅增加了基准误用和误解的风险,而且要求LLM用户作出大量努力,为其具体需要寻找最合适的基准。为了解决这些问题,我们引入了BenchmarkCards,这是一个直观和经过验证的文件框架,使目标、方法、数据来源和限制等关键基准属性标准化。通过涉及基准创建者和用户的用户研究,我们表明BenchmarkCards可以简化基准选择和提高透明度,便利在评估LLM时作出知情决策。数据和代码:https://github.com/SokolAnn/BenchmarkCards
Article 67
Title@2025-05-29 (4): Agentic Knowledgeable Self-awareness
Title: Agentic Knowledgeable Self-awareness | Agentisch sachkundiges Selbstbewußtsein | A. 动态知识自觉意识 2504.03553v2 |
Authors: Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a “flood irrigation” methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent’s self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
大型语言模型(LLMS)在各种代理规划任务中取得了相当大的成绩,然而,传统代理规划方法采用了一种“洪水灌溉”方法,不加区别地将金轨、外部反馈和领域知识注入代理模型中,这种做法忽略了在决策过程中对情况自我认识的基本人类认知原则,即动态地评估形势需求和在决策中战略性地利用资源的能力。我们提出了一种具有代理知识的自我意识来解决这一差距的新模式,使以LLM为基础的代理能够自主地规范知识的利用。具体地说,我们提出了一种以数据为中心的方法,将具有了解情况的自我认识的代理人应用到像人类那样有知识的自我意识的代理人。具体地说,我们设计了一种超常状况判断标准,以标志该代理人收集培训数据的自我探索轨迹的特殊标志。通过两阶段的培训过程,该代理模型可以在不同情况之间转换,产生特定的特殊标志,以最低的成本实现最佳的规划效果。我们的实验表明,“了解自我”可以超越不同任务和模型上的各种强的基线,而很少使用外部知识。《准则》可在 https://gimb/commus.
Article 68
Title@2025-05-29 (4): UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions
Title: UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions | UAQFact: Bewertung der tatsächlichen Wissensnutzung von LLMs auf unbeantwortbaren Fragen | UAQFact:评估关于无法回答问题LLMs的实情知识利用情况 2505.23461v1 |
Authors: Chuanyuan Tan, Wenbiao Shao, Hao Xiong, Tong Zhu, Zhenhua Liu, Kai Shi, Wenliang Chen
Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs’ performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs’ ability to utilize their factual knowledge when handling UAQ. To address this limitation, we introduce a new unanswerable question dataset, UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs’ ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge, which may result in incorrect responses.
处理无法解答的问题(UAQ)对LLMs至关重要,因为它有助于防止在复杂情况下作出误导性反应。虽然以前的研究已经建立了若干数据集来评估LLMs在UAQ上的绩效,但这些数据集缺乏事实知识支持,限制了LLMs在处理UAQ时利用事实知识的能力。为了解决这一限制,我们引入了一个新的无法解答的问题数据集UAQFact,这是一个双语数据集,由知识图生成的辅助事实知识构成。基于UAQFact,我们进一步界定了两项新任务,以衡量LMs利用内部和外部事实知识的能力。我们通过多个LLMM系列的实验结果表明,UAQFact提出了重大挑战,因为LMs并不始终如一地运行,即使它们储存了事实知识。此外,我们发现纳入外部知识可以提高绩效,但LMs仍然无法充分利用可能导致错误反应的知识。
Article 69
Title@2025-05-29 (4): Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models
Title: Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models | Teilen und Erobern: Eine hybride Strategie besiegt multimodale große Sprachmodelle | 差异和征服:混合战略失败 多种多模式大语言模式 2412.16555v3 |
Authors: Yanxu Mao, Peipei Liu, Tiehan Cui, Zhaoteng Yan, Congying Liu, Datao You
Large language models (LLMs) are widely applied in various fields of society due to their powerful reasoning, understanding, and generation capabilities. However, the security issues associated with these models are becoming increasingly severe. Jailbreaking attacks, as an important method for detecting vulnerabilities in LLMs, have been explored by researchers who attempt to induce these models to generate harmful content through various attack methods. Nevertheless, existing jailbreaking methods face numerous limitations, such as excessive query counts, limited coverage of jailbreak modalities, low attack success rates, and simplistic evaluation methods. To overcome these constraints, this paper proposes a multimodal jailbreaking method: JMLLM. This method integrates multiple strategies to perform comprehensive jailbreak attacks across text, visual, and auditory modalities. Additionally, we contribute a new and comprehensive dataset for multimodal jailbreaking research: TriJail, which includes jailbreak prompts for all three modalities. Experiments on the TriJail dataset and the benchmark dataset AdvBench, conducted on 13 popular LLMs, demonstrate advanced attack success rates and significant reduction in time overhead.
大型语言模式(LLMS)由于具有强大的推理力、理解力和生成能力,在社会各个领域广泛应用,但是,与这些模式有关的安全问题正在变得日益严重;研究人员探索了作为发现LMS脆弱性的一个重要方法的侵入性袭击,试图通过各种攻击方法诱使这些模式产生有害内容;然而,现有的侵入性方法面临许多限制,如过度的查询计数、越狱模式覆盖面有限、攻击成功率低和简单评估方法;为克服这些限制,本文件建议采用多式联运破狱方法:JMLLMM(JMLLMLM),这种方法综合了在文字、视觉和听力方式上全面侵入性袭击的多种战略;此外,我们为多式破狱研究提供了一套新的综合数据集:TriJail,其中包括所有三种模式的越狱前提示。对TriJail数据集和基准数据集AdvBench的实验,对13个流行LMLM公司进行了实验,显示攻击成功率较高和时间过低。
Article 70
Title@2025-05-29 (4): GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
Title: GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning | GSQ-Tuning: Group-Shared Exponents integer in einer voll quantifizierten Schulung für LLMs On-Device-Fine-Tuning | GSQ-Turning:为在线设计精微调LLM女士提供全面量化培训的集团共享指数整数 2502.12913v3 |
Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
Fine-tuning technologies for Large Language Models (LLMs) have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require large Floating Point (FP) computation, raising privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory and compute efficient. We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x. Moreover, compared to FP8, our method can reduce power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.
大型语言模型(LLM)的微调技术取得了显著成果。然而,传统的LLM微调方法面临重大挑战:它们需要大量浮点(FP)计算,在处理敏感数据时引发隐私问题,并且不适用于资源受限的边缘设备。虽然参数高效微调(PEFT)技术减少了可训练参数,但其对浮点运算的依赖与边缘硬件存在根本性的不兼容。在这项工作中,我们提出了一个名为GSQ-Tuning的设备端LLM微调新框架,在推理和训练中都无需浮点运算。其核心是组共享指数整数(Group-Shared Exponents Integer)格式,该格式通过在参数组内共享指数,以整数形式高效地表示模型参数。与类似LoRA的适配器结合后,可以实现内存和计算都高效的全整数微调。我们证明,该方法的精度与基于BF16的微调相当,同时将内存使用减少1.85倍。此外,与FP8相比,在性能相同的情况下,我们的方法可将功耗降低5倍、芯片面积减少11倍,使大规模模型适配在边缘设备上成为可行。
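The idea of a group of integers sharing one exponent can be illustrated with a minimal sketch. This is not the GSQ-Tuning implementation: the function names, the 8-bit width, and the choice of a power-of-two scale taken from the group maximum are assumptions made only for illustration.

```python
import math

def quantize_group(values, bits=8):
    """Quantize one group of floats to signed integers that share a single
    power-of-two exponent (hypothetical helper, for illustration only)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Smallest exponent e such that max_abs / 2**e fits within qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Round to the nearest integer and clamp to the representable range.
    mantissas = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return mantissas, exponent

def dequantize_group(mantissas, exponent):
    """Recover approximate floats from the integers and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]
```

For the group [0.5, -1.0, 0.25, 0.75] the shared exponent is -6 and the integers are [32, -64, 16, 48], which dequantize exactly. In general the per-element error is bounded by half of the shared scale, so one exponent per group trades a little precision for integer-only storage.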
Article 71
Title@2025-05-29 (4): CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning
Title: CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning | CodePMP: Skalierbares Präferenzmodell Vorschulung für großsprachliche Modellaufklärung | 守则PMP:可缩放的特惠模式大语言示范理由预培训模式 2410.02229v2 |
Authors: Huimu Yu, Xing Wu, Haotian Xu, Debing Zhang, Songlin Hu
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
大型语言模式(LLMS)在自然语言理解和生成方面取得了显著进展,其驱动力是可升级的预培训前和高级微调,然而,通过强化从人类反馈(RLHF)学习,增强LLMS的推理能力,由于缺少高质量的优惠数据,仍然具有挑战性,因为高品质的优惠数据是劳动密集型的,对评分和奖励模式(RM)的微调至关重要。为了缓解这一问题,我们引入了可升级的优惠模式CodPMP,这是一个可升级的优惠模式预培训(PMP)管道,它利用大量来自公开的高质量源码的合成编码参考配对。 DCPMP通过大规模综合编码首选模式前培训大型综合编码首选模式(GSM8K,MATH)和逻辑推理任务(ReClor, LogiQA2.0),我们不断显示LMS的推理表现的重大改进,并强调可扩展的优惠模式预培训对于高效率的建模的重要性。
Article 72
Title@2025-05-29 (4): Rethinking Regularization Methods for Knowledge Graph Completion
Title: Rethinking Regularization Methods for Knowledge Graph Completion | Überdenken von Regularisierungsmethoden für Wissensgraphenvervollständigung | 重新思考知识图完成正规化方法 2505.23442v1 |
Authors: Linyu Li, Zhi Jin, Yuanpeng He, Dongming Jin, Haoran Duan, Zhengwei Tao, Xuan Zhang, Jiandong Li
Knowledge graph completion (KGC) has attracted considerable attention in recent years because it is critical to improving the quality of knowledge graphs. Researchers have continuously explored various models. However, most previous efforts have neglected to examine regularization from a deeper perspective, so regularization methods have not been used to their full potential. This paper rethinks the application of regularization methods in KGC. Through extensive empirical studies on various KGC models, we find that carefully designed regularization not only alleviates overfitting and reduces variance but also enables these models to break through the upper bounds of their original performance. Furthermore, we introduce a novel sparse-regularization method that embeds the concept of rank-based selective sparsity into the KGC regularizer. The core idea is to selectively penalize the components of the embedding vector that carry significant features, effectively ignoring the many components that contribute little and may only represent noise. Comparative experiments on multiple datasets and multiple models show that the SPR regularization method outperforms other regularization methods and enables KGC models to push further past their previous performance limits.
近些年来,知识图的完成(KGC)由于对提高知识图的质量至关重要,所以引起了相当大的注意。研究人员不断探索各种模型。然而,以往的多数努力都忽略了从更深的视角利用正规化,因此没有充分利用其潜力。本文件重新思考了在KGC应用正规化方法的问题。通过对各种KGC模型的广泛经验研究,我们发现,经过精心设计的正规化不仅减轻了过度和减少差异,而且使这些模型能够突破其原始性能的上限。此外,我们引入了一种新的稀有常规化方法,将基于等级的选择性聚变概念嵌入KGC正规化器中。核心思想是选择性地惩罚那些在嵌入矢量中具有重要特征的成分,从而实际上忽略了许多很少起作用的成分,而且可能只是代表噪音。关于多个数据集和多个模型的各种比较实验表明,SPR正规化方法比其他正规化方法要好,能够使KGC模型进一步突破性差。
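The "rank-based selective sparsity" idea described above can be sketched as a penalty that only counts the largest-magnitude components of an embedding. The abstract does not give the exact SPR formula, so the function name, the cutoff `k`, and the power `p` below are assumptions for illustration rather than the paper's definition.

```python
def selective_sparse_penalty(embedding, k, p=3):
    """Penalize only the k largest-magnitude components of an embedding.

    Components are ranked by absolute value; the top k contribute |x|**p
    to the penalty, and the remaining (small) components are ignored.
    Hypothetical sketch, not the SPR regularizer as published.
    """
    magnitudes = sorted((abs(x) for x in embedding), reverse=True)
    return sum(m ** p for m in magnitudes[:k])
```

With `k` equal to the embedding dimension this reduces to an ordinary power-norm penalty over all components; smaller `k` leaves low-magnitude components unpenalized.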
Article 73
Title@2025-05-29 (4): DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Title: DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? | DeepSeek vs. o3-mini: Wie gut können Reasoning-LLMs MT und Zusammenfassung bewerten? | DeepSeek对o3-mini:推理型LLM评价机器翻译和摘要的效果如何? 2504.08120v2 |
Authors: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture- and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models generally allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and comparison to non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
具备推理能力的大型语言模型(LLM)在逻辑任务上表现出色,但它们在评价自然语言生成方面的效用仍未得到探讨。本研究在机器翻译和文本摘要评价任务上,系统比较了推理型LLM与非推理型LLM。我们评价了八个模型,包括最先进的推理模型(DeepSeek-R1、OpenAI o3)、它们的蒸馏变体(8B-70B参数)以及相应的非推理LLM。在WMT23和SummEval基准上的实验显示,收益取决于架构和任务:OpenAI o3-mini模型在机器翻译评价上随推理量增加而性能提升,而DeepSeek-R1除摘要一致性评价外,总体上不及其非推理变体。相关性分析表明,推理token用量仅在特定模型中与评价质量相关,而几乎所有模型在发现更多质量问题时通常会分配更多推理token。蒸馏在32B参数规模以内保持合理性能,但在8B规模上大幅下降。本工作首次评估了推理型LLM用于NLG评价的效果,并与非推理模型进行了比较。我们公开代码以促进后续研究:https://github.com/NL2G/reasoning-eval。
Article 74
Title@2025-05-29 (4): LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding | LLM als Effektiver Streaming-Prozessor: Überbrückung von Streaming-Batch-Mismatches mit Gruppenpositionskodierung | LLM 有效流化处理程序: 将流流-批量错误与群居位置编码连接起来 2505.16983v2 |
Authors: Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
大型语言模型(LLM)主要为批处理而设计。现有的将LLM适配到流式场景的方法,要么依赖代价高昂的重新编码,要么依赖可扩展性有限的专用架构。本工作指出了将面向批处理的LLM适配到流式处理时的三个关键不匹配:(1) 输入注意力不匹配,(2) 输出注意力不匹配,(3) 位置ID不匹配。通常认为后两种不匹配需要频繁重新编码,但我们的分析表明,只有输入注意力不匹配会显著影响性能,这说明对输出重新编码在很大程度上是不必要的。为了更好地理解这一与普遍假设的差异,我们首次全面分析了位置编码对流式LLM的影响,表明保持源上下文和目标上下文内部的相对位置比维持绝对顺序更为关键。基于上述分析,我们提出了一种建立在批处理架构之上的组位置编码范式,以增强流式模式与批处理模式之间的一致性。在跨语言和跨模态任务上的大量实验表明,我们的方法优于现有方法。该方法无需修改架构,并在流式和批处理两种模式下都表现出很强的泛化能力。代码见 https://github.com/EIT-NLP/StreamingLLM。
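One way to picture position IDs that preserve relative order within the source and target groups, rather than a single absolute order, is the sketch below. It is an assumption-laden illustration and not the paper's scheme: the "src"/"tgt" labels, the function name, and the optional `target_offset` are all hypothetical.

```python
def group_position_ids(groups, target_offset=0):
    """Assign position IDs counted separately within each group.

    `groups` lists, in arrival order, whether each token belongs to the
    source ("src") or the target ("tgt"). Each group keeps its own counter,
    so relative positions inside a group do not depend on how the two
    streams happen to interleave. Hypothetical sketch.
    """
    counters = {"src": 0, "tgt": target_offset}
    ids = []
    for g in groups:
        ids.append(counters[g])
        counters[g] += 1
    return ids
```

For the arrival order src, src, tgt, src, tgt this yields [0, 1, 0, 2, 1]: the source tokens are numbered 0, 1, 2 and the target tokens 0, 1, exactly as they would be if each side had been processed as one contiguous batch.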
Article 75
Title@2025-05-29 (4): SPRI: Aligning Large Language Models with Context-Situated Principles
Title: SPRI: Aligning Large Language Models with Context-Situated Principles | SPRI: Ausrichtung großer Sprachmodelle mit kontext-situierten Prinzipien | SPRI:使大语言模式与上下文原则相一致 2502.03397v2 |
Authors: Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that perform on par with expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.
让大型语言模型吸收并体现人类价值观十分艰难,对于需要精细人工监督的任务尤其如此,因为依赖人类专家提供针对具体情境的指导既耗费资源又耗时。此前的工作使用预先定义的规则或原则集合来引导模型行为(Bai et al., 2022; Sun et al., 2023)。然而,这些原则往往比较笼统,难以适配每一个具体的输入查询或上下文。在这项工作中,我们提出Situated-PRInciples(SPRI),这是一个几乎或完全不需要人工的框架,旨在为每个输入查询实时自动生成指导原则,并利用这些原则来对齐每个回复。我们在三项任务上评价SPRI,结果表明:1) SPRI可以在复杂的特定领域任务中推导出原则,其效果与专家撰写的原则相当;2) SPRI生成的原则可以产生针对具体实例的评分细则,优于此前的LLM-as-a-judge框架;3) 使用SPRI生成合成SFT数据可以显著提升真实性。我们在 https://github.com/honglizhan/SPRI-public 公开了代码和模型生成结果。
Article 76
Title@2025-05-29 (4): DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Title: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation | DynaCode: Ein dynamischer, komplexitätsbewusster Code-Benchmark für die Bewertung großer Sprachmodelle in der Codegenerierung | DynaCode:用于评价大型语言模型代码生成能力的动态、感知复杂度的代码基准 2503.10452v2 |
Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu
The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode’s ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code. Our benchmark and evaluation code are available at https://github.com/HWH-2000/DynaCode.
大型语言模型(LLM)的快速进步显著提升了其在代码生成任务上的表现。然而,现有的代码基准仍是静态的,由包含预定义问题的固定数据集构成。这使它们容易在训练中被记忆:LLM会回忆起具体的测试用例,而不是泛化到新问题,从而导致数据污染和不可靠的评价结果。为解决这些问题,我们提出DynaCode,这是一个动态的、感知复杂度的基准,克服了静态数据集的局限。DynaCode使用一种感知复杂度的指标系统地评价LLM,同时考虑代码复杂度和调用图结构。DynaCode实现了大规模的多样性,可在四个不同的代码复杂度等级(称为单元)和16种调用图上生成多达1.89亿个互不相同的嵌套代码问题。在12个最新LLM上的结果显示,与静态代码生成基准MBPP+相比,平均性能下降16.8%至45.7%,并且性能随复杂度提高而逐步下降。这表明DynaCode能够有效区分不同的LLM。此外,借助调用图,我们得以深入了解LLM的行为,尤其是它们在处理嵌套代码中子函数交互时的偏好。我们的基准和评价代码见 https://github.com/HWH-2000/DynaCode。
Article 77
Title@2025-05-29 (4): Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Title: Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition | Probeneffiziente menschliche Bewertung großer Sprachmodelle durch maximalen Diskrepanzwettbewerb | 通过最大差异竞争对大语言模式进行抽样有效人力评价 2404.08008v2 |
Authors: Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, Shuangquan Guo, Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, Huajun Chen
Reliable evaluation of large language models (LLMs) is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers “gold-standard” model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval .
对大型语言模型(LLMS)的可靠评价受到两大挑战的阻碍:客观指标往往不能反映人类对自然语言的看法,而详尽的人类标签则极其昂贵。在这里,我们提议根据Meximum差异(MAD)竞争原则,对LLMS进行抽样有效的人类评价。我们的方法自动和适应性地选择一套紧凑的投入指示,最大限度地扩大LLM对口答复之间的语义差异。然后,人类评价者对这些配对反应进行三种选择性强迫选择,然后用Elo等级汇总为全球排名。我们采用我们的方法,将八种广泛使用的LLMS对四大任务进行比较:科学知识理解、数学推理、创造性和功能性写作、代码生成和解释。实验结果表明,我们的抽样有效评价方法恢复了“古老标准”模型的排名,并有少数MAD选定的指示,揭示了每个LM的长处和短处,并提供了细微的洞察见解,以指导未来的LM发展。代码可在https://github.com/weiji-Feng/MAD-Eval查阅。
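The aggregation of three-alternative forced choices into a ranking via Elo rating can be sketched with the standard Elo update. The abstract does not state the paper's exact settings, so the K-factor of 32, the starting ratings, and the mapping of the three choices to scores 1, 0.5, and 0 are assumptions for illustration.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if the evaluator preferred A, 0.0 if B was preferred,
    and 0.5 for a tie (the third alternative). k is an assumed K-factor.
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Starting from equal ratings of 1000, a win for A moves the pair to (1016, 984), while a tie leaves both unchanged. Repeating the update over all human judgments on the MAD-selected pairs gives a global ordering of the models.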
Article 78
Title@2025-05-29 (4): The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence
Title: The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence | Das Warmup-Dilemma: Wie sich Lernratenstrategien auf die Konvergenz von Sprach-Text-Modellen auswirken | 暖化困境:学习速率战略如何影响演讲到文字模式模式汇合 2505.23420v1 |
Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture – e.g., Conformer or Branchformer – are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
培训大型模式不仅在资源需求方面而且在其趋同方面都提出了挑战。为此原因,当模型规模扩大时,学习率(LR)往往会降低,这种简单的解决方案在语音到文字(S2T)培训方面是不够的,因为在这种培训中,由于变换器结构的演进和更为复杂的变式(例如,变换或分解)的性能较好而使用。作为一种变通办法,OWSM设计了双线LR暖和双线性,在升级到第二阶段更高价值之前,第一阶段的学习率(LR)往往降低到很小的价值。虽然这一解决方案在实践中效果良好,但与替代解决方案相比并不充分,而且所研究的对不同LR暖化时间表的最后绩效的影响也不充分。这份文件填补了这一空白,表明,一)大规模S2T培训需要次级的LR暖化,二)在暖化阶段的更高LR加速初步趋同,但并没有促进最后的绩效。
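The double linear warmup described above (a ramp to a very small learning rate, followed by a ramp to a higher one) can be written as a piecewise-linear schedule. This is a minimal sketch: the parameter names and any concrete values are hypothetical, and the decay that normally follows warmup is left out.

```python
def double_linear_warmup(step, low_lr, peak_lr, phase1_steps, phase2_steps):
    """Piecewise-linear learning-rate warmup in two phases.

    Phase 1: ramp linearly from 0 to low_lr over phase1_steps.
    Phase 2: ramp linearly from low_lr to peak_lr over phase2_steps.
    Afterwards the rate stays at peak_lr (decay is omitted in this sketch).
    """
    if step < phase1_steps:
        return low_lr * step / phase1_steps
    if step < phase1_steps + phase2_steps:
        frac = (step - phase1_steps) / phase2_steps
        return low_lr + (peak_lr - low_lr) * frac
    return peak_lr
```

A single linear warmup is the special case `low_lr == peak_lr * phase1_steps / (phase1_steps + phase2_steps)`, where both phases have the same slope; choosing a much smaller `low_lr` keeps the earliest updates gentle, which is the behaviour the abstract attributes to the OWSM workaround.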
Article 79
Title@2025-05-29 (4): SWE-bench Goes Live!
Title: SWE-bench Goes Live! | SWE-bench geht live! | SWE-bench 上线了! 2505.23419v1 |
Authors: Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
问题修复任务要求模型生成补丁来修复真实世界的缺陷,它已成为评价大型语言模型(LLM)能力的关键基准。虽然SWE-bench及其变体已成为该领域的标准,但它们存在关键局限:自首次发布以来没有更新,只覆盖很少的代码仓库,并且在实例构建和环境搭建上严重依赖人工。这些因素阻碍了可扩展性,并带来过拟合和数据污染的风险。在这项工作中,我们提出SWE-bench-Live,这是一个可持续更新的基准,旨在克服这些挑战。我们的首次发布包含1,319个任务,来自2024年以来创建的真实GitHub issue,涵盖93个仓库。每个任务都配有专用的Docker镜像,以确保执行可复现。我们基准的核心是\method,这是一条自动化的构建流水线,打通了从实例创建到环境搭建的整个流程,消除了人工瓶颈,实现了可扩展性和持续更新。我们在SWE-bench-Live上评价了一系列最先进的智能体框架和LLM,结果显示,即使在受控的评价条件下,与SWE-bench等静态基准相比仍存在显著的性能差距。为了更好地理解这一差异,我们从仓库来源、issue的新近程度和任务难度等方面进行了详细分析。SWE-bench-Live提供了一个以活跃仓库动态为基础的全新、多样且可执行的基准,有助于在动态的真实软件开发环境中对LLM和智能体进行严格且抗污染的评价。
Article 80
Title@2025-05-29 (4): On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists
Title: On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists | On-Device Collaborative Language Modeling über eine Mischung aus Generalisten und Spezialisten | 通过通识主义者和专家混合组合的在线合作语言建模 2409.13931v4 |
Authors: Dongyang Fan, Bettina Messmer, Nikita Doikov, Martin Jaggi
On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate private learning with scarce data, Federated Learning has become a standard approach. However, it faces challenges such as computational resource heterogeneity and data heterogeneity among end users. We propose CoMiGS (Collaborative learning with a Mixture of Generalists and Specialists), the first approach to address both challenges. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is optimized using a separate validation set to ensure alignment with the target distribution. We solve our objective with alternating minimization, for which we provide a theoretical analysis. Our method shares generalist experts across users while localizing a varying number of specialist experts, thereby adapting to users’ computational resources and preserving privacy. Through extensive experiments, we show CoMiGS effectively balances general and personalized knowledge for each token generation. We demonstrate that CoMiGS remains robust against overfitting, due to the generalists’ regularizing effect, while adapting to local data through specialist expertise. We open-source our codebase for collaborative LLMs.
设备端LLM因能够增强隐私并提供个性化的用户体验而日益受到关注。为了在数据稀缺的情况下实现隐私学习,联邦学习已成为一种标准方法。然而,它面临终端用户之间计算资源异构和数据异构等挑战。我们提出CoMiGS(Collaborative learning with a Mixture of Generalists and Specialists,即通才与专才混合的协作学习),这是首个同时应对这两项挑战的方法。我们方法的一项关键创新是对专家混合(Mixture-of-Experts)学习目标的双层优化表述,其中路由器使用单独的验证集进行优化,以确保与目标分布对齐。我们用交替最小化求解该目标,并给出了理论分析。我们的方法在用户之间共享通才专家,同时在本地保留数量可变的专才专家,从而适应用户的计算资源并保护隐私。通过大量实验,我们表明CoMiGS能在每个token的生成中有效平衡通用知识和个性化知识。我们证明,得益于通才的正则化作用,CoMiGS对过拟合保持稳健,同时通过专才的专长适应本地数据。我们开源了用于协作式LLM的代码库。
Article 81
Title@2025-05-29 (4): LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline
Title: LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline | LLMs können qualitativ hochwertige Simultane Machine Translation so effizient wie Offline erreichen | LLM Can 能够像离线那样高效率地实现高质量同声机翻译 2504.09570v2 |
Authors: Biao Fu, Minpeng Liao, Kai Fan, Chengxi Li, Liang Zhang, Yidong Chen, Xiaodong Shi
When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt “Translate the following sentence from [src lang] into [tgt lang]:”. However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required; in that setting, the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks and preserves the original offline translation abilities. Moreover, our approach generalizes well to the document-level SiMT setting without requiring specific fine-tuning, even surpassing the offline translation model.
当提供完整的源句时,大型语言模型(LLMS)在离线机器翻译中表现优异,即使简单快速地“将以下句子从[src 朗 转换为[tgt 朗 ” 。然而,在许多真实情况下,源牌以流流方式到达,同时机器翻译(SiMT)是需要的,因此,只读解码的LLMS的效率和性能受到自动递减性质的极大限制。为使LLMS能够像离线翻译那样高效率地实现高质量的SimMT,我们提议了一个新的模式,其中包括为SimMT建立监管的微调数据,以及新的培训和推断战略。要复制SimMT、源和目标标牌的象征性投入/输出流,将其重新排列成一个相互脱节的顺序,通过特殊标记将其分开,因其自动递减性能要求,从而使强大的LMSMS能够根据不同的升调速度学习和适应操作,同时保持高效的自动递增分解。实验性结果显示,即使SFT的原始的翻译水平不超出了我们总体能力,也保持了SMT系统的特定数据。
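The rearrangement of source and target tokens into one interleaved sequence can be sketched as follows. This is only an illustration of the general idea: the special tokens `<src>` and `<tgt>`, the fixed `chunk` size standing in for a latency requirement, and the proportional split of the target are all assumptions, whereas real SFT data would rely on an actual source-target alignment.

```python
def interleave(source, target, chunk=2, src_tag="<src>", tgt_tag="<tgt>"):
    """Interleave source and target tokens into a single read/write sequence.

    The source is read in pieces of `chunk` tokens; after each piece a
    proportional share of the target is written. Smaller chunks mimic
    lower latency. Hypothetical sketch with made-up separator tokens.
    """
    n_chunks = max(1, -(-len(source) // chunk))   # ceiling division
    sequence = []
    for i in range(n_chunks):
        src_part = source[i * chunk:(i + 1) * chunk]
        start = i * len(target) // n_chunks
        end = (i + 1) * len(target) // n_chunks
        tgt_part = target[start:end]
        if src_part:
            sequence += [src_tag] + src_part      # "read" step
        if tgt_part:
            sequence += [tgt_tag] + tgt_part      # "write" step
    return sequence
```

With `chunk` equal to the source length the sequence degenerates to the offline layout (whole source, then whole target), which is why a single decoder-only model can in principle serve both settings with ordinary auto-regressive decoding.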
Article 82
Title@2025-05-29 (4): From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs
Title: From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs | Von Parametern zu Prompts: Den Factuality Gap zwischen fein getunen LLMs verstehen und abschwächen | 从参数到提示:了解并缩小微量贷款商之间的实际质量差距 2505.23410v1 |
Authors: Xuan Gong, Hanbo Huang, Shiyu Liang
Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between finetuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.
事实知识的提取旨在明确提取在经过培训的语文模型中用作下游任务应用的知识参数。虽然先前的工作一直在调查对大型语文模型(LLMs)真实性有监督的微调数据的影响,但其机制仍然不易理解。我们通过系统实验重新审视这一影响,特别侧重于在对已知和未知知识进行微调时产生的事实质量差距。我们的调查结果表明,在推论阶段,无论是在分配外(OOOD)设置下,还是通过使用适当的文本内学习(ICL)提示(即少见的学习和思维链(CoT)),都可以缩小这一差距。我们从知识图表的角度从理论上证明了这一现象,表明测试时间的及时性可能会减少甚至掩盖微调数据的影响,并在知识提取中起到主导作用。最后,我们的结果说明了微调数据与测试时间的及时性之间的相互作用,表明ICL可以有效地弥补微调数据的缺陷,并强调需要重新考虑使用ICL(CL)来评估数据选择方法的有效性。
Article 83
Title@2025-05-29 (4): EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
Title: EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse | EFIM: Effizientes Servieren von LLMs zur Erfüllung von Aufgaben mit verbesserter KV Cache Reuse | EFIM:以改进的KV缓存再利用高效率地为完成任务的LLMs服务 2505.21889v2 |
Authors: Tianyu Guo, Hande Dong, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao, Xianwei Zhang
Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM’s source code is publicly available at https://github.com/gty111/EFIM.
大型语言模型(LLM)常用于填充任务,即预测或生成给定文本中缺失的信息。这类任务通常需要在相似上下文下进行多次交互。为了减少对重复历史token的计算,跨请求键值(KV)缓存复用(一种存储并复用中间计算结果的技术)已成为多轮交互服务中的关键方法。然而,在填充任务中,KV缓存复用常常受到提示格式结构的阻碍,该格式通常由相对于插入点的前缀和后缀组成。具体而言,随着另一部分(后缀或前缀)的逐步生成,前缀或后缀部分的KV缓存会频繁失效。为解决这一问题,我们提出EFIM,这是一种对FIM提示格式的变换,用以释放KV缓存复用的性能潜力。变换后的提示虽然能解决低效问题,但暴露了当前LLM的子词(subtoken)生成问题,即它们难以准确生成不完整的单词。因此,我们引入一种片段分词训练方法,在数据处理过程中先将文本切分为多个片段,再进行分词。在两个有代表性的LLM上的实验表明,采用EFIM的LLM服务可将延迟降低52%,吞吐量提高98%,同时保持原有的填充能力。EFIM的源代码公开于 https://github.com/gty111/EFIM。
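Why the prompt layout matters for cross-request KV cache reuse can be shown with a small sketch. It only illustrates the problem the abstract describes, under the assumption that a prefix cache can reuse exactly the leading tokens two prompts share; the sentinel tokens and both orderings below are generic illustrations and are not EFIM's actual prompt format.

```python
def reusable_prefix_len(cached, new):
    """Count the leading tokens two prompts share, i.e. the part a
    prefix-based KV cache could reuse (assumed cache model)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def prefix_first_prompt(prefix, suffix):
    """Illustrative layout: prefix, then suffix, then the generation slot."""
    return ["<PRE>"] + prefix + ["<SUF>"] + suffix + ["<MID>"]

def suffix_first_prompt(prefix, suffix):
    """Illustrative alternative layout with the growing part placed last."""
    return ["<SUF>"] + suffix + ["<PRE>"] + prefix + ["<MID>"]

prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
suffix = ["print", "(", "add", "(", "1", ",", "2", ")", ")"]
grown = prefix + ["return"]            # the prefix grows by one token

prefix_first_reuse = reusable_prefix_len(
    prefix_first_prompt(prefix, suffix), prefix_first_prompt(grown, suffix))
suffix_first_reuse = reusable_prefix_len(
    suffix_first_prompt(prefix, suffix), suffix_first_prompt(grown, suffix))
```

In the prefix-first layout only 9 of the 20 cached tokens stay valid once the prefix grows, because everything after the insertion point shifts; in the suffix-first layout 19 of 20 are reusable. This is the kind of gap that motivates transforming the prompt format.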
Article 84
Title@2025-05-29 (4): VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Title: VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining | VietASR: Erzielen von vietnamesischen ASR auf Branchenebene mit 50-Stunden-Daten und großformatigen Sprachvorschulungen | 越南:在越南工业一级实现有50小时标签数据和大型演讲预科培训的有50小时标签的数据的越南ASR 2505.21527v2 |
Authors: Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.
自动语音识别(ASR)取得了显著进展,但在很大程度上依赖大量标签数据,越南语等低资源语言缺少这类数据;虽然Whisper、USM和MMS等现有系统取得了有希望的业绩,但其效力在培训费用、延迟度和可获取性方面仍然不足;为解决这些问题,我们提议ViASR,这是一个新的ASR培训管道,利用大量无标签数据和少量标签数据;通过多点标识ASR带有偏见的自我监督学习大规模无标签数据集,越南ASR为增强ASR绩效提供了具有成本效益和实用的解决方案;实验表明,对70 000小时无标签数据进行预先培训,对仅50小时的标签数据进行微调,产生轻量但强大的ASR模型;它比Whiper大V3和关于现实世界数据的商业ASR系统先进。我们的代码和模型将开放来源,以便利低资源ASR的研究。
Article 85
Title@2025-05-29 (4): Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
Title: Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models | Adaptive Jailbreaking-Strategien basierend auf dem semantischen Verständnis von Fähigkeiten großer Sprachmodelle | 基于大语言模型的语义理解能力 2505.23404v1 |
Authors: Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin
Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques, methods that circumvent their built-in safety and ethical constraints, have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).
通过绕过其内在安全和道德限制的侵入技术方法(LLMS),对大语言模型的反方攻击,绕过其内在安全和伦理限制,已成为AI安全中的一项重大挑战。这些攻击利用了LLMs内在的理解能力方面的弱点,损害了LLMs的可靠性。本文件调查了专门适应不同LMs所显示的不同理解程度的侵入战略的效力。我们提议了基于大语言模型的语义理解能力的适应性监狱破碎战略,这是一个新颖的框架,根据LMs的语义理解能力将其分为I类和II类。我们针对每一类别设计了有针对性的破门战略,旨在利用它们的脆弱性促进成功攻击。对多个LLMs进行的广泛实验表明,我们的适应性战略显著提高了破门的成功率。值得注意的是,我们的方法在GPT-4监狱破门的GPT-4o(2025年5月29日释放)中取得了非常高的98.9%的成功率。
Article 86
Title@2025-05-29 (4): Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms
Title: Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms | Re-Ranking mit großen Sprachmodellen zur Minderung der Exposition gegenüber schädlichen Inhalten auf Social Media-Plattformen | 利用大型语言模式,在社交媒体平台上减少接触有害内容 2501.13977v3 |
Authors: Rajvardhan Oak, Muhammad Haroon, Claire Jo, Magdalena Wojcieszak, Anshuman Chhabra
Social media platforms utilize Machine Learning (ML) and Artificial Intelligence (AI) powered recommendation algorithms to maximize user engagement, which can result in inadvertent exposure to harmful content. Current moderation efforts, reliant on classifiers trained with extensive human-annotated data, struggle with scalability and adapting to new forms of harm. To address these challenges, we propose a novel re-ranking approach using Large Language Models (LLMs) in zero-shot and few-shot settings. Our method dynamically assesses and re-ranks content sequences, effectively mitigating harmful content exposure without requiring extensive labeled data. Alongside traditional ranking metrics, we also introduce two new metrics to evaluate the effectiveness of re-ranking in reducing exposure to harmful content. Through experiments on three datasets, three models and across three configurations, we demonstrate that our LLM-based approach significantly outperforms existing proprietary moderation approaches, offering a scalable and adaptable solution for harm mitigation.
社会媒体平台利用机器学习(ML)和人工智能(人工智能(AI)的动力推荐算法,最大限度地提高用户参与的程度,从而导致无意中接触有害内容。目前,依靠经过广泛人文说明数据培训的分类人员进行温和努力,努力调整和适应新的伤害形式。为了应对这些挑战,我们提议在零发和几发环境中采用新颖的重新排序方法,使用大语言模型(LLLMs),动态评估和重新排序内容序列,有效减轻有害内容的接触,而无需广泛的标签数据。除了传统的排名指标外,我们还采用两种新的衡量标准来评估重新排序以减少接触有害内容的成效。通过三个数据集、三个模型和三个组合的实验,我们证明我们的LLM方法大大优于现有专有的调控方法,为减轻伤害提供了可扩缩和适应性的解决办法。
Article 87
Title@2025-05-29 (4): DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Title: DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding | DREAM: Entwurf mit raffinierten Target-Features und Entropie-Adaptive Cross-Attention Fusion für multimodale spekulative Dekodierung | DREAM: 与改良目标特征和多模式投机下限的 Entropy-Adpy-Adpic 交叉注意聚变一起起草 2505.19201v2 |
Authors: Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, show that DREAM achieves up to 3.6x speedup over conventional decoding and significantly outperforms prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
光学解码(SD)已成为在大型语言模型中加速自动递增生成(LLMs)的有力方法,但将其纳入视觉语言模型(VLMs)仍未得到充分探讨。我们引入了DREAM,这是为VLMs定制的新型投机解码框架,它结合了三项关键创新:(1) 一种基于交叉注意的机制,将目标模型的中间特征注入到改进的模型草案中,(2) 基于注意力激素的适应性中间特征选择,以指导有效的模型培训草案,(3) 视觉象征性压缩,以减少模型草案的延缓性。DREAM使得高效、准确和平行的多式联运解码,并显著地改进了吞化过程。在包括LalaVA、Pixtral、SmolVLM和Gemma3在内的一系列最新的广受欢迎的VLLMSMs中进行的实验显示,在传统解码上加速了3.6x的速度,大大超出SD基准,在一系列多式基准的透视通过和投机性草案接受长度方面明显超出SD。该守则公布于:http://github.com/SA-LAmbY.
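Illustrative sketch for item (2) of the DREAM abstract: ranking candidate target-model layers by attention entropy. The selection rule used here (pick the layer with the lowest mean entropy), the function names, and the toy attention maps are all assumptions made for illustration; the abstract does not state the exact criterion DREAM uses.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def mean_layer_entropy(attn_rows):
    """Average entropy over all query positions of one layer."""
    return sum(attention_entropy(row) for row in attn_rows) / len(attn_rows)

def select_layer(attn_by_layer):
    """Return (index of lowest-entropy layer, list of per-layer entropies).

    Assumption: lower entropy (more focused attention) marks the layer
    whose features are injected into the draft model.
    """
    entropies = [mean_layer_entropy(rows) for rows in attn_by_layer]
    best = min(range(len(entropies)), key=entropies.__getitem__)
    return best, entropies

# Hypothetical attention maps: 3 layers, 2 query positions, 4 key positions.
toy_attention = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # diffuse
    [[0.70, 0.10, 0.10, 0.10], [0.10, 0.70, 0.10, 0.10]],  # moderately focused
    [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]],  # sharply focused
]
best_layer, layer_entropies = select_layer(toy_attention)
print(best_layer, [round(h, 3) for h in layer_entropies])
```

A uniform attention row over four keys has the maximum possible entropy, ln 4, so the diffuse layer ranks last under this rule.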
Article 88
Title@2025-05-29 (4): ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation
Title: ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | ReflectionCoder: Aus Reflexionssequenz lernen für verbesserte Einmal-Code-Generierung | 思考编码:从强化一次性代码生成的反思序列中学习 2405.17057v2 |
Authors: Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Aojun Zhou, Junting Pan, Hongsheng Li
Code generation plays a crucial role in various tasks, such as code auto-completion and mathematical reasoning. Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedback to improve one-off code generation performance. Furthermore, we propose reflection self-distillation and dynamically masked distillation to effectively utilize these reflection sequences. Extensive experiments on three benchmarks, i.e., HumanEval (+), MBPP (+), and MultiPL-E, demonstrate that models fine-tuned with our method achieve state-of-the-art performance. Beyond the code domain, we believe this approach can benefit other domains that focus on final results and require long reasoning paths. Code and data are available at https://github.com/SenseLLM/ReflectionCoder.
代码生成在诸如代码自动完成和数学推理等各种任务中发挥着关键作用。 先前的工作提出了许多提高代码生成性能的方法, 包括整合来自汇编者的反馈。 受此启发, 我们展示了“ 反省 Coder ” 这一新的方法, 有效地利用通过整合汇编者反馈而构建的反射序列来改进一次性代码生成性能。 此外, 我们提出了自我蒸馏和动态掩码蒸馏,以有效利用这些反射序列。 在三个基准上进行的广泛实验, 即 HumanEval (+)、 MBPP (+) 和 MultiPL- E , 表明模型与我们的方法相完善, 实现了最新性能。 我们认为, 在代码领域以外, 这种方法可以惠及其他侧重于最终结果并需要漫长推理路径的领域。 代码和数据可在 https://github.com/SenselLM/ReflectionCoder查阅 。
Article 89
Title@2025-05-29 (4): BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
Title: BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages | BRIGHTER: Die Lücke in Text-Emotions-Erkennungs-Datensätzen für 28 Sprachen bohren | 消除28种语言在载人附加说明的文本情感识别识别数据集方面的差距 2502.11926v4 |
Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Alexander Panchenko, Andrew Piper, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition, an umbrella term for several NLP tasks, impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
全世界人民以微妙和复杂的方式使用语言表达情感。虽然对于国家语言方案的若干任务 – – 影响国家语言方案内外的各种应用来说,情绪识别 – – 总括术语 – – 影响国家语言方案内外的各种应用,但这一领域的大多数工作都侧重于高资源语言。这导致研究工作和拟议解决方案,特别是资源不足的语言,往往缺乏高质量的附加说明的数据集。本文介绍BraightER – – 以28种不同语言和多个领域收集的多标签、有情感附加说明的数据集。Braighter主要涵盖非洲、亚洲、东欧和拉丁美洲的低资源语言,有流利语言的标注。我们强调与数据收集和批注进程有关的挑战,然后报告单一语言和跨语言多标签情感识别的实验结果,以及情感强度识别。我们分析了语言和文本领域业绩的差异,包括不使用LLMS,并表明BraightER数据集是解决基于文字的情感识别差距的有意义的一步。
Article 90
Title@2025-05-29 (4): GWQ: Gradient-Aware Weight Quantization for Large Language Models
Title: GWQ: Gradient-Aware Weight Quantization for Large Language Models | GWQ: Gradient-Aware Weight Quantization für große Sprachmodelle | GWQ: 大语言模型的渐变软件重量 2411.00850v4 |
Authors: Yihua Shao, Yan Gu, Siyu Chen, Haiyang Liu, Zixian Zhu, Zijian Ling, Minxi Yan, Ziyang Yan, Chenyu Zhang, Michele Magno, Haotong Qin, Yan Wang, Jingcai Guo, Ling Shao, Hao Tang
Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. Compressing LLMs to low bit-widths can enable deployment on resource-constrained devices. To address this problem, we propose gradient-aware weight quantization (GWQ), the first low-bit weight quantization approach that leverages gradients to localize outliers, requiring only a minimal amount of calibration data for outlier detection. GWQ preferentially retains the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format. We evaluate GWQ extensively on different tasks, including language modeling, grounding detection, massive multitask language understanding, and vision-language question answering. Results show that models quantized by GWQ perform better than those produced by other quantization methods. During the quantization process, GWQ needs only one calibration set to achieve effective quantization. GWQ also achieves a 1.2x inference speedup compared with the original model and effectively reduces inference memory.
大型语言模型(LLMS)在解决复杂的语言任务方面表现出令人印象深刻的成绩。然而,它的大量参数对部署提出了巨大的挑战。因此,将LLMS压缩到低位位位上可以使资源受限制的装置得到部署。为了解决这个问题,我们提议了低位重量四分法(GWQ),即低位重量四分法(GWQ),这是利用梯度使外层局部化的首个量化方法,只需要最低限度的校准数据来进行外层检测。GWQ在FP16精确度上优先保留顶端的1外端值,而剩余的非外层重量则储存在低位。我们广泛评价GWQ的不同任务包括语言建模、地基探测、大型多任务语言理解和视觉语言问题及回答。结果显示,GWQ量化的模型比其他四分法方法效果更好。在四分法过程中,GWQ只需要一个校准装置来实现有效的夸度。此外,GWQ在与原始模型相比,实现了1.2x的推力速度,并有效地减少了内存。
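Illustrative sketch of the idea in the GWQ abstract: use gradient magnitude to pick the top 1% of weights as outliers, keep them at full precision, and quantize the rest to a low bit-width. The uniform min-max quantizer, the function name, and the toy weights and gradients are assumptions; the abstract does not specify the quantizer or how gradients are obtained from the calibration set.

```python
def gradient_aware_quantize(weights, grads, outlier_frac=0.01, bits=4):
    """Quantize `weights` to `bits` bits, except gradient-selected outliers.

    Returns (dequantized weights, set of outlier indices).
    """
    n = len(weights)
    k = max(1, int(round(outlier_frac * n)))
    # Indices of the k weights with the largest gradient magnitude.
    by_grad = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    outliers = set(by_grad[:k])

    rest = [weights[i] for i in range(n) if i not in outliers]
    if not rest:  # every weight was selected as an outlier
        return list(weights), outliers

    # Uniform min-max quantization grid over the non-outlier weights only.
    lo, hi = min(rest), max(rest)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0

    out = []
    for i, w in enumerate(weights):
        if i in outliers:
            out.append(w)  # kept at full precision (FP16 in the paper)
        else:
            q = round((w - lo) / scale)  # integer code in [0, levels]
            out.append(lo + q * scale)   # dequantized value
    return out, outliers

# Hypothetical layer: 100 weights, one of which has a large gradient.
weights = [((i * 37) % 100) / 100 - 0.5 for i in range(100)]
grads = [0.01] * 100
weights[7], grads[7] = 8.0, 5.0

quantized, outlier_idx = gradient_aware_quantize(weights, grads)
print(sorted(outlier_idx), quantized[7])
```

Excluding the outlier before computing the grid keeps the quantization step small: with the 8.0 weight included, the same 4-bit grid would have to span a range roughly nine times wider.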
Article 91
Title@2025-05-29 (4): Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
Title: Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation | Threading the Needle: Rewebing Chain-of-Thought Begründung zu erklären, Human Label Variation | 针线串列: 重新编织尝试链 解释人类标签变化的原因 2505.23368v1 |
Authors: Beiduo Chen, Yang Janet Liu, Anna Korhonen, Barbara Plank
The recent rise of reasoning-tuned Large Language Models (LLMs), which generate chains of thought (CoTs) before giving the final answer, has attracted significant attention and offers new opportunities for gaining insights into human label variation (HLV), which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but such approaches typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
最近由理性调整的大型语言模型(LLMS)的兴起引起了人们的极大关注,为深入了解人类标签变异提供了新的机会,这些变异是指多个说明者对同一数据实例的标签存在似是而非的差别。 先前的工作表明,LLM产生的解释有助于将模型预测与人类标签分布相统一,但通常采用相反的范式:根据给定的答案作出解释。相比之下,CTs提供了一条前方推理路径,在提出答案之前,可能隐含每个答案选项的理由。因此,我们提出了一个新的基于LLM管道,用语言基础的谈话分解器丰富了该管道的内容,以便从 CoTs提取对每个答案选项的支持和反对意见。 我们还提出了一个基于等级的HLV评价框架,将答案的排序置于精确分数之上,而不是直接比较标签分布。我们的方法超越了直接生成的方法以及三个数据集的基线,并显示与人类的排序方法更加一致,突出了我们的方法的有效性。
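Illustrative sketch of what "ranking of answers over exact scores" can mean in practice. Kendall's tau is used here as a stand-in rank metric; it is an assumption, not necessarily the metric used in the paper, and the label distributions are made up.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two score lists over the same answer options."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical distributions over three answer options.
human = [0.60, 0.30, 0.10]
model_same_order = [0.45, 0.40, 0.15]  # different scores, same ranking
model_swapped = [0.30, 0.60, 0.10]     # top two options swapped

tau_same = kendall_tau(human, model_same_order)
tau_swapped = kendall_tau(human, model_swapped)
print(tau_same, round(tau_swapped, 3))
```

A rank metric gives the first model full credit even though its probabilities differ from the human distribution, whereas a distribution-distance metric would penalise it.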
Article 92
Title@2025-05-29 (4): Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs
Title: Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs | Graph of Records: Steigerung der retrieval Augmented Generation für Langkontext-Zusammenfassung mit Graphen | 记录图图:用图表进行长文本摘要的推进检索增量生成器 2410.11001v2 |
Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
Retrieval-augmented generation (RAG) has revitalized Large Language Models (LLMs) by injecting non-parametric factual knowledge. Compared with long-context LLMs, RAG is considered an effective summarization tool in a more concise and lightweight manner, which can interact with LLMs multiple times using diverse queries to get comprehensive responses. However, the LLM-generated historical responses, which contain potentially insightful information, are largely neglected and discarded by existing approaches, leading to suboptimal results. In this paper, we propose $\textit{graph of records}$ ($\textbf{GoR}$), which leverages historical responses generated by LLMs to enhance RAG for long-context global summarization. Inspired by the $\textit{retrieve-then-generate}$ paradigm of RAG, we construct a graph by establishing an edge between the retrieved text chunks and the corresponding LLM-generated response. To further uncover the intricate correlations between them, GoR features a $\textit{graph neural network}$ and an elaborately designed $\textit{BERTScore}$-based objective for self-supervised model training, enabling seamless supervision signal backpropagation between reference summaries and node embeddings. We comprehensively compare GoR with 12 baselines across four long-context summarization datasets, and the results indicate that our proposed method reaches the best performance ($\textit{e.g.}$, 15%, 8%, and 19% improvement over retrievers w.r.t. Rouge-L, Rouge-1, and Rouge-2 on the WCEP dataset). Extensive experiments further demonstrate the effectiveness of GoR.
检索增强生成(RAG)通过注入非参数化的事实知识,为大语言模型(LLM)注入了新的活力。与长上下文LLM相比,RAG被认为是一种更简洁、更轻量的有效摘要工具,它可以使用多样化的查询与LLM多次交互,以获得全面的回答。然而,LLM生成的历史回答虽然可能包含有价值的信息,却在很大程度上被现有方法忽视和丢弃,导致结果欠佳。本文提出记录图(graph of records,GoR),利用LLM生成的历史回答来增强RAG在长上下文全局摘要中的表现。受RAG“先检索后生成”范式的启发,我们在检索到的文本块与相应的LLM生成回答之间建立边,从而构建一个图。为了进一步揭示它们之间的复杂关联,GoR采用图神经网络和精心设计的基于BERTScore的目标进行自监督模型训练,使监督信号能够在参考摘要与节点嵌入之间顺畅地反向传播。我们在四个长上下文摘要数据集上将GoR与12个基线进行了全面比较,结果表明所提方法取得了最佳性能(例如,在WCEP数据集上,相对于检索器在Rouge-L、Rouge-1和Rouge-2上分别提升15%、8%和19%)。大量实验进一步证明了GoR的有效性。
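Illustrative sketch of the graph construction step described in the GoR abstract: each query retrieves text chunks, an LLM produces a response, and an edge links every retrieved chunk to that response. The word-overlap retriever and the `respond` stub are placeholders (assumptions) for a real retriever and LLM call; the graph neural network and BERTScore-based training objective are not shown.

```python
def overlap(query, chunk):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_graph(chunks, queries, respond, top_k=1):
    """Return (edges, responses).

    Each edge is (("chunk", i), ("response", j)): chunk i was retrieved
    for query j and therefore supports response j.
    """
    edges = set()
    responses = []
    for qi, query in enumerate(queries):
        ranked = sorted(range(len(chunks)),
                        key=lambda i: overlap(query, chunks[i]),
                        reverse=True)[:top_k]
        responses.append(respond(query, [chunks[i] for i in ranked]))
        for ci in ranked:
            edges.add((("chunk", ci), ("response", qi)))
    return edges, responses

# Hypothetical document chunks and simulated user queries.
chunks = [
    "the storm hit the coast",
    "officials opened shelters",
    "markets rose on tuesday",
]
queries = ["where did the storm hit", "what did officials do"]

# Stub standing in for an LLM call.
stub_llm = lambda query, context: "answer based on: " + " | ".join(context)

edges, responses = build_graph(chunks, queries, stub_llm)
print(sorted(edges))
```

The resulting bipartite structure is what lets historical responses, which existing approaches discard, act as additional nodes that connect related chunks.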
Article 93
Title@2025-05-29 (4): Discriminative Policy Optimization for Token-Level Reward Models
Title: Discriminative Policy Optimization for Token-Level Reward Models | Diskriminative Politikoptimierung für Token-Level-Reward-Modelle | 东京级奖励模式的区别对待政策优化 2505.23363v1 |
Authors: Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
与结果奖励模型(ORM)相比,过程奖励模型(PRM)能为策略模型的优化提供更细致的监督,因而有望提升LLM在复杂推理任务中的能力。近期工作将奖励建模融入生成模型的训练,以词元生成概率得出奖励分数,从而把PRM从步骤级推进到词元级粒度。然而,生成式语言建模与奖励建模之间的冲突可能带来不稳定,并导致信用分配不准确。为应对这一挑战,我们将奖励建模与语言生成解耦,重新审视词元级奖励分配,并通过优化一个判别式策略得到词元级奖励模型,称为Q函数奖励模型(Q-RM)。我们从理论上证明,Q-RM无需依赖细粒度标注,即可从偏好数据中显式学习词元级Q函数。在实验中,Q-RM在各项基准上始终优于所有基线方法。例如,在与PPO/REINFORCE算法结合时,Q-RM在数学推理任务上的平均Pass@1得分比ORM基线高5.85/4.70分,比词元级PRM对应方法高4.56/5.73分。此外,使用Q-RM进行强化学习显著提高了训练效率,在GSM8K上的收敛速度比ORM快12倍,在MATH上比步骤级PRM快11倍。代码和数据见 https://github.com/homzer/Q-RM。
Article 94
Title@2025-05-29 (4): Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
Title: Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability | Sind Generative Modelle unterbewusst? Bessere Qualitätsschätzung mit erhöhter Modellwahrscheinlichkeit | 产生型号是否缺乏自信?更好的质量估算与促进型号的模型概率 2502.11115v2 |
Authors: Tu Anh Dinh, Jan Niehues
Quality Estimation (QE) is the task of estimating the quality of model output during inference when the ground truth is not available. Deriving output quality from the model’s output probability is the simplest, lowest-effort approach. However, we show that the output probability of text-generation models can appear underconfident. At each output step, there can be multiple correct options, which spreads the probability distribution across them. Thus, lower probability does not necessarily mean lower output quality. Based on this observation, we propose a QE approach called BoostedProb, which boosts the model’s confidence in cases where there are multiple viable output options. With no increase in complexity, BoostedProb is notably better than raw model probability in different settings, achieving an average improvement of +0.194 in Pearson correlation with ground-truth quality. It also comes close to or outperforms more costly approaches like supervised or ensemble-based QE in certain settings.
质量估计( QE) 是在没有地面真实信息时估计模型输出的质量。 从模型输出概率中得出输出质量是最微不足道和低努力的方式。 然而, 我们显示文本生成模型的输出概率可能显得不自信。 在每一个输出步骤中, 概率分布都可能很多, 使概率分布分布更加分散。 因此, 概率较低并不一定意味着产出质量较低 。 基于此观察, 我们提议了一种名为“ 推力Prob ” 的 QE 方法, 这种方法可以增强模型对多种可行输出选项案例的信心。 在不增加复杂性的情况下, 推力Prob 明显优于不同环境中的原始模型概率, 平均+0. 194 改善Pearson 与地图质量的关系。 在某些环境中, 它也接近或超过成本更高的方法, 如监督或基于元素的 QE 。
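Illustrative sketch of the underconfidence effect described in the abstract, plus one possible way to "boost" a token's probability when several options are viable. The rule below (treat tokens within a fixed ratio of the top probability as dominant, then credit a chosen dominant token with their combined mass) is an assumption for illustration only; the abstract does not give the BoostedProb formula.

```python
def boosted_prob(dist, chosen, ratio=0.5):
    """Return a boosted confidence for `chosen` under next-token dist `dist`.

    Assumption: tokens with probability >= ratio * max probability are
    'dominant' (plausibly all correct). A chosen dominant token receives
    the summed mass of all dominant tokens; any other token keeps its
    raw probability.
    """
    top = max(dist.values())
    dominant = {tok for tok, p in dist.items() if p >= ratio * top}
    if chosen in dominant:
        return sum(dist[tok] for tok in dominant)
    return dist[chosen]

# Hypothetical next-token distribution with two near-synonymous options.
next_token = {"big": 0.40, "large": 0.35, "huge": 0.15, "tiny": 0.10}

raw_large = next_token["large"]
boosted_large = boosted_prob(next_token, "large")
boosted_tiny = boosted_prob(next_token, "tiny")
print(raw_large, round(boosted_large, 2), boosted_tiny)
```

Here the raw probability of "large" is well below 0.5 only because "big" is an equally good choice, which is the situation in which raw probability understates output quality.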
Article 95
Title@2025-05-29 (4): mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Title: mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus | mOSCAR: Ein multimodaler, mehrsprachiger und multimodaler Korpus auf Dokumentebene | MOSCAR: 大型多语种和多模式文件级公司 2406.08707v2 |
Authors: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. (2022) showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 303M documents, 200B tokens and 1.15B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs. The dataset is released under the Creative Commons CC BY 4.0 license and can be accessed here: https://huggingface.co/datasets/oscar-corpus/mOSCAR
虽然大多数多式大语言模型(mLLMS)仅就大量文本图像数据进行培训,但大多数多式和多式数据集要么只就类标题数据进行培训,Alayrac等人(2022年)则表明,额外培训它们有关文本和图像的分流序列,可以导致产生文体内学习能力,不过,它们使用的数据集M3W(M3W)不是公开的,只有英文版。曾尝试复制其结果,但所发布的数据集只有英文版。相比之下,目前的多语种和多式数据集要么只由类标题数据或中等尺度或完全私有数据组成。这限制了对世界上7 000种其他语言的MLLM研究。因此,我们最了解的是,它们使用的是首个大规模多语种和多式文件汇编,它们涵盖163种语言、303M文件、200B表示器和1.15B图像。我们仔细地进行了一套过滤和评价步骤,以确保MSCAR能够足够安全、多样和高质量的读取。我们额外训练了两种多式的S-ROS(HI)的多语文/CAR) 模型,用以证明一个经过训练的模型,并展示了各种数据。
Article 96
Title@2025-05-29 (4): Nosey: Open-source hardware for acoustic nasalance
Title: Nosey: Open-source hardware for acoustic nasalance | Nosey: Open-Source-Hardware für akustische Nasalance | 鼻鼻:用于音响鼻鼻腔的开源硬件 2505.23339v1 |
Authors: Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham
We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
我们首先介绍Nusey(纳萨雷斯开放源码估计值Ystem),这是一个低成本的、可定制的、3D打印的系统,用于记录我们作为开放源码硬件提供的声纳数据(http://github.com/phoneticslab/nosey),我们首先概述硬件鼻吸系统背后的动机和设计原则,然后比较Nosey和一个商业鼻吸装置。Nosey显示,鼻吸食得分一直高于商业装置,但声学环境之间的对比程度在系统之间是相似的。我们还审查了硬件定制便利测试的方法,例如对麦克风和不同建筑材料进行比较。我们的结论是,Nosey是商业鼻吸装置的一种灵活和具有成本效益的替代方法,并提出了一些方法方面的考虑,用于数据收集。
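Illustrative sketch of how a nasalance score is conventionally computed from separate nasal and oral microphone channels: nasal acoustic energy divided by the sum of nasal and oral energy, expressed as a percentage. RMS amplitude is used as the energy measure and the signals are synthetic; any filtering or windowing applied by Nosey itself is not described in the abstract and is not modelled here.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def nasalance_percent(nasal, oral):
    """Nasalance = 100 * nasal / (nasal + oral), using RMS amplitudes."""
    n, o = rms(nasal), rms(oral)
    total = n + o
    if total == 0.0:  # silence on both channels
        return 0.0
    return 100.0 * n / total

# Synthetic two-channel recording: ten full cycles of a sine wave,
# with the nasal channel at amplitude 0.3 and the oral channel at 0.7.
cycle = [math.sin(2 * math.pi * k / 100) for k in range(1000)]
nasal_channel = [0.3 * s for s in cycle]
oral_channel = [0.7 * s for s in cycle]

score = nasalance_percent(nasal_channel, oral_channel)
print(round(score, 2))
```

Because the score is a ratio of the two channels, a device whose nasal microphone is more sensitive will report consistently higher values while still preserving contrasts between phonological environments, which matches the comparison reported in the abstract.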
Article 97
Title@2025-05-29 (4): Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors
Title: Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors | Weder Stochastic Parroting noch AGI: LLMs lösen Aufgaben durch kontextorientierte Extrapolation von Trainingsdaten Priors | 既不是蒸蒸碎剖析,也不是AGI:通过根据培训数据前期进行的背景差异外推法解解解任务LLMs Solve任务 2505.23323v1 |
Authors: Harish Tayyar Madabushi, Melissa Torgbi, Claire Bonial
In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either “stochastic parrots” or in possession of “emergent” advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this “context-directed extrapolation.” Under this view, substantiated through existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.
在这份立场文件中,我们提高了对LLM能力现实观点的批判性认识,即LLM能力不考虑极端的替代观点,即LLM是“随机鹦鹉”或拥有“显性”的先进推理能力,由于它们无法预测的出现,因此构成生存威胁。我们的中层观点是,LLMS从其培训数据的先前数据中推断出,类似于内文学习的机制能够将适当的信息从适当的信息定位到外推。我们称之为“文本导外推法”。 在这种观点下,虽然现有文献证实了LLMS的推理能力,但这种能力远远不止是随机对称的,但这种能力是可预见、可控制的,不能表明与人的高级认知能力相近的高级推理,而不能随额外培训而无限的伸缩。因此,对无法控制的机构的出现所担心被消除了,而研究进展又适当地重新聚焦于环境导向外推法的过程,以及这种与培训数据的互动方式如何产生宝贵的LMS能力。因此,未来的工作可以探索替代的扩大LMMS的内在推理法。
Article 98
Title@2025-05-29 (4): DReSD: Dense Retrieval for Speculative Decoding
Title: DReSD: Dense Retrieval for Speculative Decoding | DResD: Dense Retrieval für spekulative Dekodierung | DRESD: 用于投机性代号的高级检索值 2502.15572v2 |
Authors: Milan Gritta, Huiyin Xue, Gerasimos Lampouras
Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
光学解码模式(SD)通过使用高效的模型草案来加速大语言模型(LLM)的生成,从而加速了大语言模型(LLM)的生成。 由LLM在一个前方呼叫中验证, 减少悬浮, 同时保留其输出。 我们侧重于基于检索的SD, 模型草案从一个非参数数据储存处获取下一个符号。 在字符串表面形式上运行的Sprassy检索(REST)目前是主导模式, 因其简单和可缩放性而成为主导模式。 但是, 由于其使用短背景和精确的字符串匹配, 其有效性有限。 相反, 我们引入了“ 光学检索” , 这个新框架使用背景化符号嵌入的近邻搜索来检索与SD最有语义相关性的代号序列。 广泛的实验显示, DRESDD达到( 平均) 87% 更高的接受率、 65% 更接受的代号以及19% 的生成速度, 与稀薄检索相比(REST) , 更快。
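Illustrative sketch of retrieval-based speculative decoding with dense vectors: the current context embedding is matched against a datastore, the stored continuation of the nearest entry is proposed as the draft, and the target model accepts the longest matching prefix. Exact cosine search stands in for approximate nearest neighbour search, and the embeddings, datastore contents, and greedy verification rule are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_draft(query, datastore):
    """Return the continuation stored with the most similar embedding."""
    best = max(datastore, key=lambda entry: cosine(query, entry[0]))
    return best[1]

def accepted_prefix_len(draft, target):
    """Number of leading draft tokens that match the target model's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

# Hypothetical datastore of (context embedding, next tokens) pairs.
datastore = [
    ([1.0, 0.0, 0.0], ["the", "cat", "sat"]),
    ([0.0, 1.0, 0.0], ["a", "dog", "ran"]),
    ([0.7, 0.7, 0.0], ["the", "dog", "sat"]),
]
context_embedding = [0.9, 0.1, 0.0]
target_tokens = ["the", "cat", "slept"]  # what the target model would emit

draft = retrieve_draft(context_embedding, datastore)
accepted = accepted_prefix_len(draft, target_tokens)
print(draft, accepted)
```

Unlike exact string matching, the lookup succeeds even though no stored key equals the query vector, which is the motivation the abstract gives for moving from sparse to dense retrieval.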
Article 99
Title@2025-05-29 (4): Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Title: Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO | Proximalisierte Preference-Optimierung für unterschiedliche Feedback-Typen: Eine zersetzte Perspektive auf DPO | 多种反馈类型最佳优化:对残疾人组织拆解的视角 2505.23316v1 |
Authors: Kaiyang Guo, Yinchuan Li, Zhitang Chen
Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO), the seminal direct alignment method, and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
直接调整方法通常最优化大语言模型(LLMs),与偏好和偏差反应的可能性形成对比。虽然这些方法在引导LLMs与相对偏好相匹配方面十分有效,但在引导LLMs与相对偏好相匹配方面却经常被注意到,这些方法有助于减少实例反应的绝对可能性。因此,一致模式往往产生与预期模式不同的产出,即使没有奖励模式,也表现出奖励-打包效应。这一不理想的后果暴露了对比性调整的根本局限性,我们将其描述为确定中的可能性。在这项工作中,我们重新审视直接偏好优化(DPO) – – 原始直接调整方法 – – 并表明其损失在理论上认可了一种脱节制的重新组合。重订的损失不仅扩大了对一系列反馈类型的适用性,而且还对确定中可能性的潜在原因提供了新的见解。具体地说,DPOPO执行标准隐含了在重新确定损失时的简化,并恢复其完整版本有效地解决了确定中的问题。我们利用这些发现的结果,我们引入了PRimimizal化的优化统一方法,以统一的方式调整了各种综合的概率,从而消除了目前在确定中的风险。
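Illustrative sketch of why a contrastive loss leaves absolute likelihoods underdetermined. The function below is the standard DPO loss for a single preference pair, not the PRO loss proposed in the paper; the log-probability values are made up.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_*     : policy log-likelihoods of the two responses
    ref_logp_* : reference-model log-likelihoods of the same responses
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical log-likelihoods for one training pair.
base = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)

# Lower BOTH policy log-likelihoods by 5 nats: the preferred response is now
# far less likely in absolute terms, yet the loss is unchanged.
shifted = dpo_loss(logp_w=-15.0, logp_l=-17.0, ref_logp_w=-11.0, ref_logp_l=-11.0)

print(round(base, 4), round(shifted, 4))
```

The loss depends only on the difference between the two log-ratios, so any common shift in the likelihoods of both responses goes unpenalised; this is the underdetermination that the reinstated regularizer in PRO is meant to remove.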
Article 100
Title@2025-05-29 (4): Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments
Title: Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments | Verbesserung der Genauigkeit der Markerbewertung durch ordinelles Vertrauensmodellierung in Bildungsbewertungen | 通过在教育评估中建立常规信任模型,加强标标码的准确度 2505.23315v1 |
Authors: Abhirup Chakravarty, Mark Brenchley, Trevor Breakspear, Ian Lewin, Yan Huang
A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97, and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model, where we release all AM predicted scores.
自动读取系统( AES) 中的一个关键道德挑战是确保分数只有在达到高可靠性标准时才释放出来。 信任建模通过给每个自动分分分配一个可靠性估计尺度, 以信任分的形式对每个自动分进行。 在本研究中, 我们将信任估测设定为分类任务: 预测 AES 生成的得分是否正确地将候选人置于适当的 CEFR 级别上。 虽然这是一个二进制决定, 我们以两种方式利用评分域固有的颗粒性。 首先, 我们使用分数宾点将任务重新表述为n- 分类问题。 第二, 我们推出一套包含 CEFR 标签的方形结构的新型 Kernelweighted Ordinal Categorical Entropy (KWOCCE) 损失函数。 我们最优秀的模型达到F1 0.97 分, 使系统能够以100% CEFR协议和99%的得分数发放47%, 至少95%的CEFRFR 协议 — 约92% (approxx) 。
Article 101
Title@2025-05-29 (4): Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Title: Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction | Datensatz-Featurierung: Enthüllen natürlicher Sprach-Features durch unüberwachte Daten-Rekonstruktion | Dataset Featuriz化:通过未受监督的数据重建发现自然语言特征 2502.17541v2 |
Authors: Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk
Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to human-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.
解释数据是现代研究的核心。大型语言模型(LLMS)在提供对数据的自然语言解释方面显示了希望,但简单的特征提取方法,例如促进往往不能为各种数据集提供准确和多功能的描述,对颗粒和规模缺乏控制。为解决这些局限性,我们提议了一套域不可知的数据集编造方法,在保持与人类标签相类似的紧凑和描述性特征的同时,对提取的特征的数量进行精确控制。我们的方法优化了信息二进制特征的选择,方法是评价LLM公司利用这些特征重建原始数据的能力。我们在数据集建模任务和两个案例研究中显示了其有效性:(1) 构建一个破狱策略特征,该特征集中地捕捉了更大规模人造袭击的效果和多样性;(2) 将发现符合人类喜好、达到与人造特征相近的准确性和稳健性特征的特征自动化。此外,我们表明管道的尺寸是有效的,随着额外特征的改进而得到抽样,使之适合大型和多样化数据集。
Article 102
Title@2025-05-29 (4): Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs
Title: Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs | Generalisierte Category Discovery in Event-Centric Kontexten: Latent Pattern Mining mit LLMs | 事件发生时发现的情况:利用LLMM公司进行原型采矿 2505.23304v1 |
Authors: Yi Luo, Qiwen Wang, Junqi Yang, Luyao Tang, Zhenghao Lin, Zhenzhe Ying, Weiqiang Wang, Chen Lin
Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.
普通分类发现(GCD)旨在使用仅包含已知类别的部分标签数据对已知类别和新出现类别进行分类。尽管在现有基准方面业绩良好,但当前文本的GCD方法在现实环境中缺乏足够的验证。我们引入了以长而复杂的叙述和高度不平衡的分类分布为特征的事件中心GCD(EC-GCD),这提出了两大挑战:(1) 由主观标准造成的不同集群和分类组;(2) 少数类的分类不公。为了解决这些问题,我们提议采用PAM(PAM)这一框架,利用LLMS来提取和完善事件模式,以改进集群类的校准。此外,排名过滤-采矿管道确保了不同类别原型的均衡代表性。对EC-GCD(EC-GCD)两个基准的评估,包括新建的Sham报告数据集,表明PAM(Sam Report)比先前方法优于12.58%的H核心收益,同时在基本GCD数据集上保持强的概括性。
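The abstract reports gains in H-score without defining it. A minimal sketch follows, under the assumption that H-score here means the harmonic mean of accuracy on known classes and accuracy on novel classes, which is a common convention in category discovery work; the paper may define it differently.

```python
def h_score(known_acc: float, novel_acc: float) -> float:
    """Harmonic mean of known-class and novel-class accuracy (assumed definition)."""
    total = known_acc + novel_acc
    if total == 0:
        return 0.0
    return 2 * known_acc * novel_acc / total


if __name__ == "__main__":
    # A model that is strong on known classes but weak on novel ones
    # is penalised more than a plain average would suggest.
    print(h_score(0.8, 0.4))
```

The harmonic mean rewards balance, so a method cannot score well by doing well on known classes alone.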
Article 103
Title@2025-05-29 (4): Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs
Title: Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs | Dateneffiziente Meta-Modelle zur Auswertung kontextbasierter Fragen und Antworten in LLMs | 评价LLMM基于背景的问答的元模型 2505.23299v1 |
Authors: Julia Belikova, Konstantin Polev, Rauf Parchiev, Dmitry Simakov
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states – such as activation tracing and representation analysis – show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.
大型语言模型(LLMS)和检索增强的一代(RAG)系统越来越多地用于工业应用,但其可靠性仍然受到检测幻觉方面的挑战的阻碍。尽管受监督的利用LLM隐藏状态(如启动追踪和陈述分析)的LLM最新方法(SOTA)显示前景,但依赖广泛附加说明的数据集限制了现实应用中的可缩放性。本文件通过调查减少两个SOTA幻觉检测框架( “ 回溯镜 “ )的培训数据要求的可行性,解决了数据的关键瓶颈问题: “ 回溯镜 “ ,它分析注意力头动态,以及基于预测的方法,解码内部模型代表。我们提出了一种方法,将有效的分类算法与降低维度技术相结合,以尽量减少抽样规模的需求,同时保持竞争性的绩效。对标准化问答RAG基准的评估表明,我们的方法的性能与强有力的专利LMM基线相当,只有250个培训样本。这些结果突出表明了工业部署的轻度、数据效率范式潜力,特别是在注意性强的情景下。
Article 104
Title@2025-05-29 (4): EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian
Title: EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian | EmoBench-UA: Ein Benchmark-Datensatz für Emotionserkennung in der Ukraine | EmoBenich-UA:乌克兰情感检测基准数据集 2505.23297v1 |
Authors: Daryna Dementieva, Nikolay Babakov, Alexander Fraser
While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing on the Toloka.ai platform, ensuring a high-quality annotation process. We then evaluate a range of approaches on the collected dataset, from linguistic-based baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.
虽然乌克兰国家语言方案在许多文本处理任务方面取得了进展,但情绪分类仍是一个探索不足的领域,迄今还没有公布基准。在这项工作中,我们引入了Emobench-UA,这是乌克兰文本中第一个情感检测附加说明的数据集。我们的批注计划是从以前的以英语为中心的情感检测工作(Mohammad等人,2018年;Mohammad,2022年)指南中改编的。数据集是利用Toloka.ai平台通过众包创建的,确保了高质量的批注过程。然后,我们评估了从语言基线、从英语翻译的合成数据到大型语言模型(LLMS)等所收集的数据集的一系列方法。我们的调查结果突出了乌克兰语等非主流语言的情感分类挑战,并强调了进一步发展乌克兰语特定模式和培训资源的必要性。
Article 105
Title@2025-05-29 (4): How Does Response Length Affect Long-Form Factuality
Title: How Does Response Length Affect Long-Form Factuality | Wie wirkt sich die Response-Länge auf die Langform-Faktizität aus? | 反应时间长度如何影响长期事实质量 2505.23295v1 |
Authors: James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
大型语言模型(LLMs)被广泛用于长式文本的生成,但是,答复中的事实错误会损害其可靠性。尽管人们日益关注LLM的实际情况,但答复时间长度对事实质量的影响仍未得到充分探讨。在这项工作中,我们系统地调查这种关系,首先采用自动和双级长式事实质量评估框架,在符合成本效益的情况下与人的注释取得高度一致。我们利用这一框架,进行有控制的实验,发现较长的答复显示事实准确性较低,证实存在时间偏差。为了解释这一现象,我们从经验上研究了三种假设:错误传播、长背景和事实耗竭。我们的结果显示,在模型逐渐耗尽更可靠的知识的情况下,事实用尽是造成实际退化的主要原因,而不是其他两种假设。
Article 106
Title@2025-05-29 (4): Multi-Modal Framing Analysis of News
Title: Multi-Modal Framing Analysis of News | Multi-Modal Framing Analyse der Nachrichten | 新闻多模式结构分析 2503.20960v3 |
Authors: Arnav Arora, Srishti Yadav, Maria Antoniak, Serge Belongie, Isabelle Augenstein
Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-) language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
在计算社会科学中,政治传播的自动框架分析是一项流行的任务,用于研究作者如何选择一个专题的方方面面来设计其接受范围。迄今为止,这种研究范围很窄,使用一套固定的预设框架,只注重文本,忽视了文本的视觉背景。特别是为了在新闻中进行设计,这留下了关于编辑选择的宝贵信息,其中不仅包括书面文章,也包括相附照片。为了克服这些限制,我们提出了一个方法,用大型(视觉)语言模型进行规模的多式多标签框架分析。我们用构思理论作为我们工作的基础,我们从图像中提取潜含的含义,用来传递一个特定点,并通过比较所使用的相应框架来对比文本。我们还确定了高度偏向性的主题框架,在先前的质量工作中发现了针对具体问题的框架分析。我们展示了对文本和新闻图像进行可扩展的综合框架分析的方法,为理解媒体偏见提供了更完整的图片。
Article 107
Title@2025-05-29 (4): ScEdit: Script-based Assessment of Knowledge Editing
Title: ScEdit: Script-based Assessment of Knowledge Editing | ScEdit: Script-basierte Bewertung von Wissensbearbeitung | ScEdit: 基于脚本的知识编辑评估 2505.23291v1 |
Authors: Xinye Li, Zunwen Zheng, Qian Zhang, Dekai Zhuang, Jiabao Kang, Liyan Xu, Qingbin Liu, Xi Chen, Zhiying Tu, Dianhui Chu, Dianbo Sui
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark – ScEdit (Script-based Knowledge Editing Benchmark) – which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based (“What”-type question) evaluation to action-based (“How”-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
知识编辑(KE)受到越来越多的关注,但当前的KE任务仍然相对简单。在目前的评价框架下,许多编辑方法取得了极高的分数,有时甚至接近完美。然而,很少有研究将KE纳入现实世界应用情景(例如最近对LLM-as-agent的兴趣)。为支持我们的分析,我们引入了一个新的基于脚本的基准 – – ScEdit(基于知识编辑基准) – – 包括反事实和时间编辑。我们整合了象征性和文本一级的评价方法,全面分析了现有的KE技术。基准将传统的基于事实的评价(“是什么”类型的问题)扩大到基于行动的评价(“如何”类型的问题)。我们观察到,所有KE方法在既定的衡量标准上表现有下降,在文本级别指标上面临挑战,这表明一项艰巨的任务。我们的基准可在https://github.com/asdfo123/ScEdit查阅。
Article 108
Title@2025-05-29 (4): Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
Title: Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency | Unsicherheit Quantifizierung für LLMs durch Minimum Bayes Risiko: Vertrauensüberbrückung und Konsistenz | 通过最低贝谷风险对LLMs的不确定性量化: 建立互信和一致性 2502.04964v4 |
Authors: Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, Maxim Panov
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based methods, which focus on model confidence expressed as token probabilities, and consistency-based methods, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency, leading to a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.
大语言模型(LLMs)的不确定性量化(UQ)方法包含多种方法,其中主要有两大类特别突出:以信息为基础的方法,侧重于以象征性概率表示的模型信心,以一致性为基础,评估通过反复抽样产生的多种产出之间的语义关系;最近的一些方法结合了这两种方法,以提高UQ的绩效;然而,这些方法有时未能超过简单得多的基线方法;我们的工作讨论了建立不确定性措施的基本方法,这些方法直接将不确定性与LLM解码所实现的最低海湾风险联系起来。基于这些调查结果,我们提出了一种将模型信心与产出一致性结合起来的新办法,从而形成一个高效和稳健的UQ方法的组合。我们的调查揭示了LLMs作为概率模型的独特性,这有助于解释为什么这些UQ方法在某些任务中不完善的原因。根据这些调查结果,我们提出了一种将模型信心和产出一致性结合起来的新方法,从而形成一个高效和稳健的UQ方法的组合。我们评估了我们在不同任务中采用的方法,例如问题解答、抽象的合成和机器的翻译方法,展示了超度的状态。
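The link between uncertainty and minimum Bayes risk can be made concrete with a small sketch of generic MBR decoding over sampled outputs. This is not the paper's method: the Jaccard token-overlap similarity, the function names, and the uniform weighting of samples are all illustrative assumptions.

```python
from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (a stand-in utility function)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mbr_risk(
    candidate: str,
    samples: List[str],
    similarity: Callable[[str, str], float] = token_overlap,
) -> float:
    """Monte Carlo estimate of expected risk: 1 - mean similarity to the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return 1.0 - sum(similarity(candidate, s) for s in samples) / len(samples)


def mbr_decode(
    samples: List[str],
    similarity: Callable[[str, str], float] = token_overlap,
) -> Tuple[str, float]:
    """Return the sample with the lowest estimated risk, and that risk."""
    risks = [mbr_risk(s, samples, similarity) for s in samples]
    best = min(range(len(samples)), key=risks.__getitem__)
    return samples[best], risks[best]


if __name__ == "__main__":
    answer, risk = mbr_decode(["paris", "paris", "london"])
    print(answer, round(risk, 3))
```

In this reading, the minimum risk itself serves as an uncertainty score: it is zero when all samples agree and grows as the samples diverge.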
Article 109
Title@2025-05-29 (4): MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Title: MathArena: Evaluating LLMs on Uncontaminated Math Competitions | MathArena: Bewertung von LLMs auf nicht kontaminierten Math-Wettbewerben | Matharena:评估未受污染数学竞赛的LLMs 2505.23281v1 |
Authors: Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev
The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as SMT 2025 – published well after model release dates – demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On USAMO 2025, even top models score below 25%, far behind their performance on final-answer tasks. So far, we have evaluated 30 models across five competitions, totaling 149 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
大型语言模型(LLMS)的推理能力快速提高导致数学基准的显著改善。然而,许多最常用的评价数据集(例如AIME 2024)在网上广泛提供,因此很难解开潜在记忆化的真正推理;此外,这些基准并不评价对许多数学任务至关重要的校对能力。为了解决这个问题,我们引入了基于以下关键洞察力的新基准MathArena:经常性数学竞赛提供了高质量的、具有挑战性的问题流,可用于对LLMS进行实时评价。一旦出现新的问题,就立即对模型进行评估,从而有效消除污染风险。利用这个框架,我们发现AIME 2024 中存在污染的强烈迹象。尽管如此,对2025年SMT(公布后)等较激烈的竞争的评价表明,高业绩模型具有令人印象深刻的推理能力。MathArena也是证据编写能力的第一个基准。在USMO 2025年,甚至顶级模型的得分都低于25,远远低于其最后答案任务的业绩,我们有效地消除了污染风险。在AIME 2024中发现污染的明显的迹象。然而,我们已经评估了五种竞争中的最新进展。
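The core idea of evaluating only on problems published after a model's release can be sketched as a simple date filter. The competition names and dates below are placeholders invented for illustration; they are not the benchmark's actual data.

```python
from datetime import date
from typing import Dict, List


def uncontaminated(competitions: Dict[str, date], model_release: date) -> List[str]:
    """Names of competitions published strictly after the model's release date."""
    return sorted(
        name for name, published in competitions.items() if published > model_release
    )


# Placeholder data (hypothetical names and dates).
toy_competitions = {
    "comp_a": date(2024, 2, 1),
    "comp_b": date(2025, 2, 15),
    "comp_c": date(2025, 4, 20),
}

if __name__ == "__main__":
    print(uncontaminated(toy_competitions, date(2025, 1, 1)))
```

A competition published on the release date itself is excluded here, which is a conservative choice rather than something the abstract specifies.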
Article 110
Title@2025-05-29 (4): Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective
Title: Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective | Sentinel: Aufmerksamkeitsprobierung von Proxy-Modellen für LLM-Kontextkompression mit verstehender Perspektive | 哨兵:注意从理解角度观察LLM背景压缩的代理模型 2505.23277v1 |
Authors: Yong Zhang, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5$\times$ compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
Retrieval- 缩放生成(RAG) 将大型语言模型(LLMs)与外部环境结合,但检索到的段落往往冗长、吵闹或超过输入限值。现有的压缩方法通常需要专门压缩模型的监督培训、增加成本和减少可移动性。我们提议Sentinel, 轻量级的句级压缩框架, 将背景过滤重新设定为基于关注的理解任务。 Sentinel 探测器不是培训压缩模型, 而是使用轻量级分类器将注意力从一个现成的0. 5B 代理LM 中分解出来, 以确定判刑相关性。 简而言之, 我们发现, 查询- 文本相关性估计是跨模型尺度的一致的, 与0. 5B 近似大型模型的行为密切吻合。 在LongBench基准上, Sentinel 达到最多5美元的时间压缩, 同时匹配 7B 缩压系统的 QA 。 我们的结果表明, 本地关注信号使快速、 有效、 和 问题认知背景压缩。 代码可在 https://github.com/yzhangchuck/ Sentinel 上查阅 。
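Sentence-level context compression can be sketched as selecting the most relevant sentences under a token budget. In the paper, relevance comes from a classifier over a proxy model's attention; here the scores are simply passed in, tokens are counted by whitespace splitting, and the greedy selection rule is an assumption made for illustration.

```python
from typing import List


def compress(sentences: List[str], scores: List[float], token_budget: int) -> List[str]:
    """Greedily keep the highest-scoring sentences that fit the budget.

    Kept sentences are returned in their original order so the compressed
    context still reads coherently.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        cost = len(sentences[i].split())  # crude whitespace token count
        if used + cost <= token_budget:
            kept.append(i)
            used += cost
    return [sentences[i] for i in sorted(kept)]


toy_sentences = [
    "The sky is blue.",
    "Paris is the capital of France.",
    "Cats sleep a lot.",
]
toy_scores = [0.1, 0.9, 0.5]  # made-up relevance scores

if __name__ == "__main__":
    print(compress(toy_sentences, toy_scores, token_budget=10))
```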
Article 111
Title@2025-05-29 (4): The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text
Title: The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text | Der arabische KI-Fingerabdruck: Stylometrische Analyse und Erkennung von großen Sprachmodellen Text | 阿拉伯文 AI 指纹:大语言模型文本的tytyllogimics 分析和探测 2505.23276v1 |
Authors: Maged S. Al-Shaibani, Moataz Ahmed
Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia. These capabilities can enable sophisticated misinformation campaigns, compromise healthcare guidance, and facilitate targeted propaganda. The challenge is particularly severe in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.
大型语言模型(LLMS)在生成人文文本方面达到了前所未有的能力,对学术、社会媒体和学术界等关键领域的信息完整性提出了微妙而重大的挑战,使得复杂的错误信息运动得以进行,损害保健指导,并促进有针对性的宣传。这一挑战变得十分严峻,特别是在诸如阿拉伯语等探索不足和低资源语言方面。本文对阿拉伯机器生成的文本进行了全面调查,审查了多种代代战略(仅以标题生成、内容认知生成和文本完善),在学术和社会媒体领域(ALLAM、Jais、Llama和GPT-4),对信息完整性提出了微妙但又很严峻的挑战。我们的系统特征分析显示,在各种背景下,人类写作的文字模式与机器生成的阿拉伯文文本有区别。尽管LLMS具有人性特征,但我们证明,其阿拉伯文产出中具有可探测的特征在不同背景下差异很大。基于这些洞察的、基于BERT的检测模型模型在正式背景下(至99.9F1核心)取得了非常强的业绩,在模型结构中具有很强的精确性。我们的跨模型结构中,我们最精确的语言模型分析显示的人类文字结构中最精确的语言模式,我们最精确的版本分析显示,在生成的版本中,我们最接近的版本分析中,从结构中报告了我们最接近的版本的版本的版本的版本的版本的版本的文献结构中,在构建的版本分析显示了了我们所研判的版本的版本的版本的版本的版本中,在构建中,在文献中,在文献结构中,对历史结构中,对文献中报告了我们的数据。
Article 112
Title@2025-05-29 (4): BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes
Title: BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | BioVL-QR: Egozentrischer biochemischer Vision- und Sprachdatensatz mit Micro-QR-Codes | BioVL-QR:使用微质变码的Egocent 生物化学视觉和语言数据集 2404.03161v3 |
Authors: Tomohiro Nishimoto, Taichi Nishimura, Koki Yamamoto, Keisuke Shirai, Hirotaka Kameko, Yuto Haneji, Tomoya Yoshida, Keiya Kajimura, Taiyu Cui, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Shinsuke Mori
This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers because of the cluttered environment and indistinguishable objects. Previous studies assumed manual object annotation, which is costly and time-consuming. To address the issue, we focus on Micro QR Codes. However, detecting objects using only Micro QR Codes is still difficult due to blur and occlusion caused by object manipulation. To overcome this, we propose an object labeling method combining a Micro QR Code detector with an off-the-shelf hand object detector. As an application of the method and BioVL-QR, we tackled the task of localizing the procedural steps in an instructional video. The experimental results show that using Micro QR Codes and our method improves biochemical video understanding. Data and code are available through https://nishi10mo.github.io/BioVL-QR/
本文介绍BioVL-QR,这是一个生物化学视觉和语言数据集,由23个以自我为中心的实验视频、相应的协议和视觉和语言校正组成。了解生化视频的一个主要挑战是如何探测设备、试剂和容器,因为环境杂乱和无法分辨的物体。以前的研究假定人工标注是昂贵和费时的。为了解决这个问题,我们把重点放在微量QR代码上。然而,由于物体操纵造成的模糊和隐蔽,仅使用微量级QR代码的物体的探测仍然困难。为了克服这一困难,我们提议一种将微量 QR 代码探测器与现成的手物体探测器相结合的物体标签方法。作为这种方法和BioVL-QR的应用,我们完成了在教学视频中将程序步骤本地化的任务。实验结果显示,使用微量定量代码和我们的方法可以改善生物化学视频理解。数据和代码可通过 https://nishi10mo.github.io/BioVL-RQ/Q/
Article 113
Title@2025-05-29 (4): Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs
Title: Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs | Entfernt Machine Unlearning wirklich Modellwissen? Ein Rahmen für die Prüfung von Unlearning in LLMs | 机器取消学习是否真正删除了示范知识? 审计框架是否在LLMM中取消学习? 2505.23270v1 |
Authors: Haokun Chen, Yueqi Zhang, Yuan Bi, Yao Zhang, Tong Liu, Jinhe Bi, Jian Lan, Jindong Gu, Claudia Grosser, Denis Krompass, Nassir Navab, Volker Tresp
In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.
近年来,大语言模型(LLMS)取得了显著进步,引起了研究界的极大关注,其能力主要归功于大型结构,需要大规模数据集的广泛培训,然而,这类数据集往往包含来自公共互联网的敏感或版权内容,引起对数据隐私和所有权的关切。《一般数据保护条例》(GDPR)等监管框架赋予个人要求删除这类敏感信息的权利。这促使开发了旨在将特定知识从模型中去除而无需花费昂贵的再培训的机读算法。尽管取得了这些进步,但由于LLMS的内在复杂性和基因性质,评价未学习算法的功效仍然是一项挑战。在这项工作中,我们引入了一个非学习评价综合审计框架,由三个基准数据集、六个未学习算法和五个快速审计方法组成。我们通过使用各种审计算法,评估不同不学习战略的有效性和稳健性。为了探索超越快速审计的替代方法,我们提出了一种创新技术,即利用中间启动过动,解决仅依赖投入和产出的审计方法的局限性。
Article 114
Title@2025-05-29 (4): Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Title: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem? | Token Pruning in multimodalen großen Sprachmodellen: Lösen wir das richtige Problem? | 在多式大语言模式中的 Token Prurning:我们是否解决了正确的问题? 2502.11501v2 |
Authors: Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang
Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, many works have proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? Previous research has largely overlooked these problems, which hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
多式大型语言模型(MLLMs)在跨模式理解和生成方面表现显著,但仍有巨大的推论成本。最近,为解决这一问题,提出了大量工程建议,用象征性裁剪来解决该问题,确定MLLMs中的多余标牌,然后为降低计算和KV储存成本而提取这些标牌,从而导致不经培训而大幅加速。虽然这些方法声称提高了效率,但有关其基本设计和评估的关键问题仍然没有得到回答:为什么许多现有方法甚至与天真的随机随机标牌选择相比,效果不佳?基于关注的评分是否足以可靠地识别多余的标牌?语言信息在象征性裁剪期间真正有用吗?什么使得象征性重要性和重复之间的交易得当?目前的评价协议是全面和不带偏见的?对以前对这些问题的研究的无知妨碍了象征性标码运行的长期发展。在本文中,我们逐个回答这些问题,为未来象征性标标码方法的设计提供了深刻的洞察力。
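The comparison between score-based pruning and naive random selection can be sketched with two small selection functions. The scores stand in for whatever importance signal a method uses (for example attention weights); the function names and the fixed seed are illustrative assumptions, not part of any specific method from the paper.

```python
import random
from typing import List


def prune_top_k(scores: List[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring tokens, in original order."""
    keep = max(0, min(keep, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def prune_random(num_tokens: int, keep: int, seed: int = 0) -> List[int]:
    """Random-selection baseline: `keep` token indices chosen uniformly."""
    keep = max(0, min(keep, num_tokens))
    rng = random.Random(seed)  # local generator, so results are reproducible
    return sorted(rng.sample(range(num_tokens), keep))


if __name__ == "__main__":
    toy_scores = [0.1, 0.7, 0.2, 0.9]  # made-up importance scores
    print(prune_top_k(toy_scores, 2))
    print(prune_random(len(toy_scores), 2))
```

Running both selectors at the same keep ratio and comparing downstream accuracy is one simple way to check whether an importance score actually beats chance.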
Article 115
Title@2025-05-29 (4): A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Title: A Reality Check on Context Utilisation for Retrieval-Augmented Generation | Ein Realitätscheck auf Kontext-Auslastung für retrieval-Augmented Generation | 关于回收-提款人一代的上下文利用情况的现实检查 2412.17031v2 |
Authors: Lovisa Hagström, Sara Vera Marjanović, Haeun Yu, Arnav Arora, Christina Lioma, Maria Maistro, Pepa Atanasova, Isabelle Augenstein
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
检索增强生成(RAG)有助于弥补语言模型(LM)内嵌参数知识的局限性。在现实环境中,检索到的信息复杂程度各异,然而对LM上下文利用的研究大多局限于合成文本。我们提出DRUID(Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts),其中包含真实世界的查询和经人工标注立场的上下文。该数据集基于自动化声明核查这一典型任务,而自动检索真实世界证据对该任务至关重要。我们将DRUID与合成数据集(CounterFact、ConflictQA)进行比较,发现人工数据集往往无法体现真实检索上下文的复杂性和多样性。我们表明,合成数据集夸大了在真实检索数据中罕见的上下文特征,从而导致以我们提出的ACU分数衡量的上下文利用结果虚高。此外,以往工作主要关注用单一上下文特征来解释上下文利用,但在DRUID上,单一上下文属性与ACU之间的相关性与其他和上下文来源有关的属性相比出奇地小。总体而言,我们的工作强调需要开展与真实世界一致的上下文利用研究,以反映并提升真实RAG场景中的性能。
Article 116
Title@2025-05-29 (4): Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs
Title: Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs | Strukturverstärkte Protein-Instruktions-Tuning: Auf dem Weg zu einem allgemeinen Protein-Verständnis mit LLMs | 结构强化的蛋白质指导指示图示:争取与LLMs达成一般用途的蛋白性了解 2410.03553v3 |
Authors: Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang
Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. The recent development of protein language models (pLMs) with supervised fine-tuning provides a promising solution to this problem. However, a fine-tuned model is tailored to a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure-aware module into pLMs to enrich their structural knowledge, and subsequently integrates these enhanced pLMs with large language models (LLMs) to advance protein understanding. In this framework, we propose a novel instruction tuning pipeline. First, we warm up the enhanced pLMs using contrastive learning and structure denoising. Then, caption-based instructions are used to establish a basic understanding of proteins. Finally, we refine this understanding by employing a mixture of experts (MoEs) to capture more complex properties and functional information with the same number of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model. Extensive experiments on both open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
Proteins作为基本的生物分子,在生物过程中发挥着核心作用,包括代谢反应和DNA复制。准确预测其属性和功能对于生物应用至关重要。最近开发的蛋白质语言模型(pLMs),经过监督的微调,为解决这一问题提供了有希望的解决办法。然而,微调模型是针对特定下游预测任务而设计的,实现通用蛋白理解仍然是一个挑战。在本文件中,我们引入了结构强化的蛋白质指导(SEPIT)框架,以弥合这一差距。我们的方法将新的结构认知模块纳入pLMs,以丰富其结构知识,随后将这些强化的PLMs与大语言模型(LLMS)结合,以推进蛋白质理解。在此框架内,我们提出一个新的指令调整管道。首先,我们利用对比性学习和结构分解结构来暖化增强的PLMs。然后,基于字幕的指示用于建立对蛋白质的开放式基本理解。最后,我们通过使用专家混合物(MoE)来完善这一理解其结构,用大型语言模型和功能性数据模拟,以显示最复杂的蛋白质数据,从而构建了我们最精细的SLILMsalalal-alal-al-al-al-al-al-al-al-alde的高级的模拟,从而可以演示到最高级数据,从而显示我们最高级的高级的高级的模拟的模拟的模拟的高级数据。
Article 117
Title@2025-05-29 (4): Skywork Open Reasoner 1 Technical Report
Title: Skywork Open Reasoner 1 Technical Report | Skywork Open Reasoner 1 Technischer Bericht | Skywork Open Reasoner 1 技术报告 2505.22312v2 |
Authors: Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou
The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.
DeepSeek-R1的成功突显了强化学习(RL)在提高大型语言模型(LLMs)推理能力方面的重要作用。在这项工作中,我们展示了Skywork-OR1,这是面向长思维链(CoT)模型的一种有效且可扩展的RL实现。在DeepSeek-R1-Distill模型系列的基础上,我们的RL方法取得了显著的性能提升,使32B模型在AIME24、AIME25和LiveCodeBench上的平均准确率从57.8%提高到72.8%(+15.0%),7B模型从43.6%提高到57.5%(+13.9%)。我们的Skywork-OR1-32B模型在AIME24和AIME25基准上超过了DeepSeek-R1和Qwen3-32B,同时在LiveCodeBench上取得了相当的结果。Skywork-OR1-7B和Skywork-OR1-Math-7B模型在类似规模的模型中展示了有竞争力的推理能力。我们对训练流程的核心组成部分进行了全面的消融研究,以验证其有效性。此外,我们深入研究了熵崩溃现象,确定了影响熵动态的关键因素,并证明缓解过早的熵崩溃对于提高测试性能至关重要。为支持社区研究,我们完全开源了模型权重、训练代码和训练数据集。
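The entropy collapse mentioned in this abstract describes a policy whose next-token distribution becomes nearly deterministic during RL training. The following is a minimal sketch, assuming entropy is tracked as the mean Shannon entropy of per-token softmax distributions; the function names and sample logits are illustrative and are not taken from the Skywork-OR1 code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_logits):
    """Average entropy over a batch of token positions."""
    return sum(token_entropy(l) for l in batch_logits) / len(batch_logits)

# Hypothetical logits: a uniform (exploratory) policy vs. a collapsed one.
diverse = [[0.0, 0.0, 0.0, 0.0]]
collapsed = [[20.0, 0.0, 0.0, 0.0]]
print(mean_entropy(diverse), mean_entropy(collapsed))
```

A monitoring loop could log this value per training step; a rapid fall toward zero early in training would be the "premature" collapse the abstract refers to.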
Article 118
Title@2025-05-29 (4): Tensor Product Attention Is All You Need
Title: Tensor Product Attention Is All You Need | Tensor-Produkt-Aufmerksamkeit ist alles, was Sie brauchen | 张量积注意力就是你所需要的一切 2501.06425v4 |
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew C Yao
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA’s memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
扩展语言模型以处理更长的输入序列通常需要大量的键值(KV)缓存,从而在推理过程中产生大量内存开销。在本文中,我们提出张量积注意力(TPA),这是一种新的注意力机制,它使用张量分解来紧凑地表示查询、键和值,在推理时大幅缩小KV缓存的大小。通过将这些表示分解为上下文相关的低秩分量,并与旋转位置嵌入(RoPE)无缝结合,TPA在提高内存效率的同时提升了模型质量。在TPA的基础上,我们引入了张量积注意力Transformer(T6),这是一种用于序列建模的新模型架构。通过对语言建模任务进行广泛的实证评估,我们证明T6在包括困惑度和一系列既定评估基准在内的各种指标上超过或匹配标准Transformer基线的性能,这些基线包括多头注意力(MHA)、多查询注意力(MQA)、分组查询注意力(GQA)和多头潜在注意力(MLA)。值得注意的是,TPA在解码阶段的内存效率和计算效率使其能够在固定资源约束下处理更长的序列,解决了现代语言模型中一个关键的可扩展性挑战。代码可在 https://github.com/tensorgi/T6 获取。
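To make the cache saving concrete, here is a rough sketch of a low-rank factorization of one token's key matrix. It assumes the per-token key across `heads` heads of width `dim` is approximated as an average of `rank` outer products of a head-side factor and a feature-side factor, and that only the factors are cached. The averaging, the rank value and all names are assumptions for illustration; the abstract states only that queries, keys and values are factorized into contextual low-rank components.

```python
def reconstruct(a_factors, b_factors):
    """Rebuild a (heads x dim) matrix as the mean of rank-1 outer products.

    a_factors: `rank` vectors of length `heads` (head-side factors).
    b_factors: `rank` vectors of length `dim` (feature-side factors).
    """
    rank = len(a_factors)
    heads, dim = len(a_factors[0]), len(b_factors[0])
    out = [[0.0] * dim for _ in range(heads)]
    for a, b in zip(a_factors, b_factors):
        for i in range(heads):
            for j in range(dim):
                out[i][j] += a[i] * b[j] / rank
    return out

def cache_floats_per_token(heads, dim, rank=None):
    """Floats cached per token for keys plus values.

    rank=None models a standard multi-head cache (full K and V matrices);
    otherwise only the factors are stored.
    """
    if rank is None:
        return 2 * heads * dim
    return 2 * rank * (heads + dim)

# Hypothetical configuration: 32 heads of width 128, rank-2 factors.
full = cache_floats_per_token(32, 128)
factored = cache_floats_per_token(32, 128, rank=2)
print(full, factored, full / factored)
```

Under these assumed sizes the factored cache is roughly an order of magnitude smaller per token; the actual ratio depends on the ranks a real model uses.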
Article 119
Title@2025-05-29 (4): Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers
Title: Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers | Automatische Konstruktion mehrerer Klassifizierungsdimensionen für die Verwaltung von Ansätzen in wissenschaftlichen Papieren | 科学文件中管理方法的多重分类方面自动构建 2505.23252v1 |
Authors: Bing Ma, Hai Zhuge
Approaches form the foundation of scientific research. Querying approaches across a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in finding and using relevant approaches. Constructing multiple dimensions over approaches and managing approaches along these dimensions can provide an efficient solution. First, this paper identifies approach patterns in a top-down manner, refining the patterns through four distinct linguistic levels: the semantic, discourse, syntactic, and lexical levels. Approaches in scientific papers are extracted based on these approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes representing each step as a tree structure and measuring the similarity between steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for the approach components within each dimension by merging, in each iteration, each approach component or class with its most similar approach component or class. The class labels generated during clustering indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. Applying approach queries to this multi-dimensional approach space demonstrates that querying within it ensures strong relevance between user queries and results and rapidly reduces the search space through a class-based query mechanism.
首先,本文件用自上而下的方式确定方法模式,通过四种不同的语言层次(语义层次、话语层次、合成层次和词汇层次)来完善模式。科学论文中的方法是建立在方法模式基础上的。此外,利用这些模式确定了分类方法的五个层面。本文建议使用树结构来代表步骤和测量不同步骤之间的相似性,并采用基于树结构的类似性衡量方法,侧重于合成层次的相似性。建议采用收集相似性衡量方法,以比较方法之间的相似性。提议采用底栖群落算法,通过将每种方法的组成部分或类别及其最相似的方法组成部分或类别合并起来,在每个类别中绘制。此外,使用这些模式确定了分类方法的五个层面。本文提议使用树结构来代表步骤并衡量不同步骤之间的相似性。本文建议,以基于树木结构的类似性衡量方法的多个层面为中心点。 将每个类别中的用户层次搜索方法的通用比例在课堂内部和每个类别内部使用的多层次查询方法中,将使用空间层次方法的组合式分析方法。
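The bottom-up clustering described in this abstract can be illustrated with a simplified sketch. It swaps in plain Jaccard similarity over word sets for the paper's tree-structure-based measure, merges only the single most similar pair per iteration, and uses the shared words as the class label; the threshold and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bottom_up_cluster(items, threshold=0.3):
    """Greedy agglomerative clustering over short text items.

    Each cluster is a (label_terms, members) pair. The closest pair is
    merged repeatedly until no pair reaches `threshold`; the merged label
    keeps only the terms both sides share.
    """
    clusters = [(set(item.split()), [item]) for item in items]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = jaccard(clusters[i][0], clusters[j][0])
                if score > best:
                    best, pair = score, (i, j)
        if best < threshold:
            break
        i, j = pair
        label = clusters[i][0] & clusters[j][0]
        members = clusters[i][1] + clusters[j][1]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [(label, members)]
    return clusters

# Hypothetical approach components.
steps = ["train neural network", "train neural classifier", "collect survey data"]
for label, members in bottom_up_cluster(steps):
    print(sorted(label), members)
```

Recording each merge as a parent node rather than flattening the members would yield a class tree of the kind the abstract mentions.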
Article 120
Title@2025-05-29 (4): SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents
Title: SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents | SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents | SOTOPIA-$Ω$:面向社会智能体的动态策略注入学习与社会指令遵循评估 2502.15538v3 |
Authors: Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, Tingwen Liu
Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA-$\Omega$ framework aims to bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects multi-step reasoning strategies inspired by negotiation theory and two simple direct strategies into expert agents, thereby automating the construction of a high-quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S-IF) and propose two new S-IF evaluation metrics that complement social capability. We demonstrate that several 7B models trained on this high-quality corpus not only significantly surpass the expert agent (GPT-4) in achieving social goals but also improve S-IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which is especially effective at breaking an agent’s prolonged deadlock.
尽管人类以前拥有丰富的社会战略,但是仍然缺乏专门研究,专门研究这些战略的转移和融入社会行为主体的问题。我们提议的SOTOPIA-$\Omega$框架旨在解决和弥合这一差距,特别侧重于提高语言行为主体的社会能力。这个框架以动态的方式将谈判理论和两项简单的直接战略所启发的多步推理战略注入专家代理人,从而使建立高质量的社会对话培训资料库的工作自动化。此外,我们引入了“社会指导跟踪”概念,并提出了两项新的S-IF评估指标,以补充社会能力。我们证明,若干经过培训的7B模式不仅大大超越了专家代理人(GPT-4)实现社会目标的能力,而且还加强了S-IF的业绩。分析和变式实验证实了动态建设的优势,特别是能够打破该代理人长期僵局的优势。
Article 121
Title@2025-05-29 (4): Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts
Title: Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Autonome Datenauswahl mit Zero-shot Generative Klassifikatoren für mathematische Texte | 具有数学文本零光生成分类器的自动数据选择 2402.07625v6 |
Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew C Yao
We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot “generative classifiers” to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model’s logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
我们推出自动数据选择(AutoDS) , 这是一种将基本语言模型本身作为零光“ 遗传分类器” 来自动翻译高质量数学文本的方法。 与以前要求人为说明或培训专用数据过滤器的方法不同, AutoDS 完全依靠模型的登录来确定某一特定通道是否具有数学上的信息和教育性。 通过将AutoDS纳入持续的培训前管道,我们大大提升了具有挑战性的数学基准(MATH、GSM8K和BBH)的下游性能,同时使用远比以往少得多的符号。 从目前来看,我们的方法在强化基线的预培训标语效率上取得了双重改进,强调了在加强数学推理过程中自行选择数据的潜力。 我们发行了我们经过校准的AutoMathText数据集,以促进未来在自动特定域数据曲线上的研究。 AutoMatext数据集可在https://huggingface.co/datasts/math-ai/AutoMatthText上查阅。 代码可在 https://github.com/yfanzhah- promatthTextText查阅。
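One way to turn a base model's logits into a zero-shot quality score is sketched below. It assumes the model is prompted with a yes/no question about a passage and that the score is the softmax probability of the "yes" token restricted to the two answer tokens. The scoring form, the threshold and the sample logits are assumptions for illustration and not a reproduction of the AutoDS code.

```python
import math

def lm_score(logit_yes, logit_no):
    """Two-way softmax over the 'yes' and 'no' answer logits."""
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def select(passages, threshold=0.6):
    """Keep passages whose score reaches the (hypothetical) threshold.

    passages: iterable of (text, logit_yes, logit_no) tuples, where the
    logits would come from a single forward pass of the base model.
    """
    return [text for text, ly, ln in passages if lm_score(ly, ln) >= threshold]

# Hypothetical logits for two passages.
candidates = [
    ("Proof that sqrt(2) is irrational ...", 2.0, 0.0),
    ("Celebrity gossip roundup ...", 0.0, 2.0),
]
print(select(candidates))
```

Because the score needs only two logits per prompt, no human labels or separately trained filter are involved, which matches the motivation given in the abstract.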
Article 122
Title@2025-05-29 (4): ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
Title: ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering | ChartMind: Ein umfassender Benchmark für komplexe multimodale Chart-Fragebeantwortung | 图表Mind:复杂现实世界多式联运图表问题回答综合基准 2505.23242v1 |
Authors: Jingxuan Wei, Nan Xu, Junnan Zhu, Yanni Hao, Gaowei Wu, Bihui Yu, Lei Wang
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
图表解答(CQA)已成为评价视觉语言模型推理能力的关键多式联运任务。早期方法通过注重视觉特征或利用大规模培训前的优势,显示了有希望的业绩,但大多数现有评价依赖僵硬的产出格式和客观指标,从而忽视了对实用图表分析的复杂、现实世界的要求。在本文件中,我们引入了ChartMind,这是为现实世界环境中复杂的CQA任务设计的新基准。ChartMind涵盖七个任务类别,包括多种语言背景,支持开放的文本产出,并包含不同的图表格式,缩小现实世界应用与传统学术基准之间的差距。此外,我们提出了一个符合环境的、但却是模型的、不可知的框架(ChartLLM),重点是提取关键背景要素,减少噪音,并提高多式联运大型语言模型的推理准确性。对“CartMind”和三个具有代表性的公共基准的14个主流多式联运模型的广泛评价显示我们的框架大大超越了CQA前三个共同模式:遵循指令、OCR-hanced和连锁概念格式,缩小了现实应用与传统学术基准之间的差距。此外,我们建议了制定更稳健的世界研究图表的重要性。
Article 123
Title@2025-05-29 (4): PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
Title: PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts | PolyMath: Mathematische Vernunft in multilingualen Kontexten bewerten | 多语制:多语种背景下的数学理由评估 2504.18428v3 |
Authors: Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou
In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
本文介绍多语种数学推理基准PolyMath,该基准涵盖18种语言和4种简单易懂的难度。我们的基准确保了难度的全面性、语言多样性和高质量的翻译,使其成为推理有限责任公司时代高度歧视性的多语种数学基准。我们对先进的LMS进行全面评价,发现即使是Qwen-3-235B-A22B-Thinking和Gemini-2.5-pro,也只达到54.6和52.2基准分,在最高水平下达到大约40%的精确度。 从语言角度看,我们的基准揭示了LMS在多语种推理方面的几个关键挑战:(1) 现行LMMS的判断性能差异很大;(2) 投入-产出语言的一致性在推理LMS中很低,可能与业绩相关;(3) 现有LMS的思维长度因语言而有很大差异。 此外,我们证明,控制指示中的输出语言有可能影响推理性,特别是一些低资源语言的成绩,表明提高LMS的多语种能力的有希望的方向。
Article 124
Title@2025-05-29 (4): Pandora’s Box or Aladdin’s Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
Title: Pandora’s Box or Aladdin’s Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models | Pandora’s Box oder Aladdin’s Lampe: Eine umfassende Analyse, die die Rolle des RAG-Geräuschs in großen Sprachmodellen aufzeigt | Pandora的盒子或Aladdin的灯光:全面分析RAG噪音在大语言模型中的作用 2408.13533v3 |
Authors: Jinyang Wu, Shuai Zhang, Feihu Che, Mingkuan Feng, Pengpeng Shao, Jianhua Tao
Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability. In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks. Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance. Our analysis offers insights for developing more robust, adaptable RAG solutions and mitigating hallucinations across diverse retrieval scenarios. Code is available at https://github.com/jinyangwu/NoiserBench.
在最近的研究中,RAG模型将RAG模型扩大到复杂的噪音情况,但这些探索往往局限于有限的噪音类型,并假定噪音本身对LLMS有害,有可能偏离现实世界的检索环境和限制实际适用性。在本文件中,我们从语言角度界定了7种不同的噪音类型,并建立了一个噪音RAG基准(NoiserBench),这是一个涵盖多种数据集和推理任务的综合评价框架。通过对8个具有不同结构和规模的代表性LLMS进行实证评估,我们发现这些噪音可以进一步分为两个实用的类别:对LLMS(aka有益噪音)有利的噪音和对LLMS有害的噪音(ka有害噪音)。虽然有害噪音通常会损害性能,但有利的噪音可能会增强模型能力和总体性能的若干方面。我们的分析为在各种检索设想中开发更稳健、更适应的RAG解决方案和减轻幻觉提供了深刻的见解。我们可在https://github.com/jinangwu/Noiserbech。
Article 125
Title@2025-05-29 (4): MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration
Title: MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration | MCTSr-Zero: Selbstreflektierende Psychologische Beratung Dialoge Generation über Prinzipien und Adaptive Exploration | MCTSr-Zero:通过原则和适应性探索生成自我反思的心理咨询对话 2505.23229v1 |
Authors: Hao Lu, Yanchi Gu, Haoyuan Huang, Yulin Zhou, Ningxin Zhu, Chen Li
The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict “correctness” criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is “domain alignment”, which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates “Regeneration” and “Meta-Prompt Adaptation” mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero’s effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
蒙特卡洛树搜索(MCTS)与大语言模型(LLM)的整合在结构化、面向问题的任务中表现出显著的成功。然而,将这些方法应用于开放式对话,例如心理咨询,带来了独特的挑战。与具有客观正确性的任务不同,治疗性对话的成功取决于主观因素,如共情参与、遵守伦理以及与人类偏好保持一致,对此,严格的“正确性”标准难以界定。因此,现有的面向结果的MCTS方法可能产生不对齐的回应。为此,我们引入了MCTSr-Zero,这是为开放式、以人为中心的对话设计的MCTS框架。它的核心创新是“领域对齐”,将MCTS的搜索目标从预定义的最终状态转向符合目标领域原则的对话轨迹(例如,咨询中的共情)。此外,MCTSr-Zero纳入了“再生”(Regeneration)和“元提示适应”(Meta-Prompt Adaptation)机制,使MCTS能够考虑根本不同的初始对话策略,从而大大拓宽探索范围。我们通过生成多轮对话数据来评估MCTSr-Zero在心理咨询中的表现,这些数据用于微调一个LLM,即PsyLLM。我们还引入了PsyEval,这是评估多轮心理咨询对话的基准。实验表明,PsyLLM在PsyEval和其他相关指标上达到了最先进的性能,验证了MCTSr-Zero在为以人为中心的领域生成高质量、符合原则的对话数据方面的有效性,并应对了LLM始终遵守复杂心理标准的挑战。
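For readers unfamiliar with how MCTS balances exploring new dialogue strategies against exploiting good ones, the sketch below shows the standard UCT selection rule. The abstract does not say which selection rule MCTSr-Zero uses, so UCT is an assumption here, and the `value` field stands in for a hypothetical principle-alignment score accumulated over rollouts.

```python
import math

def uct(total_value, visits, parent_visits, c=1.4):
    """Upper Confidence bound for Trees: mean value plus exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child dialogue move with the highest UCT score."""
    return max(children, key=lambda ch: uct(ch["value"], ch["visits"], parent_visits))

# Hypothetical candidate counselor replies with accumulated scores.
children = [
    {"name": "reflect feelings", "value": 3.0, "visits": 4},
    {"name": "offer advice", "value": 1.0, "visits": 5},
]
print(select_child(children, parent_visits=10)["name"])
```

Replacing an end-state reward with a per-trajectory alignment score changes what `value` accumulates but leaves this selection step unchanged.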
Article 126
Title@2025-05-29 (4): HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
Title: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model | HiDe-LLaVA: Hierarchische Entkopplung zur kontinuierlichen Instruktionstuning von multimodalen Großsprachenmodellen | HiDe-LLaVA:多模态大语言模型持续指令微调的层次化解耦 2503.12941v2 |
Authors: Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, equipping MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
教学调整被广泛用来改进经过事先训练的多式大语言模型(MLLM),方法是对它进行关于具体任务数据集的培训,以便更好地了解人类的指令;然而,在现实世界情景中,不可能同时收集所有可能的指令数据集;因此,使MLLM能够不断进行教学调整,对于保持其适应性至关重要;然而,现有方法往往以业绩增益来交换记忆效率,从而大大降低总体效率;在本文件中,我们建议,根据在就不同数据集进行的培训中枢 Kernel对齐(CKA) 不同模型层的变异,建立任务扩展和任务一般融合框架;此外,我们分析现有基准中的信息渗漏情况,并提出新的、更具挑战性的基准,以合理评估不同方法的绩效;全面试验显示,我们的方法与现有“最先进”方法相比,业绩有显著改进。代码和数据集公布在https://github.com/Ghy0501/Hide-LLAVA。
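Centered Kernel Alignment, which this abstract uses to compare layers across tasks, has a simple linear form. The sketch below computes linear CKA between two activation matrices (rows are examples, columns are features) in plain Python; it is a generic illustration of the metric with made-up data, not the HiDe-LLaVA implementation.

```python
import math

def center(m):
    """Subtract the column mean from every column."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def cross_frob_sq(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    Undefined (division by zero) if either input has constant features only.
    """
    x, y = center(x), center(y)
    return cross_frob_sq(x, y) / math.sqrt(cross_frob_sq(x, x) * cross_frob_sq(y, y))

# Hypothetical activations of one layer on four inputs, before and after tuning.
before = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]]
after = [[2.0 * v for v in row] for row in before]
print(linear_cka(before, after))
```

Values near 1 indicate that two sets of activations encode similar structure; layers whose CKA drops sharply across tasks are natural candidates for task-specific treatment.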
Article 127
Title@2025-05-29 (4): Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage
Title: Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage | Bidirektionale Ketten von Gedanken- und Belohnungsmechanismen zusammenführen Eine Methode zur Verbesserung von Frage-Antwort-Fähigkeiten von großen Sprachmodellen für chinesisches immaterielles Kulturerbe | 利用思想和奖赏机制的双向双向两向链 提高中国非物质文化遗产大语言模式的回答问题能力的方法 2505.08167v3 |
Authors: Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang
The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method enables the model not only to perform forward reasoning but also to improve the accuracy of the generated answers by using reverse questioning and reverse reasoning to activate the model’s latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model’s outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, BLEU-4, and ROUGE-L scores on the question-answering task. Furthermore, ablation experiments highlight the effectiveness of combining the bidirectional chains of thought with the reward mechanism. In addition, a series of generalizability experiments shows that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.
大型语言模型(LLMS)的迅速发展为推进特定领域的LLMS提供了重要的支持和机会。然而,利用无形文化遗产(ICH)数据对这些大型模型进行微调,必然会面临偏见、不正确的知识继承和灾难性的遗忘等挑战。为了解决这些问题,我们建议采用新的培训方法,将双向思维链和奖赏机制结合起来。这种方法以ICH-Qwen(一个专门为无形文化遗产领域设计的大型语言模型)为基础。拟议方法使模型不仅能够进行前瞻性推理,而且通过利用反向盘问和反向推理来激活模型的潜在知识来提高所产生答案的准确性。此外,在培训期间引入了一个奖励机制,以优化决策过程。这个机制通过不同加权办法的结构和内容评价提高模型产出的质量。我们在ICH-Qwen(ICH-Qwen)上进行了比较实验,结果表明我们的方法在准确性、逐步推理、知识蒸馏和扩增方法方面,在精确性、Bleu-4和红色-L(Reg-L)等领域中,这一方法在纸质分析方法上展示了跨层次的升级方法。
Article 128
Title@2025-05-29 (4): Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Title: Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking | Reasoning-to-Defend: Sicherheitsbewusste Reasoning kann große Sprachmodelle von Jailbreaking verteidigen | 理由到理由:安全意识理由能够捍卫从破室中使用大语言的模型 2502.12970v2 |
Authors: Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
Large Reasoning Models (LRMs) have demonstrated impressive performance across diverse domains. However, how the safety of Large Language Models (LLMs) benefits from enhanced reasoning capabilities against jailbreak queries remains unexplored. To bridge this gap, in this paper, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into LLMs’ generation. This enables self-evaluation at each step of the reasoning process, forming safety pivot tokens as indicators of the safety status of responses. Furthermore, in order to improve the accuracy of predicting pivot tokens, we propose Contrastive Pivot Optimization (CPO), which enhances the model’s perception of the safety status of given dialogues. LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their ability to defend against jailbreak attacks. Extensive experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performance. This highlights the substantial potential of safety-aware reasoning in improving the robustness of LRMs and LLMs against various jailbreaks.
大型理性模型(LRMs)在不同领域表现出了令人印象深刻的成绩,然而,大型语言模型(LLMs)的安全如何受益于针对越狱询问的强化推理能力,仍未得到探讨;为了缩小这一差距,我们在本文件中提议, “ 理性到防御 “ (R2D)这一新的培训模式,将安全意识推理机制纳入LLMs的下一代,从而能够在推理过程的每一个阶段进行自我评价,形成安全脉冲标志,作为答复安全状态的指标;此外,为了提高预测分流标志的准确性,我们提议 “ 反向优化 “ (CPO),以强化模型对特定对话安全状态的认识;LLOMs在推理过程中积极调整其应对战略,大大加强其防范越狱袭击的安全能力;广泛的实验表明,R2D有效地减轻了各种袭击,改善了总体安全,同时保持了最初的绩效。这突出表明,安全意识推理在提高LRMMs和LMs对不同破室的稳健度方面有很大的潜力。
Article 129
Title@2025-05-29 (4): DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
Title: DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | DiagnoseArena: Benchmarking Diagnostic Reasoning für große Sprachmodelle | 诊断阿勒纳:大语言模型诊断依据基准 2505.14107v4 |
Authors: Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI’s diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.
由于现有医学基准在评估高级诊断推理方面有局限性,我们提出了诊断阿纳纳,这是一个全面而具有挑战性的基准,旨在严格评估专业水平诊断能力。诊断阿纳由1 113对分解病人病例和相应的诊断组成,共涉及28个医学专业,来自10个顶级医学期刊发表的临床案例报告。该基准是通过一个细致的建设管道开发的,包括由AI系统和人类专家进行多轮筛选和审查,并进行彻底检查以防止数据泄漏。我们的研究显示,即使是最先进的推理模型(o3, o1)和DeepSeek-R1,也分别只达到51.12%、31.09%和17.79%的准确度。这一发现凸显了当前大型语言模型在面临临床诊断推理挑战时的显著普遍化瓶颈。我们通过诊断系统及人类专家的多轮筛选和审查,并为防止数据泄漏进行彻底检查。我们的研究显示,即使是最先进的推理模型(o3, o1)和Deep Seek-R1,也分别达到51.12 %、31.09%和17.79%的准确度。这一精确度。这一发现显示,在面临临床诊断推理学/IR的挑战挑战时,我们的目标是通过更深入的诊断性推理学/诊断工具提供更深入的诊断工具。
Article 130
Title@2025-05-29 (4): MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
Title: MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung | MMBoundary:通过推理步骤置信度校准提高MLLM知识边界意识 2505.23224v1 |
Authors: Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung
In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inference. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals as a warm-up for initial confidence expression, we introduce a reinforcement learning stage with multiple reward functions to further align model knowledge and calibrate confidence at each reasoning step, enhancing reasoning-chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
近年来,多模态大语言模型(MLLM)取得了长足进展,但在多模态推理方面仍面临固有挑战,这类推理需要多层次(例如感知、推理)和多粒度(例如多步推理链)的高级推断。先前关于模型置信度估计的工作往往侧重于针对整体回答进行训练和校准,却未能评估每个推理步骤的置信度,导致不良的幻觉滚雪球现象。在这项工作中,我们提出了MMBoundary,这是一个通过推理步骤置信度校准来提升MLLM知识边界意识的新框架。为此,我们建议结合互补的文本和跨模态自我奖励信号,以估计MLLM推理过程每一步的置信度。除了在这组自我奖励的置信度估计信号上对MLLM进行监督微调、以完成初始置信度表达的预热之外,我们还引入了一个带有多种奖励函数的强化学习阶段,以进一步对齐模型知识并校准每个推理步骤的置信度,从而增强推理链的自我纠正能力。实验结果表明,MMBoundary在不同领域的数据集和指标上显著优于现有方法,多模态置信度校准误差平均降低7.5%,任务性能最高提升8.3%。
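To make the notion of a step-level "confidence calibration error" concrete, here is a minimal Python sketch of expected calibration error (ECE) computed over per-step confidences. It is an illustration under stated assumptions, not the metric used by MMBoundary: the equal-width binning, the function name `expected_calibration_error`, and the idea of scoring each reasoning step as correct or incorrect are all hypothetical choices made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1] (illustrative, not the paper's code)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Per bin: summed confidence, number of correct steps, number of steps.
    sums = [0.0] * num_bins
    hits = [0] * num_bins
    counts = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        sums[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1

    total = len(confidences)
    ece = 0.0
    for s, h, n in zip(sums, hits, counts):
        if n == 0:
            continue
        # Weighted gap between mean confidence and accuracy inside the bin.
        ece += (n / total) * abs(s / n - h / n)
    return ece
```

A lower value means the stated confidence of each step tracks how often such steps are actually right; a model that says 0.9 on steps that are always correct still incurs a gap of 0.1 in that bin.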
Article 131
Title@2025-05-29 (4): KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search
Title: KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search | KBQA-o1: Agentische Wissensdatenbank Frage beantworten mit Monte Carlo Baumsuche | KBQA-o1:基于蒙特卡洛树搜索的智能体知识库问答 2501.18922v3 |
Authors: Haoran Luo, Haihong E, Yikai Guo, Qika Lin, Xiaobao Wu, Xinyu Mu, Wenhao Liu, Meina Song, Yifan Zhu, Luu Anh Tuan
Knowledge Base Question Answering (KBQA) aims to answer natural language questions with a large-scale structured knowledge base (KB). Despite advancements with large language models (LLMs), KBQA still faces challenges in weak KB awareness, imbalance between effectiveness and efficiency, and high reliance on annotated data. To address these challenges, we propose KBQA-o1, a novel agentic KBQA method with Monte Carlo Tree Search (MCTS). It introduces a ReAct-based agent process for stepwise logical form generation with KB environment exploration. Moreover, it employs MCTS, a heuristic search method driven by policy and reward models, to balance agentic exploration’s performance and search space. With heuristic exploration, KBQA-o1 generates high-quality annotations for further improvement by incremental fine-tuning. Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting the Llama-3.1-8B model’s GrailQA F1 performance to 78.5%, compared to 48.5% for the previous state-of-the-art method with GPT-3.5-turbo. Our code is publicly available.
知识库问答(KBQA)旨在利用大规模结构化知识库(KB)回答自然语言问题。尽管大语言模型(LLM)带来了进步,KBQA仍面临知识库感知薄弱、效果与效率失衡以及高度依赖标注数据等挑战。为应对这些挑战,我们提出了KBQA-o1,这是一种结合蒙特卡洛树搜索(MCTS)的新型智能体式KBQA方法。它引入了一个基于ReAct的智能体流程,在探索知识库环境的同时逐步生成逻辑形式。此外,它采用由策略模型和奖励模型驱动的启发式搜索方法MCTS,以平衡智能体探索的性能与搜索空间。借助启发式探索,KBQA-o1可生成高质量标注,并通过增量微调进一步改进。实验结果表明,在标注数据有限的情况下,KBQA-o1优于以往的低资源KBQA方法,将Llama-3.1-8B模型在GrailQA上的F1提升至78.5%,而此前使用GPT-3.5-turbo的最优方法为48.5%。我们的代码已公开。
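The abstract describes MCTS as a search that balances exploration performance against search space. As a rough illustration of that trade-off, the sketch below shows the textbook UCT selection rule and reward backpropagation in Python. It is a generic stand-in, not the KBQA-o1 procedure: the paper drives its search with learned policy and reward models, whereas here the `Node` class, the action strings, and the exploration constant `c` are assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One search state; `action` is a hypothetical logical-form step."""
    action: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Mean reward (exploitation) plus a bonus for rarely visited children."""
    if child.visits == 0:
        return math.inf  # always try unvisited actions first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select_child(parent: Node, c: float = 1.4) -> Optional[Node]:
    """Pick the child with the highest UCT score, or None for a leaf."""
    if not parent.children:
        return None
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Credit a rollout's reward to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

Raising `c` widens the search at the cost of more rollouts; lowering it commits earlier to actions that already look good.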
Article 132
Title@2025-05-29 (4): Reducing Tool Hallucination via Reliability Alignment
Title: Reducing Tool Hallucination via Reliability Alignment | Reduzieren der Werkzeughalluzination durch Zuverlässigkeitsanpassung | 通过可靠性调整减少工具幻觉 2412.04141v3 |
Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu
Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations, where models either select inappropriate tools or misuse them, pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types, tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions.
大型语言模型(LLMS)扩大了其能力,超越了语言生成,与外部工具互动,使自动化和现实世界应用成为可能;然而,工具幻觉,如果模型选择不适当的工具或滥用这些工具,则构成重大挑战,导致任务执行错误、计算成本增加和系统可靠性降低;为了系统地解决这一问题,我们定义工具幻觉并将其分为两大类,即工具选择幻觉和工具使用幻觉;为了评估和缓解这些问题,我们引入了RelyToolBench,它整合了专门测试案例和新颖的衡量标准,以评估幻觉任务的成功和效率。最后,我们提议了一个可靠性调整框架,即扩大工具使用行动空间,以包括不精确的行动,允许LLMS推迟工具的使用,寻求澄清,或动态调整工具选择。我们通过广泛的实验,证明Relign显著减少工具幻觉,提高任务可靠性,提高LM工具互动的效率。
Article 133
Title@2025-05-29 (4): Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces
Title: Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces | Verbesserung der parallelen Programmleistung mit LLM-Optimierern über Agent-System-Schnittstellen | 通过代理-系统接口改进与LLM优化器的平行方案绩效 2410.15625v3 |
Authors: Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken
Modern scientific discovery increasingly relies on high-performance computing for complex modeling and simulation. A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory, a process dictated by intricate, low-level system code known as mappers. Developing high-performance mappers demands days of manual tuning, posing a significant barrier for domain scientists without systems expertise. We introduce a framework that automates mapper development with generative optimization, leveraging richer feedback beyond scalar performance metrics. Our approach features the Agent-System Interface, which includes a Domain-Specific Language (DSL) to abstract away the low-level complexity of system code and define a structured search space, as well as AutoGuide, a mechanism that interprets raw execution output into actionable feedback. Unlike traditional reinforcement learning methods such as OpenTuner, which rely solely on scalar feedback, our method finds superior mappers in far fewer iterations. With just 10 iterations, it outperforms OpenTuner even after 1000 iterations, achieving 3.8X faster performance. Our approach finds mappers that surpass expert-written mappers by up to 1.34X speedup across nine benchmarks while reducing tuning time from days to minutes.
现代科学发现日益依赖高性能计算来进行复杂的建模和模拟。提升并行程序性能的一个关键挑战,是高效地将任务映射到处理器、将数据映射到内存,这一过程由复杂的底层系统代码(即映射器,mapper)决定。开发高性能映射器需要数天的人工调优,这对缺乏系统专业知识的领域科学家构成了巨大障碍。我们提出了一个借助生成式优化自动开发映射器的框架,它利用了比标量性能指标更丰富的反馈。我们的方法以智能体-系统接口(Agent-System Interface)为核心,其中包括一种领域特定语言(DSL),用于屏蔽系统代码的底层复杂性并定义结构化的搜索空间,以及AutoGuide,一种将原始执行输出解释为可操作反馈的机制。与OpenTuner等仅依赖标量反馈的传统强化学习方法不同,我们的方法能以少得多的迭代次数找到更优的映射器。仅用10次迭代,它就超过了OpenTuner迭代1000次后的结果,性能快3.8倍。在九个基准上,我们的方法找到的映射器相对专家编写的映射器最高可实现1.34倍加速,同时将调优时间从数天缩短到数分钟。
Article 134
Title@2025-05-29 (4): System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts
Title: System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts | System-1.5 Reasoning: Traversal in Sprach- und Latentenräumen mit dynamischen Shortcuts | System-1.5 推理:借助动态捷径在语言空间与潜在空间中穿行 2505.18962v2 |
Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent space. Specifically, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning). Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.
思维链(CoT)推理使大语言模型(LLM)能够超越快速的System-1反应,进行深思熟虑的System-2推理。然而,冗长的中间输出带来了严重的效率损失。近期的潜在空间推理方法直接在隐藏状态上运算而不解码为语言,从而提高了效率,但它们对所有步骤一视同仁,无法区分关键推导与辅助步骤,导致计算资源的利用不够理想。在本文中,我们提出了System-1.5推理,这是一个自适应推理框架,通过潜在空间中的捷径路径在各推理步骤之间动态分配计算量。具体而言,System-1.5推理引入了两类动态捷径。模型深度捷径(DS)沿纵向深度进行自适应推理:非关键词元通过轻量级适配器分支提前退出,关键词元则继续通过更深的Transformer层。步骤捷径(SS)在解码步骤之间复用隐藏状态,以跳过琐碎步骤,并在潜在空间中进行横向推理。System-1.5推理的训练包含两阶段自蒸馏过程:首先将自然语言CoT蒸馏为潜在空间中的连续思维,然后将全路径的System-2潜在推理蒸馏为自适应捷径路径(System-1.5推理)。在推理任务上的实验证明了我们方法的优越性能。例如,在GSM8K上,System-1.5推理取得了与传统CoT微调方法相当的推理性能,同时将推理速度提高20倍以上,平均减少92.31%的词元生成量。
Article 135
Title@2025-05-29 (4): FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning
Title: FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning | FCMR: Robuste Bewertung der finanziellen Cross-Modal Multi-Hop Reasoning | FCMR:金融跨模态多跳推理的稳健评估 2412.12567v3 |
Authors: Seunghee Kim, Changhyeon Kim, Taeuk Kim
Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels (Easy, Medium, and Hard), facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.
现实世界的决策往往需要整合多种模态的信息并在其上进行推理。虽然近期的多模态大语言模型(MLLM)在这类任务中展现了潜力,但它们跨不同来源进行多跳推理的能力仍未得到充分评估。MMQA等现有基准面临以下挑战:(1)数据污染;(2)缺少需要跨两种以上模态进行操作的复杂查询,这妨碍了准确的性能评估。为解决这一问题,我们提出了金融跨模态多跳推理基准(FCMR),它要求MLLM结合金融领域的文本报告、表格和图表中的信息,以此分析其推理能力。FCMR分为简单、中等和困难三个难度级别,便于逐级评估。特别是,困难级别的问题需要精确的跨模态三跳推理,并且其设计可防止忽略任何一种模态。在这一新基准上的实验表明,即使是最先进的MLLM也表现吃力,表现最好的模型(Claude 3.5 Sonnet)在最具挑战性的级别上也只达到30.4%的准确率。我们还进行了分析,以深入了解模型的内部运作方式,其中包括发现信息检索阶段存在一个关键瓶颈。
Article 136
Title@2025-05-29 (4): Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection
Title: Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection | Multimodale Inverse Aufmerksamkeit Netzwerk mit Intrinsic Discriminant Feature Exploitation für gefälschte Nachrichten Erkennung | 利用内在判别特征的多模态逆注意力网络用于假新闻检测 2502.01699v2 |
Authors: Tianlin Zhang, En Yu, Yi Shao, Jiande Sun
Multimodal fake news detection has garnered significant attention due to its profound implications for social security. While existing approaches have contributed to understanding cross-modal consistency, they often fail to leverage modal-specific representations and explicit discrepant features. To address these limitations, we propose a Multimodal Inverse Attention Network (MIAN), a novel framework that explores intrinsic discriminative features based on news content to advance fake news detection. Specifically, MIAN introduces a hierarchical learning module that captures diverse intra-modal relationships through local-to-global and local-to-local interactions, thereby generating enhanced unimodal representations to improve the identification of fake news at the intra-modal level. Additionally, a cross-modal interaction module employs a co-attention mechanism to establish and model dependencies between the refined unimodal representations, facilitating seamless semantic integration across modalities. To explicitly extract inconsistency features, we propose an inverse attention mechanism that effectively highlights the conflicting patterns and semantic deviations introduced by fake news in both intra- and inter-modality. Extensive experiments on benchmark datasets demonstrate that MIAN significantly outperforms state-of-the-art methods, underscoring its pivotal contribution to advancing social security through enhanced multimodal fake news detection.
多模态假新闻检测因其对社会安全的深远影响而备受关注。现有方法虽然有助于理解跨模态一致性,但往往未能充分利用各模态特有的表示和显式的差异特征。为解决这些局限,我们提出了多模态逆注意力网络(MIAN),这是一个新框架,它基于新闻内容挖掘内在的判别性特征,以推进假新闻检测。具体而言,MIAN引入了一个层次化学习模块,通过局部到全局以及局部到局部的交互捕捉多样的模态内关系,从而生成增强的单模态表示,改进模态内层面的假新闻识别。此外,跨模态交互模块采用协同注意力机制,在精炼后的单模态表示之间建立依赖关系并对其建模,促进跨模态的无缝语义融合。为了显式提取不一致特征,我们提出了一种逆注意力机制,它能有效突出假新闻在模态内和模态间引入的冲突模式和语义偏差。在基准数据集上的大量实验表明,MIAN显著优于最先进的方法,凸显了它通过增强多模态假新闻检测来促进社会安全的重要贡献。
Article 137
Title@2025-05-29 (4): BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Title: BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning | BioProBench: Umfassender Datensatz und Benchmark im Biologischen Protokoll Verständnis und Vernunft | BioProBench:生物实验方案理解与推理的综合数据集和基准 2505.07889v2 |
Authors: Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Li Yuan, Yonghong Tian
Biological protocols are fundamental to reproducibility and safety in life science research. While large language models (LLMs) perform well on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. While there are several benchmark tasks involving protocol question answering, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs. Experimental results reveal that some models perform well on basic understanding tasks (e.g., ~70% PQA-Acc., >64% ERR F1), but struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons show diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, BioProBench, through its task design and experimental findings, systematically reveals the fundamental challenges for current LLMs in procedural knowledge understanding, deep adaptability to specific domains, reliability of structured reasoning, and handling of sophisticated precision and safety constraints, providing key directions for future AI in the field of scientific experiment automation. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.
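生物实验方案(protocol)是生命科学研究可重复性和安全性的基础。虽然大语言模型(LLM)在一般任务上表现良好,但针对这类高度专业化、对准确性要求严苛且本质上具有程序性的文本,对它们的系统评估仍然有限。在这项工作中,我们提出了BioProBench,这是首个面向生物实验方案理解与推理的大规模多任务基准。虽然已有若干涉及方案问答的基准任务,但BioProBench提供了一套完整的五项核心任务:方案问答、步骤排序、错误纠正、方案生成和方案推理,从而能够对LLM处理程序性生物文本的能力进行整体评估。它以2.7万份原始方案为基础,产生了近55.6万个高质量结构化实例。我们评估了12个主流开源/闭源LLM。实验结果表明,一些模型在基本理解任务上表现良好(例如PQA-Acc.约70%,ERR F1高于64%),但在排序和生成等深度推理与结构化生成任务上明显吃力。此外,模型比较显示出多样的表现:某些开源模型在部分任务上接近闭源模型的水平,而生物领域专用的小模型落后于通用LLM,表明其在处理复杂程序性内容方面存在局限。总体而言,BioProBench通过其任务设计和实验发现,系统地揭示了当前LLM在程序性知识理解、对特定领域的深度适应、结构化推理的可靠性以及处理复杂的精度和安全约束方面的根本挑战,为科学实验自动化领域的未来AI指明了关键方向。代码和数据见 https://github.com/YuyangSunshine/bioprotocolbench 和 https://huggingface.co/datasets/BioProBench/BioProBench 。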
Article 138
Title@2025-05-29 (4): $T^5Score$: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets
Title: $T^5Score$: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets | $T^5Score$: Eine Methode zur automatischen Bewertung der Qualität von LLM Generated Multi-Document Topic Sets | $T^5Score$:自动评估LLM生成的多文档主题集质量的方法 2407.17390v3 |
Authors: Itamar Trainin, Omri Abend
Using LLMs for Multi-Document Topic Extraction has recently gained popularity due to their apparent high-quality outputs, expressiveness, and ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, measurable through easy-to-perform annotation tasks. This framing enables a convenient, manual or automatic, evaluation procedure resulting in a strong inter-annotator agreement score. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.
使用LLM进行多文档主题抽取近来颇受欢迎,因为其输出看起来质量高、表达力强且易于使用。然而,大多数现有评价做法并非针对LLM生成的主题而设计,导致标注者间一致性得分偏低,妨碍了LLM在该任务上的可靠使用。为解决这一问题,我们引入了$T^5Score$,这是一种评价方法,它将主题集的质量分解为若干可量化的方面,这些方面可通过易于执行的标注任务来衡量。这种框架使得评价过程既方便,又可人工或自动进行,并获得较高的标注者间一致性得分。为了证实我们的方法和主张,我们在多个数据集上进行了大量实验并报告了结果。
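Since the abstract's argument rests on inter-annotator agreement scores, a small worked example may help readers unfamiliar with how such a score is computed. The Python sketch below implements Cohen's kappa for two annotators. This is an assumption for illustration only: the abstract does not say which agreement statistic $T^5Score$ reports, and the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Expected agreement if each annotator labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well above chance, while a value near 0 means the annotators agree no more often than random labelling with the same label frequencies would.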
Article 139
Title@2025-05-29 (4): ExpeTrans: LLMs Are Experiential Transfer Learners
Title: ExpeTrans: LLMs Are Experiential Transfer Learners | ExpeTrans: LLMs sind erfahrene Transfer-Lerner | ExpeTrans:LLM是经验迁移学习者 2505.23191v1 |
Authors: Jinglong Gao, Xiao Ding, Lingxiao Zou, Bibo Cai, Bing Qin, Ting Liu
Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.
最近的研究通过提示(prompt)向大语言模型(LLM)提供文本形式的任务求解经验,以提升其性能。然而,以往的方法需要耗费大量人力或时间来为每项任务收集此类经验,鉴于用户向LLM提出的任务类型日益多样,这并不现实。为解决这一问题,我们设计了一个自主经验迁移框架,以探索LLM能否模仿人类的认知智能,自主地将经验从已有的源任务迁移到新遇到的目标任务。这不仅可以在避免以往方法高昂成本的情况下获取经验,也为LLM的泛化提供了一条新途径。在13个数据集上的实验结果表明,我们的框架有效提升了LLM的性能。此外,我们还对框架中的每个模块进行了详细分析。
Article 140
Title@2025-05-29 (4): Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration
Title: Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration | Erfahrungsübergreifendes Lernen auf LLM-basierter Multi-Agent-Kollaboration | 基于LLM的多智能体协作中的跨任务经验学习 2505.23187v1 |
Authors: Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent’s individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
基于大语言模型的多智能体系统(MAS)在通过协作推理和智能体间互评解决复杂任务方面取得了显著进展。然而,现有方法通常孤立地处理每项任务,导致冗余计算,且在结构相似的任务之间泛化有限。为解决这一问题,我们提出了多智能体跨任务经验学习(MAEL),这是一个新框架,使LLM驱动的智能体具备显式的跨任务学习和经验积累能力。我们在图结构的多智能体协作网络上对任务求解流程进行建模,智能体通过显式连接传播信息并进行协调。在经验学习阶段,我们量化任务求解流程中每一步的质量,并将由此得到的奖励连同相应的输入和输出存入每个智能体各自的经验池。在推理阶段,智能体检索高奖励且与任务相关的经验作为少样本示例,以提高每个推理步骤的有效性,从而实现更准确、更高效的多智能体协作。在多种数据集上的实验结果表明,MAEL使智能体能够有效地从以往任务经验中学习,在当前任务上更快收敛并产生更高质量的解。
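The retrieval step described above (pulling high-reward, task-relevant experiences from an agent's pool to use as few-shot examples) can be sketched in a few lines of Python. This is a simplified illustration, not MAEL's implementation: the `Experience` record, the Jaccard word-overlap similarity, the `min_reward` threshold, and the prompt layout are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """One stored step: what the agent saw, what it produced, how it scored."""
    task_input: str
    output: str
    reward: float


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for any relevance measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(pool: List[Experience], query: str, k: int = 2,
             min_reward: float = 0.5) -> List[Experience]:
    """Drop low-reward experiences, then rank the rest by relevance."""
    kept = [e for e in pool if e.reward >= min_reward]
    kept.sort(key=lambda e: (jaccard(e.task_input, query), e.reward), reverse=True)
    return kept[:k]


def build_prompt(query: str, examples: List[Experience]) -> str:
    """Format the retrieved experiences as few-shot examples before the query."""
    shots = [f"Input: {e.task_input}\nOutput: {e.output}" for e in examples]
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])
```

Filtering on reward before ranking by similarity keeps a closely matching but poorly rated experience from being shown to the agent as a model answer.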
Article 141
Title@2025-05-29 (4): Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Title: Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement | Unüberwachte Bewertung auf Word-Level-Qualität für maschinelle Übersetzung durch die Linse der Annotatoren (Dis)Vereinbarung | 未经监督的通过标注员的镜头进行机器翻译的字级质量估计 2505.23183v1 |
Authors: Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
字级质量估计(WQE)旨在自动识别机器翻译产出中的细微错误,并发现许多用途,包括协助翻译员编辑后的工作。现代WQE技术往往费用昂贵,包括推动大型语言模型或对大量人类标签数据进行临时培训。在这项工作中,我们调查高效的替代方法,利用语言模型解释性和不确定性量化的最新进展,查明翻译模型内部工作错误。在涉及14个跨12个翻译方向的评价工作中,我们用多种人类标签来量化人类标签差异对衡量业绩的影响。我们的结果突出表明了未开发的未监督的衡量标准的潜力、在面临标签不确定性时监督方法的缺陷以及单名员评价做法的易碎。
Article 142
Title@2025-05-29 (4): Improving Continual Pre-training Through Seamless Data Packing
Title: Improving Continual Pre-training Through Seamless Data Packing | Verbesserung der kontinuierlichen Vorschulung durch nahtloses Datenpaket | 通过无缝数据打包改进持续预训练 2505.22018v2 |
Authors: Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
持续预训练在提升模型性能方面展现了巨大潜力,尤其是在特定领域的场景中。持续预训练前最常见的数据打包方法是将输入文本拼接起来,再切分为固定长度的序列。这种方法虽然简单高效,但往往导致过度截断和上下文不连续,从而损害模型性能。为解决这些问题,我们探索了数据工程在改进持续预训练方面的潜力,特别是其对模型性能和效率的影响。我们提出了无缝打包(Seamless Packing,SP),这是一种新的数据打包策略,旨在更有效地保留上下文信息并提升模型性能。我们的方法在第一阶段采用滑动窗口技术,使连续序列之间的重叠词元保持同步,确保更好的连续性和上下文连贯性。在第二阶段,我们采用首次适应递减(First-Fit-Decreasing)算法,将较短的文本装入略大于目标序列长度的箱中,从而尽量减少填充和截断。在多种模型架构和语料领域上的实证评估证明了我们方法的有效性,它在99%的设置中优于基线方法。代码见 https://github.com/Infernus-WIND/Seamless-Packing 。
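The two stages named in the abstract can be sketched with standard building blocks. The Python below is a minimal illustration under stated assumptions, not the authors' implementation (which is at the repository linked above): the function names, the fixed `overlap` parameter, and the choice to return bins as lists of text indices are hypothetical, and the real method's handling of overlap size and bin capacity may differ.

```python
from typing import List


def sliding_window_chunks(tokens: List[int], seq_len: int,
                          overlap: int) -> List[List[int]]:
    """Stage 1 sketch: split a long text so consecutive chunks share `overlap` tokens."""
    if seq_len <= 0 or not 0 <= overlap < seq_len:
        raise ValueError("require seq_len > 0 and 0 <= overlap < seq_len")
    step = seq_len - overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + seq_len])
        if start + seq_len >= len(tokens):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Stage 2 sketch: pack short texts into bins; returns item indices per bin."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for i in order:
        if lengths[i] > capacity:
            raise ValueError("item longer than bin capacity; chunk it first")
        for b, room in enumerate(free):
            if lengths[i] <= room:  # first bin with enough room wins
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i])
            free.append(capacity - lengths[i])
    return bins
```

In this sketch, `capacity` would be set slightly above the target sequence length, as the abstract describes, so that a bin can be trimmed to length with little loss instead of being padded.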
Article 143
Title@2025-05-29 (4): Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
Title: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification | Infinite-Instruct: Synthesizing Scaling Code instruction Daten mit bidirektionaler Synthese und statischer Verifikation | Infinite-Instruct:通过双向合成和静态验证合成可扩展的代码指令数据 2505.23177v1 |
Authors: Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui Kang, Yaolong Wang, Kai Gao, Lei Qiao
Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, “Reverse Construction” transforms code snippets into diverse programming problems. Then, through “Backfeeding Construction,” keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to the Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
传统的代码指令数据合成方法存在多样性有限、逻辑性差的问题。我们提出了Infinite-Instruct,这是一个用于合成高质量问答对的自动化框架,旨在增强大语言模型(LLM)的代码生成能力。该框架着重改进合成问题的内部逻辑和合成代码的质量。首先,“逆向构造”(Reverse Construction)将代码片段转化为多样的编程问题。然后,通过“回馈构造”(Backfeeding Construction),将编程问题中的关键词组织成知识图谱,再据此重构出内部逻辑更强的编程问题。最后,一条跨语言静态代码分析流水线过滤掉无效样本,以确保数据质量。实验表明,在主流代码生成基准上,我们微调后的模型在7B参数模型上平均性能提升21.70%,在32B参数模型上平均提升36.95%。我们仅使用不到十分之一的指令微调数据,就取得了与Qwen-2.5-Coder-Instruct相当的性能。Infinite-Instruct为LLM的编程训练提供了一种可扩展的解决方案。我们开源了实验中使用的数据集,包括未过滤版本和经静态分析过滤的版本。数据见 https://github.com/xingwenjing417/Infinite-Instruct-dataset
Article 144
Title@2025-05-29 (4): Map&Make: Schema Guided Text to Table Generation
Title: Map&Make: Schema Guided Text to Table Generation | Map&Make: Schema-Leittext zur Tabellenerstellung | Map&Make:模式引导的文本到表格生成 2505.23174v1 |
Authors: Naman Ahuja, Fenil Bardoliya, Chitta Baral, Vivek Gupta
Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which “dissects” text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets, Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvement results across both datasets with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.
将密集、详细、非结构化的文本转换为可解释的摘要式表格(通俗地称为文本到表格生成,Text-to-Table),是信息检索的一项基本任务。然而,现有方法没有解决应提取哪些复杂信息以及如何提取的问题,也缺乏从文本中推断数据的能力。在本文中,我们提出了一种通用方法Map&Make,它将文本“解剖”为命题式原子陈述,从而便于进行细粒度分解以提取潜在模式(schema)。随后利用该模式填充表格,使其反映原文中的定性细节和定量事实。我们在两个具有挑战性的数据集上测试了该方法:以复杂的多表模式著称的Rotowire,以及需要数值聚合的Livesum。通过仔细识别并纠正Rotowire中的幻觉错误,我们力求得到一个更干净、更可靠的基准。我们用一套全面的比较性指标和无参考指标对方法进行了严格评估。结果表明,该方法在两个数据集上均有显著提升,并使文本到表格生成具有更好的可解释性。此外,通过详细的消融研究和分析,我们考察了带来更优性能的因素,并验证了该框架在结构化摘要任务中的实用性。
Article 145
Title@2025-05-29 (4): ZIPA: A family of efficient models for multilingual phone recognition
Title: ZIPA: A family of efficient models for multilingual phone recognition | ZIPA: Eine Familie von effizienten Modellen für mehrsprachige Phonerkennung | ZIPA:一个高效的多语言音素识别模型家族 2505.23170v1 |
Authors: Jian Zhu, Farhan Samir, Eleanor Chodroff, David R. Mortensen
We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage the efficient Zipformer backbones and outperform existing phone recognition systems with much fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
我们提出了ZIPA,这是一个高效语音模型家族,推进了跨语言音素识别(phone recognition)的最先进性能。我们首先整理了IPAPack++,这是一个大规模多语言语音语料库,包含17,132小时的规范化音素转写,以及一个涵盖未见语言和社会语音变异的新评测集。借助大规模训练数据,ZIPA(包括transducer变体ZIPA-T和基于CTC的变体ZIPA-CR)利用高效的Zipformer骨干网络,以少得多的参数超越了现有音素识别系统。通过在11,000小时伪标注多语言数据上进行噪声学生训练(noisy student training)来进一步扩大规模,带来了进一步的提升。虽然ZIPA在基准上表现出色,但错误分析显示其在社会语音多样性建模方面仍存在持续的局限,凸显了未来研究面临的挑战。
Article 146
Title@2025-05-29 (4): Tell, Don’t Show: Leveraging Language Models’ Abstractive Retellings to Model Literary Themes
Title: Tell, Don’t Show: Leveraging Language Models’ Abstractive Retellings to Model Literary Themes | Tell, Don’t Show: Die abstrakten Retellings von Sprachmodellen nutzen, um literarische Themen zu modellieren | Tell, don’t show: 利用语言模型对示范文学主题的抽象引用 2505.23166v1 |
Authors: Li Lucy, Camilla Griffiths, Sarah Levine, Jennifer L. Eberhardt, Dorottya Demszky, David Bamman
Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to “show, don’t tell.” We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives’ surface forms into higher-level concepts and themes. By running LDA on LMs’ retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method’s outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
文学挑战词汇方法,因为叙事语言侧重于隐性感官细节,而不是抽象描述或解说:我们建议作者们“展示,不要告诉”。我们提议了Retell,一种简单、方便的文献主题模型方法。在这里,我们用资源效率高、具有基因特征的语言模型(LMS)来说明所显示的段落,从而将描述的表面形式转化为更高层次的概念和主题。通过在LMS的复写词中运行LDA,我们可以得到比单独运行LDA或直接要求LMS列出主题更准确和更加丰富的专题。为了调查我们的文化分析方法的潜力,我们将我们的方法产出与高中英语艺术书籍中种族/文化特征案例研究的专家指导说明进行比较。
Article 147
Title@2025-05-29 (4): Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach
Title: Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach | Temporale Beziehungsextraktion in klinischen Texten: Ein Span-basierter Graph Transformer-Ansatz | 临床文本中的时间关系抽取时间关系:基于泛泛面的图形变形器方法 2503.18085v2 |
Authors: Rochana Chaturvedi, Peyman Baghershahi, Sourav Medya, Barbara Di Eugenio
Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves the state-of-the-art with 5.5% improvement in the tempeval $F_1$ score over the previous best and up to 8.9% improvement on long-range relations, which presents a formidable challenge. We further demonstrate generalizability by establishing a strong baseline on the E3C corpus. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.
从非结构化文本中抽取时空信息对于使事件背景化和产生可操作的洞察力至关重要,特别是在医疗领域。我们利用研究周密的2012年I2B2《时际关系挑战》来应对提取临床事件及其时间关系的任务。由于复杂的临床语言、长的文件和稀疏的注释,这项任务具有内在挑战性。我们引入了GRAPHTREX,这是将基于跨实体关系的提取、临床预先培训的大型语言模型(LPLMS)和异质图形变异器(HGT)整合在一起的新方法,以捕捉本地和全球依赖性。我们HGT部分不仅通过连接遥远实体的创新的全球里程碑促进在文件中的信息传播,而且通过强化的时空推理推理为改进诊断和预测模型打下基础。
Article 148
Title@2025-05-29 (4): Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs
Title: Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs | Zu konsequent, um zu erkennen: Eine Studie über selbstkonsistente Fehler in LLMs | 过于一致,无法检测:LLMM中自相矛盾错误的研究 2505.17656v2 |
Authors: Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, Xueqi Cheng
As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods struggle significantly to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
由于大型语言模型(LLMs)往往产生合理但不正确的内容,发现错误对于确保真实性越来越重要,但是,现有的检测方法往往忽略一个我们称之为自相矛盾错误的关键问题,即LLMs反复在多个随机抽样中产生同样的不正确反应。这项工作正式定义了自相矛盾的错误,并评价了这些错误的主要检测方法。我们的调查揭示了两个主要结论:(1) 与前后不一致的错误不同,这些错误的频率随着LLM规模的扩大而大大降低,自我一致错误的频率仍然稳定,甚至增加。 (2) 所有四种检测方法都为发现自相矛盾的错误而挣扎。这些发现揭示了当前检测方法的严重局限性,并强调了改进方法的必要性。由于观察到LLMs之间自相矛盾的错误往往不同,我们提出了一个简单而有效的跨模型调查方法,将外部验证器LM的隐藏的国家证据结合在一起。我们的方法极大地提高了三个LM家族自我一致错误的性。
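The distinction between self-consistent and inconsistent errors can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `classify_response`, the label strings, the `threshold` parameter, and the use of exact string matching as a stand-in for semantic equivalence are not taken from the paper.

```python
def classify_response(primary, samples, gold, threshold=1.0):
    """Label one primary answer given extra stochastic samples.

    Hypothetical labels (not the paper's terminology for code):
      "correct"               - the primary answer matches the gold answer
      "self_consistent_error" - wrong, and at least `threshold` of the samples
                                repeat the same wrong answer
      "inconsistent_error"    - wrong, and the samples disagree with it
    """
    def norm(text):
        # Exact match after trimming and lowercasing; a real setup would
        # need a semantic-equivalence check instead.
        return text.strip().lower()

    if norm(primary) == norm(gold):
        return "correct"
    agreement = sum(norm(s) == norm(primary) for s in samples) / len(samples)
    if agreement >= threshold:
        return "self_consistent_error"
    return "inconsistent_error"


def consistency_uncertainty(primary, samples):
    """Uncertainty score of a purely consistency-based detector (1 - agreement)."""
    agree = sum(s.strip().lower() == primary.strip().lower() for s in samples)
    return 1.0 - agree / len(samples)
```

Under these assumptions, a wrong answer that every sample repeats gets an uncertainty of 0.0 from `consistency_uncertainty`, which is why a detector relying only on sample disagreement cannot flag it.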
Article 149
Title@2025-05-29 (4): Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models
Title: Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models | Cross-Domain Zweisprachige Lexikoninduktion über vorgebildete Sprachmodelle | 通过预先培训语言模式的跨域双语双语双语 2505.23146v1 |
Authors: Qiuyu Ding, Zhiqiang Cao, Hailong Cao, Tiejun Zhao
Bilingual Lexicon Induction (BLI) generally uses common-domain data to obtain monolingual word embeddings, then aligns those embeddings to obtain cross-lingual embeddings from which word translation pairs are extracted. It is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As we can see in Table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On one hand, specialized-domain datasets are generally smaller than generic-domain datasets, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI, but in some specific fields the meaning of words is greatly influenced by context, so using only static word embeddings may lead to greater bias. In this paper, we propose a new BLI task: using monolingual corpora of the general domain and the target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method that builds on recent work on BLI to obtain better word embeddings. To this end, we are the first to introduce Code Switch (Qin et al., 2020) into the cross-domain BLI task; it can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves over robust BLI baselines on three specific domains by an average of 0.78 points.
双语语言感化( BLI) 通常以通用域数据为基础, 以获得单语词嵌入, 并调整单语词嵌入, 以获得用于获取双词翻译的跨语言嵌入。 在本文中, 我们提议使用 BLI 的新任务, 即使用通用域和目标域的单语版, 以提取特定域的双语字典。 在预科培训模式的能力的驱动下, 我们提议一种方法, 以BLI 最新工作为基础, 获得更好的文字嵌入。 这样, 我们首先在跨多语版 BLI 任务中引入代码切换( Qin et al., 2020) , 以获得跨语化嵌入的跨语言嵌入嵌入。 从表1, 经典有效的 BLI 和 Vecmap 方法, 在医学数据集上的表现比Wikiki 数据集要差得多得多。 一方面, 专门域数据集比通用域模型的缩入( Qinal liveral lide) 更精确, 在通用域域域域域中, 使用特定语言的LI 则显示特定语言, 的LILI 。
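The final step of the generic BLI pipeline described in this abstract, reading translation pairs off aligned embeddings, can be sketched as a nearest-neighbour lookup. This is a minimal illustration that assumes the two embedding tables have already been mapped into a shared space. The function names and the toy two-dimensional vectors are invented for the example, and this is not the paper's cross-domain method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def induce_lexicon(src_emb, tgt_emb):
    """Map each source word to its most similar target word.

    Both arguments are dicts from word to vector and are assumed to live
    in the same (already aligned) cross-lingual space.
    """
    return {
        src_word: max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        for src_word, src_vec in src_emb.items()
    }


# Toy, made-up embeddings purely for illustration.
source = {"heart": [0.9, 0.1], "drug": [0.1, 0.95]}
target = {"Herz": [0.8, 0.2], "Medikament": [0.2, 0.9]}
lexicon = induce_lexicon(source, target)
```

Plain nearest-neighbour retrieval is the simplest choice here; it makes visible why rare, context-dependent domain terms with poorly estimated static vectors are likely to be mistranslated.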
Article 150
Title@2025-05-29 (4): ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation
Title: ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation | Parammute: Unterdrückende wissenskritische FFNs für treue retrieval-erweiterte Generation | 分量:制止知识-关键FFFF,以用于忠实检索-养殖一代 2502.15543v2 |
Authors: Pengcheng Huang, Zhenghao Liu, Yukun Yan, Haiyan Zhao, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All code will be released via GitHub.
大型语言模型(LLMS)与检索增强的一代(RAG)相结合,通过将产出以外部证据为基础,提高了事实质量;然而,这些模型仍然容易被不忠实的一代所利用,尽管其相关性和准确性与检索的背景相矛盾; 旨在提高忠诚的现有方法主要侧重于加强外部环境的利用,但往往忽视内部参数知识在生成过程中的持续影响; 在这项工作中,我们调查不忠的一代背后的内部机制,并查明在此类情况下过度激活的中到深的反馈前沿网络(FFNs)的一组。 我们基于这一洞察力,建议通过FFFFFFM 禁止(ParamMute)进行分解(ParamMute)进行分解,这是一个框架,通过抑制不忠的与FFFFNFF的启动,将模型调整为检索知识的模型。 为了评估我们的方法,我们引入Cofaithful QA,这是专门用来评价内部知识与准确外部证据相冲突的情景中的忠实性的一个基准。 实验结果显示,Pammut 大大加强了CfafifQA 和CFASALIDILA的所有基础基础基础基础基础基础,实现这些基准的大幅降低。
Article 151
Title@2025-05-29 (4): Enhancing Large Language Models’ Machine Translation via Dynamic Focus Anchoring
Title: Enhancing Large Language Models’ Machine Translation via Dynamic Focus Anchoring | Verbesserung der Übersetzung großer Sprachmodelle durch Dynamic Focus Anchoring | 通过动态焦点拼接加强大语言模型的“Machine ”翻译 2505.23140v1 |
Authors: Qiuyu Ding, Zhiqiang Cao, Hailong Cao, Tiejun Zhao
Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs’ understanding of sentences and tasks, and can even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs’ MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. This efficiently activates LLMs to identify and apply relevant knowledge from their vast data pools, ensuring more accurate translations of difficult terms. On a benchmark MT dataset, our proposed method achieved competitive performance compared to multiple existing open-source MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs’ performance across multiple NLP tasks with minimal resource consumption.
大型语言模型显示了多种跨语言国家语言语言翻译(MT)任务的超常表现,包括机器翻译(MT),然而,在应对多语种语言敏感单位(CSUs)(CSUs)(多式单词)方面仍然存在持续的挑战。这些CSU不仅影响LLM的当地翻译准确性,而且影响LLMs对判决和任务的理解能力,甚至导致翻译失败。为了解决这一问题,我们提出了一个简单而有效的方法,通过获得CSUs和运用语义重点,提高LLMS的MS MT能力。具体地说,我们动态分析和确定翻译挑战,然后以结构化的方式将其纳入LLMS(LS),以缓解信息平缓造成的CSUs的错误或误解。这些CSUS不仅影响LMS的当地翻译准确性,而且还影响LMS对当地翻译LMs的理解能力,甚至导致翻译失败。为了解决这一问题,我们提出了一种简单有效的方法,通过多种公开源MTMT的基线模型,实现竞争性的绩效。它显示了多种语言配对,包括类似的语言配方和远程语言配方语言和远程语言配方。 值得注意的是,拟议的方法不需要额外的多种模型,需要额外的使用。
Article 152
Title@2025-05-29 (4): CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
Title: CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction | CLEME2.0: Auf dem Weg zur Interpretierbaren Bewertung durch Entwirren von Edits für die Korrektur von Grammatikfehlern | CLEME2.0:通过拆分文体错误校正的编辑版实现可解释性评价 2407.00934v2 |
Authors: Jingheng Ye, Zishan Xu, Yinghui Li, Linlin Song, Qingyu Zhou, Hai-Tao Zheng, Ying Shen, Wenhao Jiang, Hong-Gee Kim, Ruitong Liu, Xin Su, Zifei Shan
The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge the gap, we introduce CLEME2.0, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. They collectively contribute to exposing critical qualities and locating drawbacks of GEC systems. Evaluating systems by combining these aspects also leads to superior human consistency over other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.
本文重点介绍了在以往研究中很少受到注意的表面错误校正(GEC)评价指标的可解释性,为弥补这一差距,我们引入了CLEME2.0,这是一个基于参考的衡量标准,描述了全球教育教管系统的四个基本方面:打击更正、错误纠正、纠正不足和过度纠正,它们共同有助于暴露全球教育教管系统的关键品质和定位缺陷。通过综合这些方面来评估系统还导致人类的一致性优于其他基于参考和不参考的衡量标准。关于两个人类判断数据集和六个参考数据集的广泛实验显示了我们的方法的有效性和可靠性,实现了新的最新结果。我们的代码在https://github.com/THUKElab/CLEME上发布。
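The four aspects named in this abstract can be made concrete with a small sketch. The definitions below are one plausible reading of the four names, and exact span matching is a deliberate simplification. The function `categorize_edits` and the edit representation (a dict from source span to replacement text) are assumptions for illustration, not the CLEME2.0 implementation.

```python
def categorize_edits(hyp_edits, ref_edits):
    """Sort system edits into four buckets by comparing them with reference edits.

    Both arguments map a source span (start, end) to its replacement string.
    Assumed reading of the four aspects:
      hit   - system edits the same span with the same replacement
      wrong - system edits the same span but with a different replacement
      under - reference edits a span the system left untouched
      over  - system edits a span the reference left untouched
    """
    buckets = {"hit": [], "wrong": [], "under": [], "over": []}
    for span, ref_text in ref_edits.items():
        if span not in hyp_edits:
            buckets["under"].append(span)
        elif hyp_edits[span] == ref_text:
            buckets["hit"].append(span)
        else:
            buckets["wrong"].append(span)
    for span in hyp_edits:
        if span not in ref_edits:
            buckets["over"].append(span)
    return buckets


# Made-up example edits over token positions of a hypothetical sentence.
reference = {(1, 2): "goes", (4, 5): "the", (7, 8): "were"}
hypothesis = {(1, 2): "goes", (4, 5): "a", (9, 10): "very"}
result = categorize_edits(hypothesis, reference)
```

Keeping the four counts separate, rather than folding them into one score, is what lets an evaluator see whether a system is too timid (many under-corrections) or too aggressive (many over-corrections).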
Article 153
Title@2025-05-29 (4): Learning to Reason under Off-Policy Guidance
Title: Learning to Reason under Off-Policy Guidance | Unter außerpolitischer Anleitung zur Vernunft lernen | 根据非政策指导学习理由 2504.14945v4 |
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently “on-policy”, limiting learning to a model’s own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
大型推理模型(LRMs)的近期进步表明,多步推理和自我反省等复杂行为可以通过以可核查的回报来强化学习~(\ textit{RLVR}) 。 但是,现有的\ textit{RLVR} 方法本质上是“ 政策性” , 将学习限制在模型自己的产出上, 并且没有获得超出其初始能力的推理能力。 为了解决这个问题, 我们引入了\ textbf{LUFFY} (\ textb{L}学习到理性的多步推理( textbf{U} ) 和自我反动( textb} ) 自我反动学习 。 但是, 现有的\ textitleitle{RLLLLVRRRRR} 方法本身, 通过常规性取样避免表面和僵硬性地模仿( RFF) 基础性分析结果, 与以往的RFF 平均的RFF 方法相比, 成功地展示了以往的RFF 水平优势。
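Why mixing off-policy demonstrations into the rollout group can help when on-policy RLVR fails may be easier to see with a small numeric sketch. It assumes GRPO-style group-normalized advantages computed over one combined group of on-policy and off-policy rewards. That assumption, the function name `group_advantages`, and the `eps` constant are illustrative and are not confirmed details of LUFFY; the policy-shaping term is omitted.

```python
from statistics import mean, pstdev


def group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """Group-normalized advantages over a mixed group of rollouts.

    Rewards from the model's own rollouts and from off-policy traces are
    pooled, then each reward is standardized against the pooled mean and
    standard deviation (an assumed, GRPO-style normalization).
    """
    rewards = list(on_policy_rewards) + list(off_policy_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A weak model whose own rollouts all fail: no learning signal at all.
only_on_policy = group_advantages([0.0, 0.0, 0.0], [])

# Adding one successful off-policy trace makes the advantages non-zero.
mixed = group_advantages([0.0, 0.0, 0.0], [1.0])
```

In the first call every advantage is zero, so a purely on-policy update has nothing to push on. In the second, the off-policy trace receives a positive advantage and the failed rollouts a negative one.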
Article 154
Title@2025-05-29 (4): EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models
Title: EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models | EarthSE: Ein Benchmark für die Bewertung der wissenschaftlichen Explorationsfähigkeit der Erde für große Sprachmodelle | EarthSE:大语言模型地球科学探索能力基准评估 2505.17139v2 |
Authors: Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai
Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for domains such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs’ capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.
现有基准要么显示没有地球科学特性的一般性科学重点,要么涵盖孤立的子领域,缺乏整体评价;此外,目前的基准通常忽视了对LLMs在开放科学探索方面的能力的评估;在本文件中,我们提出了地球科学的全面和专业基准,旨在评价LLMs在这一领域内科学探索的能力,从基本到高级不等;利用10万份研究论文,我们首先建立两个问题回答数据集:地球-Iron,该数据集为广泛评估提供了广泛的问题覆盖面;地球-Silver,其特点是评估专业深度的困难程度更高;这些数据集涵盖五个地球领域、114个学科和11个任务类别,评估对科学探索至关重要的基础知识;最值得注意的是,我们介绍了地球-Gold,采用新的计量,由专门为评估LMS在科学探索中的先进能力而设计的开放式多方向对话,包括介绍、限制分析,并展示了在11个领域中的现有科学基准能力。
Article 155
Title@2025-05-29 (4): Jailbreaking to Jailbreak
Title: Jailbreaking to Jailbreak | Gefängnisbruch zum Gefängnisbruch | 破门而入,破门而入, 2502.09638v2 |
Authors: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang
Large Language Models (LLMs) can be used to red team other models (e.g. by jailbreaking them) to elicit harmful content. Prior work commonly employs open-weight models or private uncensored models for jailbreaking, because refusal-trained strong LLMs (e.g. OpenAI o3) refuse to help with it; our work turns (almost) any black-box LLM into an attacker. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or taken from expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_2$ attackers transfer across almost all black-box models; 2) a $J_2$ attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasoning models, such as Sonnet-3.7, are strong $J_2$ attackers compared to others. For example, when used against the safeguard of GPT-4o, $J_2$ (Sonnet-3.7) achieves a 0.975 attack success rate (ASR), which matches expert human red teamers and surpasses the state-of-the-art algorithm-based attacks. Among $J_2$ attackers, $J_2$ (o3) achieves the highest ASR (0.605) against Sonnet-3.5, one of the most robust models.
大型语言模型(LLMS)可用于红队其他模型(如破狱)以获取有害内容。虽然先前的工作通常使用开放重量模型或私人未检查的模式来进行破狱,因为强大的LMS(如OpenAI o3)拒绝训练大型LMS(如OpenAI o3),拒绝帮助破狱,我们的工作(几乎)将任何黑盒LMS变成攻击者。由此产生的J_2美元(破门破门破门)袭击者可以使用由他们自己或专家人类红队制定的各种战略来有效打破目标模型的保障。我们这样做时,我们展示了他们强大但研究不足的破狱能力。我们的实验表明,1)用来在几乎所有黑盒模型中制造J_2美元袭击者转移的催化费用;2)一个$2美元的攻击者可以(几乎)自己破解一份黑盒,而这种脆弱性在过去12个月中迅速发展;3)Sonnet2-3型袭击者等理由模型(如Sonnet2-3型袭击者)比他人强2美元攻击者。例如用于GPT-47-S-3型攻击的1次攻击的S-SermaxxxxSxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Article 156
Title@2025-05-29 (4): REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
Title: REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space | REVS: Unlearning Sensible Information in Language Models via Rank Editing im Vokabelfeld | REVS:通过词汇空间排行编辑在语言模型中学习敏感信息 2406.09325v5 |
Authors: Tomer Ashuach, Martin Tutek, Yonatan Belinkov
Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
语言模型(LMS)有可能无意间将培训数据中看到的敏感信息或个人识别信息(PII)进行记忆和传播,从而引起隐私问题。目前解决这一问题的方法包括费用高昂的数据集清洗,或通过不学习和模式编辑进行模型过滤,可以通过抽取攻击绕过这些方法。我们提出REVS,这是从LMS中分离敏感信息的一种新型的非渐进方法。REVS确定并修改与构成敏感信息的组成标志有关的一小部分神经元。为了适当评估我们关于真正敏感信息的方法,我们整理了三个数据集:电子邮件和URL数据集,由模型自然记忆,以及一个综合的社会保障编号数据集,我们调整模型以进行记忆。与其他方法相比,REVS显示在不学习敏感信息方面的优异性以及精准性来提取攻击,同时保留基本模型的完整性。
Article 157
Title@2025-05-29 (4): GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
Title: GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning | GETReason: Bildkontext-Extraktion durch Hierarchische Multi-Agenten-Reasoning verbessern | GetReason:通过等级式多机构代理理由加强图像背景采掘 2505.21863v2 |
Authors: Shikhhar Siingh, Abhinav Rawat, Chitta Baral, Vivek Gupta
Publicly significant images from events hold valuable contextual information, crucial for journalism and education. However, existing methods often struggle to extract this relevance accurately. To address this, we introduce GETReason (Geospatial Event Temporal Reasoning), a framework that moves beyond surface-level image descriptions to infer deeper contextual meaning. We propose that extracting global event, temporal, and geospatial information enhances understanding of an image’s significance. Additionally, we introduce GREAT (Geospatial Reasoning and Event Accuracy with Temporal Alignment), a new metric for evaluating reasoning-based image understanding. Our layered multi-agent approach, assessed using a reasoning-weighted metric, demonstrates that meaningful insights can be inferred, effectively linking images to their broader event context.
活动产生的公众重要图像包含宝贵的背景信息,对新闻和教育至关重要。然而,现有的方法往往难以准确地得出这一相关性。为了解决这一问题,我们引入了GetReason(Geops空间事件时间原因),这是一个超越地表图像描述的框架,可以推断更深的背景含义。我们建议提取全球事件、时间和地理空间信息可以增进对图像意义的理解。此外,我们引入了大(Geos空间原因和时间调整中事件准确性),这是用于评估基于推理的图像理解的新衡量标准。我们使用推理加权衡量尺度评估的多层次多媒介方法表明,有意义的洞察力可以被推断出来,将图像与其更广泛的事件背景有效地联系起来。
Article 158
Title@2025-05-29 (4): LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Title: LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data | LongFaith: Verbesserung der Langkontext-Reasonierung in LLMs mit treuen synthetischen Daten | 长面:利用忠实合成数据加强LLMs中的长方理由 2502.12583v2 |
Authors: Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
尽管长长的大型语言模型(LLMS)不断发展,但依赖合成数据的以数据为中心的方法却受到与忠诚有关的问题的阻碍,这些问题限制了它们提高长文本推理和答题(QA)等任务示范性业绩的效力,这些挑战往往因缺乏核查、无归属推理和潜在的知识冲突造成的错误信息而加剧。我们提议长法思(LongFaith),这是综合忠实长文本推理教学数据集的新管道。我们综合了地面真相和基于引用推理的提示,消除了分散注意力并提高了推理链的准确性,从而减少了费用高昂的核查程序。我们开放源的两个综合数据集(LongFaith-SFT和LongFaith-PO)系统处理忠诚的多个层面,包括核实推理、归属和背景基础。关于多角度推理数据集和LongBench的广泛实验表明,这些数据集的模型经过精细调整,大大提高了绩效。我们进行的分析研究突出长Faith管道的可伸缩性和适应性,显示其在开发长晶体中的广泛适用性。
Article 159
Title@2025-05-29 (4): Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Title: Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context | Human-Readable Adversarial Prompts: Eine Untersuchung von LLM-Fehlern mit situationsbezogenem Kontext | 人类可以读取的反向提示:利用情况背景调查LLM脆弱性 2412.16359v3 |
Authors: Nilanjana Das, Edward Raff, Aman Chadha, Manas Gaur
As AI systems become deeply embedded in social media platforms, we’ve uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of LLMs before the general public uses them on social media platforms, to avoid any adverse impacts. Our work reveals that, unlike obviously nonsensical text strings that safety systems can easily catch, human-readable, situation-driven adversarial full-prompts that leverage situational context are effective and much harder to detect. We found that skilled attackers can exploit the vulnerabilities in open-source and proprietary LLMs to make a malicious user query appear safe to LLMs, resulting in a harmful response. This raises an important question about the vulnerabilities of LLMs. To measure robustness against human-readable attacks, which now present a potent threat, our research makes three major contributions. First, we developed attacks that use movie scripts as situational contextual frameworks, creating natural-looking full-prompts that trick LLMs into generating harmful content. Second, we developed a method to transform gibberish adversarial text into readable, innocuous content that still exploits vulnerabilities when used within the full-prompts. Finally, we enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts that significantly improve attack effectiveness against models like GPT-3.5-Turbo-0125 and Gemma-7b. Our findings show that these systems can be manipulated to operate beyond their intended ethical boundaries when presented with seemingly normal prompts that contain hidden adversarial elements. By identifying these vulnerabilities, we aim to drive the development of more robust safety mechanisms that can withstand sophisticated attacks in real-world applications.
随着AI系统深深嵌入社交媒体平台,我们发现了一个超越传统的对抗性攻击的关于安全脆弱性的发现。在公众在社交媒体平台上使用LLMS来避免任何不利影响之前评估LLMS的风险变得非常重要。与安全系统可以轻易捕捉的明显非敏感文本字符串不同,我们的工作揭示出,人类可以理解的情况驱动的对抗性全面刺激,利用情境背景,是有效的,但更难察觉到。我们发现,熟练的进攻者可以利用开放源码和专有LMS的弱点,使恶意用户查询LMS的安全性,从而产生有害的反应。这提出了公众对LLMS的脆弱性的一个重要问题。为了衡量对目前构成强大威胁的、可理解性攻击的力度,我们的研究作出了三大贡献。首先,我们开发了以电影脚本作为情景背景框架的进攻性攻击,创造了天然的全光谱,使LMS产生有害的内容。 其次,我们开发了一种方法,将易读的对LMS的对抗性文字转换成可读性文字, 也就是我们仍在利用SLMS-Proprevlexalal-al-de lafting laft laft laft laft laft laft laim laim lab laft laft laft laft laft laft laft lab labre laft laft laft laft laft labil laus laus lautus laus labil lautus lautus lautus lautus lautus lauts lautus labil laus lautus lauts lauts lauts lauts labil labil labil lauts
Article 160
Title@2025-05-29 (4): PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Title: PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics | PBEBench: Ein mehrstufiges Programmieren nach Beispielen, inspiriert von historischer Linguistik | PBEBench:根据历史语言推导的多层次方案拟定工作 2505.23126v1 |
Authors: Atharva Naik, Darsh Agrawal, Manav Kapadnis, Yuwei An, Yash Mathur, Carolyn Rose, David Mortensen
Recently, long chain-of-thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty, in order to tackle the scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set of nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.
最近,长长的思维链(LCOT),大语言模型(LLMS)以惊人的推理能力将机器学习世界带入风暴。然而,这些模型的抽象推理能力是否足以满足具有实际重要性的问题?与过去的工作不同,过去的工作主要侧重于数学、编码和数据交织,我们侧重于历史语言启发的感应推理问题,以实例为例拟订。我们开发了完全自动化的管道,以动态地为这项任务制定基准,并具有可控的困难,以便解决许多推理基准都受到的可缩和污染问题。我们利用管道制作了近1千例测试,对所有最先进的推理LMS都具有挑战性,而最佳模型(Claude-3.7-Sonnet)只达到54%的通过率,表明LCOTLMS仍然与在历史语言和其他许多领域普遍存在的等级或推理学。
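A Programming-by-Examples task of the kind this abstract describes can be sketched as finding an ordered cascade of rewrite rules that maps every input string to its output, much as ordered sound changes map ancestor forms to descendant forms. The representation below (a list of `(old, new)` substring replacements applied in sequence) and the function names are assumptions for illustration; the abstract does not specify the benchmark's actual program format.

```python
def apply_cascade(word, rules):
    """Apply an ordered list of (old, new) substring replacements to a word."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


def is_consistent(rules, examples):
    """Check that a candidate cascade reproduces every (input, output) example."""
    return all(apply_cascade(src, rules) == tgt for src, tgt in examples)


# Made-up toy task: the same two rules in different orders behave differently,
# because the first rule can create new input for the second.
examples = [("ab", "cc"), ("ba", "cc")]
feeding_order = [("a", "b"), ("b", "c")]
reverse_order = [("b", "c"), ("a", "b")]

ok = is_consistent(feeding_order, examples)
not_ok = is_consistent(reverse_order, examples)
```

Verification is cheap, as `is_consistent` shows. The hard part for a solver is the search: rule order matters, so the number of candidate cascades grows quickly with the number of rules.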
Article 161
Title@2025-05-29 (4): CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Title: CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark | CASS: Nvidia zu AMD Transpilation mit Daten, Modellen und Benchmark | CASS: Nvidia 到AMD 传输数据、模型和基准 2505.16968v3 |
Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA <-> HIP) and assembly-level (Nvidia SASS <-> AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
我们引入了CASS, 这是首个用于跨建筑化 GPU 代码转换的大型数据集和模型套件, 针对源级( CUDA < - > HIP) 和组装级( Nvidia SASSS < - > AMD RDNA3) 翻译。 该数据集由70k经核实的对数组成, 跨越主机和装置, 解决低级别 GPU 代码可移植性的重大差距。 利用此资源, 我们培训 CASS 群域域语言模型, 实现95% 源翻译准确性和37.5% 组装翻译准确性, 大大超过 GPT-4o、 Claude 和 Hipifify等商业基线。 我们生成的代码匹配了85%以上测试案例的本地性能, 保存运行时间和记忆行为。 为了支持严格的评估, 我们引入了 CASS- Bench, 一个覆盖16 GPU 域域的曲线基准, 并带有地标执行。 所有的数据、 模型和评价工具都作为公开来源发布, 以促进 GPUPU 工具的编译、 和 LLM 制硬件翻译的进展 。
Article 162
Title@2025-05-29 (4): Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging
Title: Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging | Verbesserung des Brain-to-Image-Reconstructions durch feinkörnige Text-Bridging | 通过完善的文本连接改进脑到图像重建 2505.22150v2 |
Authors: Runze Xia, Shuo Feng, Renzhi Wang, Congchi Yin, Xuyun Wen, Piji Li
Brain-to-Image reconstruction aims to recover visual stimuli perceived by humans from brain activity. However, the reconstructed visual stimuli often suffer from missing details and semantic inconsistencies, which may be attributed to insufficient semantic information. To address this issue, we propose an approach named Fine-grained Brain-to-Image reconstruction (FgB2I), which employs fine-grained text as a bridge to improve image reconstruction. FgB2I comprises three key stages: detail enhancement, decoding fine-grained text descriptions, and text-bridged brain-to-image reconstruction. In the detail-enhancement stage, we leverage large vision-language models to generate fine-grained captions for visual stimuli and experimentally validate their importance. We propose three reward metrics (object accuracy, text-image semantic similarity, and image-image semantic similarity) to guide the language model in decoding fine-grained text descriptions from fMRI signals. The fine-grained text descriptions can be integrated into existing reconstruction methods to achieve fine-grained Brain-to-Image reconstruction.
脑到图像重建旨在恢复人类从大脑活动中感觉到的视觉刺激。然而,重建的视觉刺激往往缺少细节和语义不一致,这可能是由于语义信息不足造成的。为了解决这一问题,我们提议了一种名为“精度脑到图像重建”(FgB2I)的方法,它使用细度文字作为桥梁来改善图像重建。 FgB2I 由三个关键阶段组成: 细节增强、解码细度文字描述和文本缩略脑到图像重建。在细节强化阶段,我们利用大型的视觉语言模型生成精度的字幕,用于视觉刺激和实验性地验证其重要性。我们提出三种奖励性指标(斜度精度、文字图像图像图像图像相似性),用以指导语言模型从 FMRI 信号中解析精度文字描述。精度文本描述可以纳入到现有的重建方法中,以便实现精度的大脑重建。
Article 163
Title@2025-05-29 (4): ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Title: ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations | ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche | 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v1 |
Authors: Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models have weak multi-turn interaction capability, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced soon. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results show that ContextQFormer achieves an improvement of 2%-4% in available rate over the baselines.
多式大型语言模型显示了显著的零射能力和强大的图像理解能力。然而,现有的开放源码多模式模型由于多方向互动能力薄弱而受到影响,特别是在长期背景下。为了解决这个问题,我们首先引入了一个背景模型模块,称为CeanQFormer,该模块利用记忆块加强背景信息的列报。此外,为了便于进一步研究,我们谨慎地为培训前、教学调整和评价建立一个新的多方向多模式对话数据集(TMDialog),该数据集将最近公开提供。与其他多模式对话数据集相比,TMdialog包含较长的谈话,支持多方向多模式对话的研究。此外,CentricFormer与TMdilog的三个基线和实验结果进行比较,说明CentricalQFormer比基线提高了2%-4%。
Article 164
Title@2025-05-29 (4): Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Title: Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios | Elicit und Enhance: Multimodale Reasoning in medizinischen Szenarien fördern | 明确和强化:推进医疗假想中的多式联运理由 2505.23118v1 |
Authors: Linjie Mu, Zhongzhen Huang, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE$^2$, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model’s reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE$^2$ in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE$^2$ consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
有效的临床决策取决于多种证据来源的迭代、多式推理,最近出现的多式推理模型大大改变了解决复杂任务的格局。虽然这些模型在数学和科学方面取得了显著成功,但其在医疗领域的应用仍然未得到充分探讨。在这项工作中,我们提出\ textit{MedE$2$},这是一个分为两阶段的训练后管道,可以引出并随后加强医疗领域的多式推理。在第一阶段,我们用2 000个纯文本数据样本对模型进行微调,其中含有精确精心策划的推理演示,以引出推理行为。在第二阶段,我们进一步加强模型推理能力,使用1 500个严格整理的多式医疗案例,使模型推理产出与我们提议的多式医学推理偏好相一致。广泛的实验表明,在改进医学多式联运模型的推理性能方面,\textit{MedE$2$}是有效和可靠的。值得注意的是,在多个医学多式联运基准中,经过培训的模型始终超越了基准。在较大模型上和在时间范围内进一步论证,进一步证实了我们的方法的稳健性和实用性。
Article 165
Title@2025-05-29 (4): Learning to Reason from Feedback at Test-Time
Title: Learning to Reason from Feedback at Test-Time | Von Feedback bei Test-Time zur Vernunft lernen | 从测试时的反馈中学习到理由 2502.15771v2 |
Authors: Yanyang Li, Michael Lyu, Liwei Wang
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
在一次尝试中解决复杂任务对大型语言模型(LLMs)来说具有挑战性,要取得成功,往往需要与环境的迭代互动和反馈,使有效的反馈利用成为一个关键议题。现有办法要么与时间的概括斗争,要么依靠天真重整而不利用先前的信息。在本文中,我们引入了FTTT,这是一个创新的范例,将反馈利用作为测试时的一个优化问题。此外,我们提议了一个可学习的测试-时间优化器OpTune,以有效利用反馈。在四个推理数据集中对两个LMs的实验表明,FTTT和OpTune实现了更高的可扩展性和性。
Article 166
Title@2025-05-29 (4): Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data
Title: Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data | Datensatzkartographie für großsprachliche Modellausrichtung: Mapping und Diagnose von Präferenzdaten | 用于大语言模型对齐的数据集制图:绘图和诊断优先数据 2505.23114v1 |
Authors: Seohyeong Lee, Eunwon Kim, Hwaran Lee, Buru Chang
Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
人类偏好数据在使大型语言模型(LLMs)与人类价值相匹配方面发挥着关键作用。然而,收集这些数据往往费用昂贵,效率低,对可缩放性构成重大挑战。为了解决这个问题,我们采用了GPT-4辅助工具“对齐数据地图”,即用于分析和诊断偏爱数据的GPT-4o工具。使用GPT-4o作为LLM对齐的代理,我们计算LLM产生的对现有偏爱数据集指示的响应对齐分。然后,这些分数用于根据数据平均值和差异构建“对齐数据地图”。我们的实验显示,只有33%的数据,特别是中度、低差异区域的样本,其性能与整个数据集相近或更好。这一发现表明,“对齐数据地图”可以显著提高数据收集效率,为LM对齐确定高质量的样本,而无需明确的说明。此外,“对齐数据地图”可以诊断现有的偏好数据集。我们的分析显示,它能够有效检测低影响或潜在误评的样本。源码可以在线查阅。
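The selection step described above (score each sample several times, summarize by mean and variance, keep the high-mean, low-variance region) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy scores, and the lexicographic ranking rule (highest mean first, ties broken by lowest variance) are hypothetical and are not taken from the paper, which may define the region differently.

```python
from statistics import mean, pvariance


def build_data_map(scores_by_sample):
    """Map each sample id to (mean, variance) of its alignment scores."""
    return {sid: (mean(s), pvariance(s)) for sid, s in scores_by_sample.items()}


def select_high_mean_low_variance(data_map, keep_fraction=0.33):
    """Rank by highest mean, then lowest variance, and keep the top fraction.

    The ranking rule is an illustrative assumption, not the paper's rule.
    """
    ranked = sorted(data_map, key=lambda sid: (-data_map[sid][0], data_map[sid][1]))
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy alignment scores (hypothetical): three scored responses per sample.
scores = {
    "a": [9, 9, 8],  # high mean, low variance
    "b": [9, 2, 5],  # medium mean, high variance
    "c": [3, 3, 4],  # low mean, low variance
}
data_map = build_data_map(scores)
selected = select_high_mean_low_variance(data_map, keep_fraction=0.33)
```

With these toy scores only sample "a" is kept, since it has the highest mean and one third of three samples rounds to one.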
Article 167
Title@2025-05-29 (4): C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation
Title: C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | C$^2$LEVA: Auf dem Weg zu einer umfassenden und kontaminationsfreien Sprachmodellbewertung | C$$2$LEVA:努力实现全面和无污染、无污染的无语言模式评价 2412.04947v3 |
Authors: Yanyang Li, Tin Long Wong, Cheung To Hung, Jianqiao Zhao, Duo Zheng, Ka Wai Liu, Michael R. Lyu, Liwei Wang
Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data. To address this issue, we present C$^2$LEVA, a comprehensive bilingual benchmark featuring systematic contamination prevention. C$^2$LEVA firstly offers a holistic evaluation encompassing 22 tasks, each targeting a specific application or ability of LLMs, and secondly a trustworthy assessment due to our contamination-free tasks, ensured by a systematic contamination prevention strategy that fully automates test data renewal and enforces data protection during benchmark data release. Our large-scale evaluation of 15 open-source and proprietary models demonstrates the effectiveness of C$^2$LEVA.
在大型语言模型(LLMs)方面最近取得的进展大有希望,但其评价却引起人们的关切,特别是由于无法获得专有培训数据而导致的数据污染问题,为解决这一问题,我们提出一个全面的双语基准,以系统预防污染。 C$2$LEVA首先提供一项全面的评估,涵盖22项任务,每个任务都针对LLMs的具体应用或能力,而其次则提供一项可靠的评估,由于我们无污染的任务,这一评估得到了系统的污染预防战略的保证,该战略充分自动化地测试数据更新,并在基准数据发布期间实施数据保护。我们对15个开放源和专有模型的大规模评估证明了C$2LEVA的有效性。
Article 168
Title@2025-05-29 (4): FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article
Title: FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article | FutureGen: LLM-RAG Ansatz zur Generierung der zukünftigen Arbeit des wissenschaftlichen Artikels | FutureGen:LLM-RAG 产生科学条款未来工作的方法 2503.16561v2 |
Authors: Ibrahim Al Azher, Miftahul Jannat Mokarrama, Zhishuai Guo, Sagnik Ray Choudhury, Hamed Alhoori
The future work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. This section serves as a valuable resource for early-career researchers seeking unexplored areas and experienced researchers looking for new projects or collaborations. In this study, we generate future work suggestions from key sections of a scientific article alongside related papers and analyze how the trends have evolved. We experimented with various Large Language Models (LLMs) and integrated Retrieval-Augmented Generation (RAG) to enhance the generation process. We incorporate an LLM feedback mechanism to improve the quality of the generated content and propose an LLM-as-a-judge approach for evaluation. Our results demonstrated that the RAG-based approach with LLM feedback outperforms other methods evaluated through qualitative and quantitative metrics. Moreover, we conduct a human evaluation to assess the LLM as an extractor and judge. The code and dataset for this project are here, code: HuggingFace
科学文章的未来工作章节通过查明当前研究的差距和局限性,概述了潜在的研究方向,概述了未来研究方向。本节是早期职业研究人员寻找未探索领域和有经验的研究人员寻找新项目或协作的宝贵资源。在本研究报告中,我们从科学文章的关键部分提出未来工作建议,并结合相关论文分析趋势如何演变。我们试验了各种大语言模型和综合检索-启动一代(RAG),以加强生成过程。我们采用了LLLM反馈机制,以提高生成内容的质量,并提出LLM-as-a-judge-评价方法。我们的成果表明,以LLM反馈为基础的RAG方法超越了通过定性和定量指标评估的其他方法。此外,我们进行了人类评估,以评估LLM作为提取器和评判器。这个项目的代码和数据集在这里,代码是:HuggingFace:HuggingFace。
Article 169
Title@2025-05-29 (4): LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study | LLM trifft Szenegraph: Können große Sprachmodelle Szenengraphen verstehen und generieren? Eine Benchmark- und Empirische Studie | LLM 满足景象图:大语言模型能够理解和产生景象图吗? 基准和经验研究 2505.19510v2 |
Authors: Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs’ ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs’ ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
大型语言模型(LLMS)的非凡推理和概括能力为这些应用在体现的AI、机器人和其他现实世界任务中的扩展铺平了道路。为了有效支持这些应用,必须在多式联运环境中以空间和时间理解为基础进行空间和时间理解。为此,最近的工作利用了场景图,即结构化的图解,将实体、属性及其关系编码在一个场景中。然而,对LLMS使用场景图的能力的全面评价仍然有限。在这项工作中,我们引入了Text-Scene Graph(TSG)座,这是一个基准,旨在系统评估LLMS(1)理解场景图和(2)从文本描述中产生这些应用的能力。我们与TSG的Thormas一道评估了11 LLMS,并揭示了这些模型虽然在现场图解析方面表现良好,但与场景图的生成,特别是复杂的叙述相冲突。我们的分析表明,这些模型未能有效地将离场景场景的场景图像从复杂的叙述中分离出来,在产生场景图时导致瓶颈。这些结果突出表明,我们需要改进地图的制作方法,并为未来研究提供宝贵的洞洞察洞察。这些方法,并为未来的研究提供宝贵的洞见。我们的基准演示。我们的数据演示图的演示图的演示在http/档案的演示图。在http上可以查阅。
Article 170
Title@2025-05-29 (4): Generating Diverse Training Samples for Relation Extraction with Large Language Models
Title: Generating Diverse Training Samples for Relation Extraction with Large Language Models | Erzeugen von unterschiedlichen Trainingsbeispielen für die Beziehungsextraktion mit großen Sprachmodellen | 生成多种培训样本,用于与大语言模式的抽取关系 2505.23108v1 |
Authors: Zexuan Li, Hongliang Dai, Piji Li
Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero- or few-shot NLP tasks. However, many problems remain to be investigated in this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.
使用大语言模型(LLMS)来生成培训数据,可能是改进零度或短程NLP任务的最佳方法,但是,在这方面还有许多问题有待调查。关于关系提取(RE)的任务,我们发现直接促动LLMS产生的样品在结构上很容易具有很高的相似性。它们往往使用有限的多种措词来表达一对实体之间的关系。因此,在本文件中,我们研究如何有效改进与RELM一起生成的培训样品的多样性,同时保持其正确性。我们首先试图通过直接在Intext Learning(ICL)的提示下发出指示,使LMS产生不同样品。然后,我们提出一种方法,微调LMMS用于通过直接参考优化(DPO)生成多样性培训样品。我们对常用的RE数据集的实验表明,这两种尝试都能够提高产生的培训数据的质量。我们还发现,与直接执行LMM(LM)相比,培训非LLM RE(ICM) 的模型及其生成样品可能会提高性。
Article 171
Title@2025-05-29 (4): Can We Predict Performance of Large Models across Vision-Language Tasks?
Title: Can We Predict Performance of Large Models across Vision-Language Tasks? | Können wir die Leistung großer Modelle über Vision-Language-Aufgaben hinweg voraussagen? | 我们能否预测大型模型在愿景-语言任务中的绩效? 2410.10112v2 |
Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
Evaluating large vision-language models (LVLMs) is very expensive, due to high computational cost and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, i.e., predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, which quickly reduces the prediction errors. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
评估大型视觉语言模型(LVLMS)非常昂贵,因为计算成本高,而且任务种类繁多。好消息是,如果我们已经看到一些绩效分数,我们也许能够推断出未知分数。在这项研究中,我们提出了一个新的框架,根据从其他LVMS或任务中观察到的分数来预测未知分数。我们首先将绩效预测作为矩阵完成任务来制定。具体地说,我们建造了一个稀薄的绩效矩阵($\boldsymbol{R}),每个输入单位$Rmn}代表美元模型在美元-美元数据集中的绩效分数。通过与Markov 链 Monte Carlo(MC)应用概率矩阵因子化(PMF),我们可以完成绩效矩阵,即预测未知分数。此外,我们根据MCMS评估了绩效预测的不确定性。 操作者可以首先评估未经测试的任务的模型,从而迅速减少预测错误。我们进一步引入了几项改进措施,以提高PMFMF值,用于观测业绩分数少的情景。我们进行实验时的精确性测测算的精确性,我们的数据测算的精确性测算。
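The matrix-completion framing above can be sketched in a few lines. Note the substitution: the abstract describes PMF with MCMC, whereas this sketch fits a regularized low-rank factorization with plain stochastic gradient descent, which gives point predictions only and no uncertainty estimates. The function names, hyperparameters, and toy scores are all hypothetical.

```python
import random


def factorize(observed, n_models, n_tasks, rank=2, lr=0.05, reg=0.01,
              epochs=2000, seed=0):
    """Fit R[m][n] ~ dot(U[m], V[n]) on the observed entries only, using SGD.

    This is a point-estimate stand-in for PMF with MCMC (an assumption made
    for brevity); it does not produce uncertainty estimates.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_tasks)]
    for _ in range(epochs):
        for (m, n), r in observed.items():
            err = r - sum(U[m][k] * V[n][k] for k in range(rank))
            for k in range(rank):
                u, v = U[m][k], V[n][k]
                # Gradient step on squared error with L2 regularization.
                U[m][k] += lr * (err * v - reg * u)
                V[n][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, m, n):
    """Predicted score of model m on dataset n."""
    return sum(u * v for u, v in zip(U[m], V[n]))


def rmse(U, V, observed):
    """Root-mean-square error on the observed entries."""
    sq = [(r - predict(U, V, m, n)) ** 2 for (m, n), r in observed.items()]
    return (sum(sq) / len(sq)) ** 0.5


# Sparse toy performance matrix: (model index, dataset index) -> score.
observed = {(0, 0): 0.8, (0, 1): 0.6, (1, 0): 0.7,
            (1, 2): 0.4, (2, 1): 0.5, (2, 2): 0.3}
U0, V0 = factorize(observed, 3, 3, epochs=0)  # untrained baseline
U, V = factorize(observed, 3, 3)              # trained factors
missing_score = predict(U, V, 0, 2)           # an entry that was never observed
```

A Bayesian treatment as in the abstract would instead draw many samples of U and V and use the spread of the resulting predictions to decide which untested tasks to evaluate first.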
Article 172
Title@2025-05-29 (4): Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models
Title: Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models | Automatische Übertragung für LLM-Tiers: Kosten- und Genauigkeitsoptimierung in großen Sprachmodellen | LLM Tiers 自动传输: 优化大语言模型的成本和准确度 2505.20921v2 |
Authors: Injae Na, Keonwoong Noh, Woohwan Jung
LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address the problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The starter selects the initial LLM tier expected to solve the given question, the generator produces a response using the LLM of the selected tier, and the judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across the top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
LLM供应商通常提供多种LLM等级,其性能和价格各不相同。随着NLP任务变得更加复杂和模块化,为每个子任务选择合适的LLM等级是平衡成本和性能的关键挑战。为了解决问题,我们引入LLM自动传输(LLM自动传输)框架,不经培训自动选择LLM等级。LLM-AT由启动人、发电机和法官组成。启动人选择最初的LLLM等级,预期解决特定问题,发电机利用选定等级的LLM作出响应,法官评估答复的有效性。如果答复无效,LLM-AT迭代升级为更高级模型,产生新的响应,并在获得有效回应之前进行重新评估。此外,我们提议精确度估计,在没有培训的情况下进行适当的LLM-AT初始一级选择。根据输入问题,准确度估计每一级LLMM的预期准确性,方法是计算过去类似查询的正确反应率,法官评估答复的有效性。如果答复无效,LLM-AT的迭代升级为更高级模型,则产生新的反应,产生新的反应,然后重新评价,直到获得有效的反应。此外,我们提议精确度估计LLM-AT-AT软件,在降低实际成本的情况下实现高效。
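The starter / generator / judge loop and the top-k accuracy estimate described above can be sketched as follows. This is a hedged illustration only: the record layout, the `threshold` parameter, the word-overlap similarity, and the toy `generate` and `judge` callables are hypothetical stand-ins, not the paper's implementation.

```python
def word_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase words (an assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def estimate_tier_accuracy(history, similarity, question, tier, k=3):
    """Valid-response rate of `tier` over its k past queries most similar to `question`."""
    records = [r for r in history if r["tier"] == tier]
    top = sorted(records, key=lambda r: similarity(question, r["question"]),
                 reverse=True)[:k]
    return sum(r["valid"] for r in top) / len(top) if top else 0.0


def answer_with_escalation(question, tiers, generate, judge, history,
                           similarity, threshold=0.5):
    """`tiers` is ordered from cheapest to strongest and must be non-empty."""
    # Starter: cheapest tier whose estimated accuracy clears the threshold;
    # fall back to the strongest tier if none does.
    start = len(tiers) - 1
    for i, tier in enumerate(tiers):
        if estimate_tier_accuracy(history, similarity, question, tier) >= threshold:
            start = i
            break
    # Generator + judge: escalate tier by tier until a response is judged valid.
    for tier in tiers[start:]:
        response = generate(tier, question)
        if judge(question, response):
            return tier, response
    return tiers[-1], response  # nothing passed: return the last attempt


history = [
    {"question": "sum two numbers", "tier": "small", "valid": False},
    {"question": "sum two numbers", "tier": "medium", "valid": True},
    {"question": "sum two numbers", "tier": "large", "valid": True},
]
tiers = ["small", "medium", "large"]
toy_generate = lambda tier, q: f"{tier}: answer to {q}"
toy_judge = lambda q, resp: resp.startswith("large")  # only the top tier passes
chosen_tier, chosen_response = answer_with_escalation(
    "sum three numbers", tiers, toy_generate, toy_judge, history, word_overlap)
```

In this toy run the starter skips the small tier (estimated accuracy 0.0), starts at medium, and escalates to large after the judge rejects the medium response.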
Article 173
Title@2025-05-29 (4): RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models
Title: RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models | RepCali: Hocheffizientes Feintuning über Darstellungskalibrierung im Latent Space für vortrainierte Sprachmodelle | RepCali:为预培训语言模型在冷藏空间进行高效的精微微调 Via代表比例校准 2505.08463v2 |
Authors: Fujun Zhang, Xiaoying Fan, XiangDong Su, Guanglai Gao
Fine-tuning pre-trained language models (PLMs) has become a dominant paradigm in applying PLMs to downstream tasks. However, with limited fine-tuning, PLMs still struggle with the discrepancies between the representation obtained from the PLMs’ encoder and the optimal input to the PLMs’ decoder. This paper tackles this challenge by learning to calibrate the representation of PLMs in the latent space. In the proposed representation calibration method (RepCali), we integrate a specific calibration block into the latent space after the encoder and use the calibrated output as the decoder input. The merits of the proposed RepCali include its universality to all PLMs with encoder-decoder architectures, its plug-and-play nature, and ease of implementation. Extensive experiments on 25 PLM-based models across 8 tasks (including both English and Chinese datasets) demonstrate that the proposed RepCali offers desirable enhancements to PLMs (including LLMs) and significantly improves the performance of downstream tasks. Comparison experiments across 4 benchmark tasks indicate that RepCali is superior to the representative fine-tuning baselines.
微调培训前语言模型(PLM)已成为将PLMS应用到下游任务的主导范例,然而,由于微调有限,PLMS仍然在与从PLM的编码器得到的演示和对PLM的解码器的最佳输入之间的差异作斗争。本文通过学习校准潜空间中PLMS的表示来应对这一挑战。在拟议的代表校准方法(RepCali)中,我们将特定的校准块并入编码器之后的潜藏空间,并将校准产出用作解码器投入。拟议的RepCali的优点包括它对所有带有编码器-解码器结构的PLMS的普遍性、它的插座和游戏性质以及执行的便利性。对基于25个PLM的模型(包括英文和中文数据集)进行的广泛实验表明,拟议的RepCali为PLMS(包括LMS)提供了适当的改进,并大大改进了下游任务的绩效。在4项基准任务中的比较实验表明,RepCali高于有代表性的精确基线。
Article 174
Title@2025-05-29 (4): SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation
Title: SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation | SimGRAG: Nutzung ähnlicher Subgraphen für Wissensgraphen Driven Retrieval-Augmented Generation | SimGRAG: 利用知识图形驱动回溯源的类似子集 2412.15272v2 |
Authors: Yuzheng Cai, Zhenyue Guo, Yiwen Pei, Wanrui Bian, Weiguo Zheng
Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To eliminate their hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources like knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It effectively addresses the challenge of aligning query texts and KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform queries into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that efficiently identifies the top-k subgraphs within 1-second on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification. Our code is available at https://github.com/YZ-Cai/SimGRAG.
大型语言模型(LLMS)的近期进展在各种任务中显示出令人印象深刻的多功能性。为了消除它们的幻觉,检索生成(RAG)已经成为一种强有力的方法,利用知识图(KGs)等外部知识来源。在本文件中,我们研究了KG驱动的RAG的任务,并提出了一部新颖的类似图表增强检索-启动生成(SimGRAG)方法。它有效地通过一个两阶段过程应对调和查询文本和KG结构的挑战:(1)查询到模式,它使用LM将查询转换成理想的图表模式;(2)模式到Subgraph,它用图形语义距离(GSD)衡量模式和候选子集之间的对齐。我们还开发了一种最优化的检索算法,在1秒以内有效识别10百万尺度KGs。广泛的实验显示,SimGAGAG超越了在问题解答和事实核查中所使用的KGG-驱动的RAG方法。我们的代码可在 http://AGRMS/AGRZ。
Article 175
Title@2025-05-29 (4): MAP: Revisiting Weight Decomposition for Low-Rank Adaptation
Title: MAP: Revisiting Weight Decomposition for Low-Rank Adaptation | KARTE: Wiederbesuchen der Gewichtsverringerung für Low-Rank-Anpassung | MAP: 重新审视低浓度适应的重量分解 2505.23094v1 |
Authors: Chongjie Si, Zhiyi Shi, Yadao Wang, Xiaokang Yang, Susanto Rahardja, Wei Shen
The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupled with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
大型语言模型的迅速发展使自然语言处理发生了革命性的变化,但是它们的微调仍然在计算上昂贵,阻碍了广泛的部署。参数效率微调方法,如LORA,已经成为一种解决办法。最近的工作,例如DoRA试图将重量调整进一步分解成方向和量级组成部分。然而,现有的配方往往在柱级上以超自然方式界定方向,缺乏一个原则的几何基础。在本文件中,我们建议MAP这个新框架将重量矩阵重新作为高维矢量矢量,并严格地将其调整为方向和规模。MAP使预先训练的重量正常化,学习方向性更新,并引入两个标量系数,独立地测量基的大小,并更新矢量。这种设计可以使更可解释和灵活地适应,并且可以顺利地融入现有的PEFT方法。广泛的实验表明,MAP在与现有方法结合时大大改进了业绩,为现有的PEFT方法提供了简单而有力的改进。鉴于MAP的普及性和简洁性,我们希望它能够作为未来的默认设置PEFT方法。
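One possible reading of the decomposition above is: treat each weight matrix as a single flattened vector, normalize the pre-trained weights and the update to unit length, and rescale each with its own scalar. The sketch below implements that reading under stated assumptions; the exact formula, the names `alpha` and `beta`, and the choice to normalize the update as well as the base are not confirmed by the abstract. In a real PEFT setup the update would typically be a trainable low-rank product, and both scalars would be learned.

```python
import math


def frobenius_norm(matrix):
    """Norm of the matrix viewed as one flattened high-dimensional vector."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def map_style_merge(w0, delta, alpha, beta, eps=1e-12):
    """Return alpha * w0 / ||w0|| + beta * delta / ||delta||, element-wise.

    alpha scales the magnitude of the (normalized) base weights and beta the
    magnitude of the (normalized) update direction. This formula is an
    illustrative assumption, not the paper's exact parameterization.
    """
    n0 = frobenius_norm(w0) + eps
    nd = frobenius_norm(delta) + eps
    return [[alpha * a / n0 + beta * d / nd for a, d in zip(r0, rd)]
            for r0, rd in zip(w0, delta)]


w0 = [[3.0, 0.0], [0.0, 4.0]]     # toy pre-trained weights, norm 5
delta = [[0.0, 1.0], [0.0, 0.0]]  # toy update direction, norm 1
merged = map_style_merge(w0, delta, alpha=5.0, beta=0.5)
```

Setting `alpha` to the original norm (5 here) and `beta` to 0 recovers the pre-trained weights, which suggests a natural initialization for the two scalars.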
Article 176
Title@2025-05-29 (4): Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Title: Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen | Infi-MMMR:通过在多模式小型语言模式中分阶段强化学习,以课程为基础解锁多模式原因 2505.23091v1 |
Authors: Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, Fei Wu, Hongxia Yang
Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model’s logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).
大型语言模型(LLMs)的最新进展表明其推理能力取得了长足进步,例如DeepSeek-R1利用基于规则的强化学习显著增强了逻辑推理能力。然而,将这些成果推广到多模态大型语言模型(MLLMs)面临严峻挑战,而对于基础推理能力通常较弱的多模态小型语言模型(MSLMs),这些挑战往往更加突出:(1)高质量多模态推理数据集稀缺;(2)引入视觉处理导致推理能力退化;(3)直接应用强化学习可能产生复杂但错误的推理过程。为应对这些挑战,我们设计了新框架Infi-MMR,通过由三个精心设计的阶段组成的课程系统地释放MSLMs的推理潜力,并提出多模态推理模型Infi-MMR-3B。第一阶段为基础推理激活,利用高质量的文本推理数据集激活并强化模型的逻辑推理能力。第二阶段为跨模态推理适应,利用带字幕增强的多模态数据促进推理能力向多模态情境的逐步迁移。第三阶段为多模态推理增强,使用经过筛选的无字幕多模态数据来减轻语言偏差并促进稳健的跨模态推理。Infi-MMR-3B同时取得了最先进的多模态数学推理能力(MathVerse testmini上为43.68%,MathVision test上为27.04%,OlympiadBench上为21.33%)和通用推理能力(MathVista testmini上为67.2%)。
Article 177
Title@2025-05-29 (4): Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Title: Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport | Document-Level Text Generierung mit minimalen Bayes Risikodekodierung mit optimalem Transport | 采用最佳运输方式,以文件水平生成具有最低比值风险解码的文本 2505.23078v1 |
Authors: Yuu Jinnai
Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. Experimental results show that MBR-OT outperforms standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
文件级文本生成任务被认为比句级文本生成任务更为困难,因为它们需要理解较长的上下文才能产生高质量的文本。 在本文件中,我们调查对文件级文本生成任务的最小海湾风险(MBR)解码的调整。 MBR 解码使用一种实用功能来估计从一组候选产出中产生最大预期效用的产出。虽然MBR 解码在一系列广泛的判决级文本生成任务中证明是有效的,但它在文件级文本生成任务中的性能有限,因为许多用于评价句子效用的通用功能设计了它。为此,我们提议MBR-OT,一个使用瓦塞尔斯坦远程进行MBR解码的变体,用句级功能来计算文件的效用。实验结果表明,MBR-OT的性能在文件级机器翻译、文本简化和密集图像说明任务中超过了标准MBR的性能。我们的代码可在 https://github.com/jinnayu/mbr-optimal-travelyer查阅。
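The idea above (pick the candidate with the highest expected utility, where document-level utility is built from a sentence-level utility via optimal transport) can be sketched under simplifying assumptions. This sketch assumes uniform sentence weights and equal sentence counts, in which case the optimal transport plan reduces to a one-to-one matching that can be found by brute force over permutations; maximizing matched utility is then equivalent to minimizing transport cost with cost 1 - utility. The Jaccard word overlap is a toy stand-in for a real sentence-level metric, and all names are hypothetical rather than taken from the linked repository.

```python
from itertools import permutations


def sentence_utility(a, b):
    """Toy sentence-level utility: Jaccard overlap of lowercase words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def document_utility(doc_a, doc_b):
    """Average sentence utility under the best one-to-one sentence matching.

    Assumes equal sentence counts and uniform weights (a simplification);
    brute force over permutations is only practical for short documents.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles equal sentence counts")
    best = max(
        sum(sentence_utility(a, doc_b[j]) for a, j in zip(doc_a, perm))
        for perm in permutations(range(len(doc_b)))
    )
    return best / len(doc_a)


def mbr_decode(candidates, utility):
    """Return the candidate with the highest mean utility against the others."""
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)
    return candidates[max(range(len(candidates)), key=expected_utility)]


# Toy candidates: each document is a list of sentences.
c1 = ["the cat sat", "it was happy"]
c2 = ["it was happy", "the cat sat"]  # same sentences, different order
c3 = ["a dog ran", "it was sad"]
best = mbr_decode([c1, c2, c3], document_utility)
```

Because the matching ignores sentence order, c1 and c2 score 1.0 against each other, so the outlier c3 is never selected; ties are resolved in favor of the earlier candidate.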
Article 178
Title@2025-05-29 (4): Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation
Title: Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation | Kontextualisierte automatische Spracherkennung mit dynamischer Vokabelvorhersage und Aktivierung | 具有动态词汇预测和启动功能的实用自动语音识别 2505.23077v1 |
Authors: Zhennan Lin, Kaixun Huang, Wei Ren, Linju Yang, Lei Xie
Deep biasing improves automatic speech recognition (ASR) performance by incorporating contextual phrases. However, most existing methods enhance subwords in a contextual phrase as independent units, potentially compromising contextual phrase integrity, leading to accuracy reduction. In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. We introduce architectural optimizations and integrate a bias loss to extend phrase-level predictions based on frame-level outputs. We also introduce a confidence-activated decoding method that ensures the complete output of contextual phrases while suppressing incorrect bias. Experiments on Librispeech and Wenetspeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49% compared to baseline, with the WER on contextual phrases decreasing relatively by 72.04% and 75.69%.
深度偏差通过纳入上下文语句来提高自动语音识别(ASR)的性能。 但是,大多数现有方法都加强了上下文语句中的子字,作为独立单位,可能会损害上下文语句的完整性,导致准确性降低。在本文中,我们建议采用基于编码器的语句级背景化 ASR 方法,利用动态词汇预测和激活。我们引入了建筑优化,并整合了偏差,以根据框架语句产出扩展语句水平预测。我们还引入了一种信任驱动解码方法,确保上下文语句的完整输出,同时抑制不正确的偏差。利布里斯派奇和韦涅茨皮奇数据集的实验表明,我们的方法比基线实现了相对的WER减少28.31%和23.49%,而关于背景语句的WER则相对减少72.04%和75.69%。
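The results above are stated as relative WER reductions. As a reminder of how such figures are computed, here is a small sketch of word error rate and relative reduction. The sentences and WER values in the example are invented for illustration and are not taken from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

if __name__ == "__main__":
    # Illustrative values only.
    print(wer("turn on the kitchen light", "turn on kitchen lights"))
    print(relative_reduction(10.0, 7.5))
```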
Article 179
Title@2025-05-29 (4): Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
Title: Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts | Shortcut-verbundene Experten-Parallelität für die Beschleunigung von Mixture-of-Experts | 用于加速混合专家模型的捷径连接专家并行 2404.05019v3 |
Authors: Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang
Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. Although existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49 times in training and 1.82 times in inference. Moreover, our experiments and analyses indicate that ScMoE achieves model quality comparable to, and in some instances better than, that of existing approaches.
专家的平行性已成为一种关键战略,用于通过多种装置分配分散的分散专家混合模型的计算工作量,从而能够处理越来越大规模的模型。然而,专家平行性所固有的 “ 人人交流 “ 构成了一个很大的瓶颈,限制了教育部模式的效率。虽然现有的优化方法在一定程度上缓解了这一问题,但它们仍然受到通信和计算操作之间依次依赖的制约。为了应对这一挑战,我们提议ScMoE,这是一个与重叠的平行战略相结合的新颖的、与捷径相连的教育部结构。ScMoE从常规顺序排序中解析通信,使计算重叠率达到100%。与普遍的上层-2教育部基线相比,ScMoE在培训中实现了1.49倍的加速率,在推断中实现了1.82倍的加速率。此外,我们的实验和分析表明,ScMoE不仅取得了可比较的结果,而且在某些情况下超过了现有方法的模型质量。
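To see why overlapping communication with computation matters, consider a toy timing model. This is an illustration only: the millisecond figures are invented, the function name is hypothetical, and real speedups depend on hardware and model configuration.

```python
def step_time_ms(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Time for one MoE block when a fraction `overlap` (0..1) of the
    All-to-All communication runs concurrently with computation."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    # Communication can only be hidden behind computation that actually exists.
    hidden = min(comm_ms * overlap, compute_ms)
    return compute_ms + comm_ms - hidden

if __name__ == "__main__":
    sequential = step_time_ms(10.0, 6.0, overlap=0.0)  # communication waits for compute
    overlapped = step_time_ms(10.0, 6.0, overlap=1.0)  # communication fully hidden
    print(f"speedup: {sequential / overlapped:.2f}x")
```

In this model the benefit of full overlap is bounded by the ratio of communication to computation time, which is why the gain differs between training and inference workloads.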
Article 180
Title@2025-05-29 (4): SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models
Title: SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models | SORSA: Singuläre Werte und Orthonormale Regularisierte Singuläre Vektoren Anpassung großer Sprachmodelle | SORSA: 单项价值和正正正的正规化的单项矢量,以适应大语言模式 2409.00055v6 |
Authors: Yang Cao, Zhao Song
In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel parameter-efficient fine-tuning (PEFT) method. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r \text{diag}(S_r) V^\top_r$. These parts are initialized by performing singular value decomposition (SVD) on pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which we prove could decrease the condition number of $W_p$ and make the optimization more efficient. SORSA adapters could be merged during inference, thus eliminating any inference latency. We also introduce a method to analyze the variation of the parameters by performing SVD, and we discuss and analyze SORSA's superiority in minimizing the alteration in the SVD aspect. In our experiments, SORSA converges faster than LoRA and PiSSA. On the GSM-8K benchmark, Llama 2 7B adapted using SORSA achieved 56.03\% accuracy, surpassing LoRA (42.30\%) and Full FT (49.05\%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance.
在本文中,我们提出奇异值与正交正则化奇异向量适配(Singular Values and Orthonormal Regularized Singular Vectors Adaptation,简称SORSA),一种新的参数高效微调(PEFT)方法。每个SORSA适配器由两个主要部分组成:可训练的主奇异权重 $W_p = U_p \text{diag}(S_p) V^\top_p$,以及冻结的残差权重 $W_r = U_r \text{diag}(S_r) V^\top_r$。这些部分通过对预训练权重执行奇异值分解(SVD)来初始化。此外,我们实现并分析了一个正交正则化器,并证明它可以降低 $W_p$ 的条件数,使优化更加高效。SORSA适配器可以在推理时合并,从而消除额外的推理延迟。我们还引入了一种通过执行SVD来分析参数变化的方法,并讨论和分析了SORSA在最小化SVD层面改动方面的优势。在我们的实验中,SORSA的收敛速度快于LoRA和PiSSA。在GSM-8K基准上,使用SORSA适配的Llama 2 7B达到了56.03%的准确率,超过了LoRA(42.30%)和全量微调(49.05%)。我们的结论是,SORSA为参数高效微调提供了新的视角,并展现出卓越的性能。
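The initialization described above can be written out explicitly. The split of the SVD into principal and residual parts follows the abstract; the rank $r$, the index notation, and in particular the exact form of the regularizer are assumptions made here for illustration, since the abstract does not state them.

```latex
% SVD of a pre-trained weight matrix W (singular values sorted in descending order)
W = U \,\text{diag}(S)\, V^\top

% Principal part (trainable): top-r singular triplets
W_p = U_p \,\text{diag}(S_p)\, V_p^\top, \qquad
U_p = U_{[:,\,1:r]},\; S_p = S_{[1:r]},\; V_p = V_{[:,\,1:r]}

% Residual part (frozen): the remaining singular triplets
W_r = U_r \,\text{diag}(S_r)\, V_r^\top, \qquad W = W_p + W_r

% One plausible orthonormal regularizer (assumed form): keep the columns of
% U_p and V_p close to orthonormal during training
\mathcal{L}_{\text{reg}} = \lVert U_p^\top U_p - I_r \rVert_F^2
                         + \lVert V_p^\top V_p - I_r \rVert_F^2

% Condition number referred to in the abstract
\kappa(W_p) = \frac{\sigma_{\max}(W_p)}{\sigma_{\min}(W_p)}
```

Because $W = W_p + W_r$ holds at initialization, the adapted model starts from exactly the pre-trained weights, and after training $W_p$ can be added back into a single matrix, which is consistent with the abstract's claim that the adapter can be merged for inference.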
Article 181
Title@2025-05-29 (4): SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services
Title: SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | SNS-Bench-VL: Benchmarking multimodaler Großsprachenmodelle in Social Networking Services | SNS-Bench-VL:确定社会联网服务中多式大语言模式基准 2505.23065v1 |
Authors: Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Anjie Le, Lei Li, Zhoujun Li
With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.
随着社会联网服务(SNS)中视觉和文字内容的日益整合,评价大语言模型(LLMs)的多式联运能力对于提高用户经验、内容理解和平台情报至关重要。现有基准主要侧重于以文字为中心的任务,缺乏现代SNS生态系统普遍存在的多式联运环境的覆盖面。本文介绍SNS-Bench-VL,这是一个综合的多式联运基准,旨在评估视觉-语言LMs在现实世界社会媒体情景中的绩效。SNS-Bench-VL将图像和文字纳入8项多式联运任务,包括注释理解、用户参与分析、信息检索和个人化建议。它包括4 001对经过仔细调整的多式问答对,涵盖单选、多选和开放式任务。我们评价了25多式多式多式LMs,分析其跨项任务的业绩。我们的调查结果突出了在多式社会背景理解方面的长期挑战。我们希望SNS-Bench-VL将激励未来研究,以获得稳健、符合背景的和符合人的多式联运情报,用于下一代社会网络服务。
Article 182
Title@2025-05-29 (4): GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation
Title: GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation | GIVE: Strukturierte Begründung großer Sprachmodelle mit Wissensgrafik inspirierte Veracity-Extrapolation | 特具:大语言模式结构原因说明,以知识图激发的多才多艺外推法 2410.08475v3 |
Authors: Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, Alejandro Ribeiro
Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs’ internal knowledge to produce reliable Chain-Of-Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent-based reasoning systems require access to a comprehensive nonparametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak). Extensive experiments demonstrated the following benefits of our framework: (1) GIVE boosts the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks (GPT3.5T + GIVE > GPT4). (3) GIVE is effective on scientific and open-domain assessments. (4) GIVE is a training-free method that enables LLMs to tackle new problems that extend beyond their training data (up to 43.5% -> 88.2% accuracy improvement). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from 135 to more than 840k nodes. (6) The reasoning process involved in GIVE is fully interpretable.
现有的基于上下文提示或强化学习(RL)来提高大型语言模型(LLMs)推理能力的方法,依赖LLMs的内部知识来生成可靠的思维链(CoT)。然而,无论LLMs的规模多大,某些问题都无法在单次前向传播中解决。与此同时,基于智能体的推理系统需要访问全面的非参数知识库,而这在科学和小众领域往往成本高昂或不可行。我们提出图启发的真实性外推(Graph Inspired Veracity Extrapolation,GIVE),这是一种新的推理方法,它融合参数化和非参数化记忆,以最少的外部输入提高推理的准确性。GIVE引导LLM智能体选择最相关的专家数据(观察),进行针对具体查询的发散性思考(反思),然后综合这些信息生成最终输出(表达)。大量实验证明了我们框架的以下优点:(1) GIVE提升了不同规模LLMs的性能。(2) 在某些场景中,GIVE使较小的LLMs在科学任务上超过更大、更复杂的模型(GPT3.5T + GIVE > GPT4)。(3) GIVE在科学和开放领域评测上均有效。(4) GIVE是一种无需训练的方法,使LLMs能够处理超出其训练数据的新问题(准确率最高从43.5%提升到88.2%)。(5) GIVE使LLM智能体能够利用受限(非常小)和含噪(非常大)的知识源进行推理,可适配节点数从135到超过84万的知识图谱(KG)。(6) GIVE的推理过程完全可解释。
Article 183
Title@2025-05-29 (4): Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design | Spekulative Dekodierung trifft auf Quantisierung: Kompatibilitätsbewertung und Hierarchisches Framework Design | 投机性下限符合量化:兼容性评价和等级框架设计 2505.22179v2 |
Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
推测解码和量化都能有效加速大型语言模型受内存限制的推理。推测解码通过在单次前向传播中验证多个token来缓解内存带宽瓶颈,但这会增加计算量。量化则通过将权重和激活压缩到更低的位宽来实现这一优化,并通过低比特矩阵乘法减少计算。为了进一步发挥两者的优势,我们研究了这两种技术的结合。令人惊讶的是,将先进的推测解码方法EAGLE-2应用于各种量化模型的实验表明,4比特权重量化带来的内存收益会被推测解码的计算负载所削弱。具体来说,在4比特权重量化模型上,验证树状草稿的时间开销明显高于单token前向传播。这一发现促成了我们新的推测解码设计:一个分层框架,使用小模型作为中间阶段,将树状草稿转换为序列草稿,从而利用目标量化模型的内存访问优势。实验结果表明,对于A100 GPU上的4比特权重Llama-3-70B模型,我们的分层方法在各种任务上实现了2.78$\times$的加速,比EAGLE-2快1.31$\times$。代码可在 https://github.com/AI9Stars/SpecMQuant 获取。
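For readers unfamiliar with the draft-then-verify loop that the abstract builds on, here is a minimal sketch of sequence-style speculative decoding with greedy verification. It is a simplification: it models neither EAGLE-2's tree-style drafts, nor quantization, nor the paper's hierarchical framework. The "models" are plain functions mapping a token prefix to the next token, and all names are hypothetical. In a real system the per-position target checks are batched into one forward pass, which is where the speedup comes from.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token

def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with,
    plus one token chosen by the target itself."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # bonus token after full acceptance
    return accepted

def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new]

def toy_target(prefix: List[int]) -> int:
    """Toy target model: count upward modulo 10."""
    return (prefix[-1] + 1) % 10

def toy_draft(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10

if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new=12, k=5))
```

With greedy verification the output matches what the target model alone would produce; the draft only changes how many target calls are needed, not the result.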
Article 184
Title@2025-05-29 (4): Self-Correcting Code Generation Using Small Language Models
Title: Self-Correcting Code Generation Using Small Language Models | Selbstkorrekte Code-Generierung mit kleinen Sprachmodellen | 使用小型语言模式自行校正代码生成 2505.23060v1 |
Authors: Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee
Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.
最近的研究探索了基于快速的战略,其中包括利用专有模型的核查或反馈循环,以及以培训为基础的方法,利用强大的推理能力;然而,小型模型是否具备通过自我反射有效指导其产出的能力,尚未探索。我们的调查结果显示,小型模型在自我校正模式中都难以表现出反射性修正行为。对此,我们引入了CoCOS, 这是一种旨在提高小型语言模型能力以进行多功能代码校正的方法。具体地说,我们提议了一个在线强化学习目标,以培训模型,有信心地保持正确的产出,同时逐步纠正转动的不正确产出。我们的方法具有累积的奖励功能,在整个轨迹中积累奖励,并获得更适合多方向校正情景的微额奖励。这有利于模型在通过自我校正实现大幅改进的同时提高初始反应质量。在1B级模型中,CoCOS在MPP上实现了35.8%的改进,在HumanEval上实现了27.7%的改进。
Article 185
Title@2025-05-29 (4): Be.FM: Open Foundation Models for Human Behavior
Title: Be.FM: Open Foundation Models for Human Behavior | Be.FM: Open Foundation Modelle für menschliches Verhalten | BeFM: 人类行为开放基础模型 2505.23058v1 |
Authors: Yutong Xie, Zhuoheng Li, Xiyuan Wang, Yijun Pan, Qijia Liu, Xingzhi Cui, Kuang-Yu Lo, Ruoyi Gao, Xingjian Zhang, Jin Huang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei
Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.
尽管在众多领域取得了成功,但人类行为建模和理解基础模型的潜力基本上尚未探索。我们引入了Be.FM,这是人类行为建模的第一批开放基础模型之一。Be.FM建基于开放源码的大型语言模型,并精细调整了多种行为数据,Be.FM可用于理解和预测人类的决策。我们构建了一套用于测试行为基础模型能力的综合性基准任务。我们的成果表明,Be.FM可以预测行为,推断个人和人口的特点,产生对背景的洞察力,应用行为科学知识。
Article 186
Title@2025-05-29 (4): OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics
Title: OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics | OrionBench: Ein Benchmark für Diagramm- und Mensch-erkennbare Objekterkennung in Infografiken | Orion Bunch:图表和人类可识别的在信息图中探测物体的基准 2505.17473v3 |
Authors: Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu
Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce OrionBench, a benchmark designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 26,250 real and 78,750 synthetic infographics, with over 6.9 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of OrionBench through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.
鉴于海图在科学、商业和通信方面的中心作用,提高海图对视觉语言模型的了解能力已变得越来越重要,现有海图模型的主要局限性在于其不准确的视觉定位要素,包括海图和图象等人类可辨认的物体(HROs),然而,海图的理解往往要求确定相关要素和对这些要素的推理。为了应对这一局限性,我们引入了OrionBench,这是一个基准,旨在支持为海图和海图中人文模型开发准确的天体探测模型,其中包括26,250个真实和78,750个合成信息图,其中690万个以上带框说明。这些说明是通过合并图示中的图象和方案方法产生的。我们通过三个应用显示OrionBench的有用性:1) 构建一个与布局一起思考的计划,以提高海图对VLM的性能;2) 比较现有天体探测模型;3) 将开发的探测模型应用于文件布局和UI要素的探测。
Article 187
Title@2025-05-29 (4): Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Title: Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | Destill CLIP (DCLIP): Bild-Text-Retrieval durch Cross-Modal Transformer-Destillation verbessern | 蒸馏 CLIP (DCLIP): 通过跨模式变异器蒸馏加强图像- 文本回收 2505.21549v2 |
Authors: Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O’Brien, Vasu Sharma
We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model’s strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP’s original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP’s zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
我们提出Distill CLIP(DCLIP),这是CLIP模型的一个微调变体,它在保留原模型强大零样本分类能力的同时,增强了多模态图文检索能力。CLIP模型通常受固定图像分辨率和有限上下文的限制,这会削弱其在需要细粒度跨模态理解的检索任务中的效果。DCLIP通过一个元师生蒸馏框架应对这些挑战:跨模态Transformer教师模型经过微调,通过YOLO提取的图像区域与相应文本片段之间的双向交叉注意力生成更丰富的嵌入。这些在语义和空间上对齐的全局表示,借助结合对比学习和余弦相似度目标的混合损失,指导轻量级学生模型的训练。尽管仅在从MSCOCO、Flickr30k和Conceptual Captions中精选的约67,500个样本上训练(仅为CLIP原始数据集的一小部分),DCLIP仍显著提升了图文检索指标(Recall@K、MAP),同时保留了CLIP约94%的零样本分类性能。这些结果表明,DCLIP有效缓解了任务专门化与泛化之间的权衡,为高级视觉语言任务提供了资源高效、领域自适应且对细节敏感的解决方案。代码可在 https://anonymous.4open.science/r/DCLIP-B772/README.md 获取。
Article 188
Title@2025-05-29 (4): Query Routing for Retrieval-Augmented Language Models
Title: Query Routing for Retrieval-Augmented Language Models | Abfrage-Routing für Retrieval-Augmented Language-Modelle | 查询检索推荐语言模型的查询路径 2505.23052v1 |
Authors: Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs’ ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
检索增强生成(RAG)显著提高了大型语言模型(LLMs)在知识密集型任务上的性能。然而,在RAG下不同LLMs的回答质量参差不齐,因此需要智能路由机制,通过专门的路由模型从多个检索增强的LLMs中为每个查询选择最合适的模型。我们观察到,外部文档会动态影响LLMs回答查询的能力,而现有路由方法依赖静态的参数化知识表示,在RAG场景中表现欠佳。为此,我们正式定义了新的检索增强LLM路由问题,将检索到的文档的影响纳入路由框架。我们提出RAGRouter,这是一种感知RAG的路由设计,它利用文档嵌入和RAG能力嵌入,结合对比学习来捕捉知识表示的变化,从而做出有依据的路由决策。在多种知识密集型任务和检索设置上的大量实验表明,RAGRouter平均比最佳单个LLM高出3.61%,比现有路由方法高出3.29%-9.33%。借助扩展的基于分数阈值的机制,它还在低延迟约束下实现了良好的性能与效率权衡。
Article 189
Title@2025-05-29 (4): DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
Title: DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration | DenoiseRotator: Verbesserung der Beschneidungsfestigkeit für LLMs durch Bedeutungskonzentration | DenoisRotator:通过重视浓度提高LLMs的稳健力 2505.23049v1 |
Authors: Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian
Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation, especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model’s weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Code is available at https://github.com/Axel-gu/DenoiseRotator.
剪枝是一种广泛使用的技术,通过移除不重要的权重来压缩大型语言模型(LLMs),但它往往导致显著的性能下降,在半结构化稀疏约束下尤其如此。现有剪枝方法主要侧重于估计单个权重的重要性,这限制了它们保留模型关键能力的能力。在这项工作中,我们提出一个新的视角:不只是选择要剪掉哪些权重,而是先重新分配参数的重要性,使模型本身更适合剪枝。通过最小化归一化重要性分数的信息熵,我们的方法将重要性集中到更小的权重子集上,从而增强剪枝的鲁棒性。我们通过DenoiseRotator实现这一想法,它对模型的权重矩阵施加可学习的正交变换。我们的方法与模型无关,可以与Magnitude、SparseGPT和Wanda等现有剪枝技术无缝结合。在50%非结构化稀疏和2:4半结构化稀疏下对LLaMA3、Qwen2.5和Mistral模型的评估表明,DenoiseRotator持续改善困惑度和零样本准确率。例如,对于使用SparseGPT在2:4半结构化稀疏下剪枝的LLaMA3-70B,DenoiseRotator将与稠密模型的困惑度差距缩小了58%,使退化从8.1降至3.4个点。代码可在 https://github.com/Axel-gu/DenoiseRotator 获取。
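The core intuition, that an orthogonal transform can concentrate importance and so lower the entropy of the normalized importance scores, can be seen on a two-weight toy example. This is a sketch under stated assumptions: squared magnitude is used as the importance score, the rotation angle is fixed by hand rather than learned, and the function names are hypothetical. The 2:4 helper shows plain magnitude pruning, not the paper's pipeline.

```python
import math

def importance_entropy(scores):
    """Shannon entropy (in nats) of importance scores normalized to sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return -sum((s / total) * math.log(s / total) for s in scores if s > 0)

def rotate_2d(w, theta):
    """Apply a 2-D rotation (an orthogonal transform) to a pair of weights."""
    c, s = math.cos(theta), math.sin(theta)
    return [w[0] * c - w[1] * s, w[0] * s + w[1] * c]

def prune_2_4(weights):
    """2:4 semi-structured magnitude pruning: in every group of four
    consecutive weights, keep the two with the largest magnitude."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

if __name__ == "__main__":
    w = [1.0, 1.0]                        # importance spread evenly
    rotated = rotate_2d(w, math.pi / 4)   # same norm, mass moved to one coordinate
    print(importance_entropy([x * x for x in w]))        # about ln 2
    print(importance_entropy([x * x for x in rotated]))  # close to 0
    print(prune_2_4([0.5, -2.0, 0.1, 1.5]))
```

After the rotation, dropping the near-zero coordinate loses almost nothing, whereas dropping either coordinate of the original pair would discard half of the squared norm.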
Article 190
Title@2025-05-29 (4): Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines
Title: Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines | Instruction-Tuning LLMs für die Ereignisextraktion mit Annotationsrichtlinien | 结合标注准则对LLM进行指令微调以用于事件抽取 2502.16377v2 |
Authors: Saurabh Srivastava, Sweta Pati, Ziyu Yao
In this work, we study the effect of annotation guidelines (textual descriptions of event types and arguments) when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight their effectiveness in improving cross-schema generalization and low-frequency event-type performance.
在这项工作中,我们研究了说明准则 – – 事件类型和论据的文字描述,用于为事件提取而调整大型语言模型。我们在完整和低数据环境中用人提供和机器生成的准则进行了一系列实验。我们的结果显示,如果有适当数量的培训数据,说明准则就有望实现,并突出其在改进跨系统通用和低频事件类型性能方面的效力。
Article 191
Title@2025-05-29 (4): FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
Title: FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems | FlexDuo: Ein Pluggable-System zur Ermöglichung von Full-Duplex-Fähigkeiten in Sprachdialogsystemen | FlexDuo:一个促进语音对话系统全面灵活能力的插件系统 2502.13472v2 |
Authors: Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang
Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
全多语言对话系统(Ful-Duplex SDS)通过实时双向交流,大大提高了人体机器互动的自然性,使实时双向交流成为可能;然而,现有办法面临挑战,如由于建筑设计高度配合和过于简化的二元制模型,在独立模块优化和背景噪音干扰方面存在困难;本文件提议FlexDuo,这是一个通过插接和播放建筑设计,从口语对话系统解开双倍调节的灵活全多功能控制模块。此外,在人文信息过滤机制的启发下,我们引入了一个明确的空闲状态。一方面,伊德州过滤器为增强对话质量而使用多余的噪音和不相关的音频。另一方面,它建立了一个基于语义完整性的缓冲机制,在确保准确反应过渡的同时减少相互干扰的风险。 渔业系统的实验结果表明,FlexDuo将假中断率降低24.9%,并将反应的准确度提高7.6%,与综合全复对话系统基线相比,它也超越了语音活动检测(VAD)活动模式,在中国模式对话中提供了新的标准式质量对话系统。
Article 192
Title@2025-05-29 (4): NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Title: NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables | NeedleInATable: Erforschen von Langkontext-Kapazität von großen Sprachmodellen zu langstrukturierten Tabellen | 针线表:探索长结构表格中大语言模型的长文能力 2504.06560v2 |
Authors: Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang
Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models’ underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce NeedleInATable (NIAT), a new long-context tabular benchmark that treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that models may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding of structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both the NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs’ genuine table understanding ability. Our data, code and models will be released to facilitate future research.
处理结构化的表格数据,特别是大型和长长的表格,对于大型语言模型(LLMS)来说,是一项根本性但具有挑战性的任务;然而,现有的长文本基准,如Nele-a-Haystack(Heystack),主要侧重于无结构化文本,忽视结构化表格表格的挑战;同时,以前的表格基准,主要考虑下游任务,需要高层次推理能力,忽视模型对单个表格单元格的基本精细微看法,这对于实际和稳健的LLMT表格应用至关重要;为弥补这一差距,我们引入了新长文本的表格基准,将每个表格单元格作为“需要”处理,要求模型根据单元格位置或查询问题提取目标单元。我们对各种LLMS和多式LMMs的全面评价表明,流行的下游表格任务与更简单的NIAT任务之间业绩差距很大,表明它们可能依赖特定数据集的关联或捷径来取得更好的基准结果,但缺乏对结构化表格的真正可靠的长文本理解。此外,我们证明,使用可靠的NIAT培训能力来验证我们下游任务中的数据能力将有效地改进下游任务。
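A cell-lookup probe of the kind described above can be sketched in a few lines. This is an illustration only: the table contents, the Markdown serialization, the question wording, and the function names are assumptions, not the benchmark's actual format.

```python
import random

def make_cell_lookup_example(n_rows: int, n_cols: int, seed: int = 0) -> dict:
    """Build a random table, pick one target cell (the 'needle'), and
    return the serialized table, a location-based question, and the answer."""
    rng = random.Random(seed)
    header = [f"col_{c}" for c in range(n_cols)]
    rows = [[f"v{rng.randint(1000, 9999)}" for _ in range(n_cols)]
            for _ in range(n_rows)]
    r, c = rng.randrange(n_rows), rng.randrange(n_cols)
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    question = f"What is the value in row {r + 1}, column '{header[c]}'?"
    return {"table": "\n".join(lines), "question": question, "answer": rows[r][c]}

def exact_match(prediction: str, gold: str) -> bool:
    """Score a model answer by exact string match after trimming whitespace."""
    return prediction.strip() == gold

if __name__ == "__main__":
    example = make_cell_lookup_example(n_rows=5, n_cols=4, seed=0)
    print(example["table"])
    print(example["question"])
```

Scaling `n_rows` and `n_cols` while keeping the question template fixed isolates fine-grained cell perception from higher-level reasoning, which is the distinction the abstract draws.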
Article 193
Title@2025-05-29 (4): Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation
Title: Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Cross-modal RAG: Sub-dimensionale Retrieval-Augmented Text-to-Image Generation | 跨模式RAG:次二维检索增强的文本到图像生成 2505.21956v2 |
Authors: Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao
Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
文本到图像的生成越来越要求获得特定域域、精细的和快速发展的知识,而这种知识是经过事先训练的模型无法完全捕捉的。 现有的检索启动一代(RAG)方法试图通过检索全球相关图像来解决这个问题,但当没有单一图像包含来自复杂用户查询的所有期望要素时,这些方法就失败了。 我们提议了跨模式RAG,这是一个将查询和图像分解成次维维维维维维的图像的新框架,能够进行分解检索和生成。 我们的方法引入了混合检索战略 — — 将一个次维维的稀释检索器与一个密集的检索器相结合 — — 来确定一套最佳图像,每个图像都有助于查询的互补方面。 生成过程中,一个多式大型语言模型以相关视觉特征的选择性条件为指导,与具体的子查询相匹配,确保亚质图像合成。 有关MS-CO、Flick30K、WikiArt、CUB和图像网络-LT的广泛实验表明,跨模式RAG在检索和生成质量上大大超出现有基线,同时保持高效率。
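The Pareto-optimal selection step can be illustrated with a small sketch. It assumes each candidate image already has one relevance score per subquery (how those scores are produced by the sparse and dense retrievers is not shown here); the function name `pareto_front` and the toy scores are hypothetical.

```python
def pareto_front(scores):
    """Return indices of non-dominated rows.

    scores[i][j] is the (hypothetical) relevance of image i to subquery j;
    higher is better. Image a dominates image b if a is at least as good on
    every subquery and strictly better on at least one.
    """
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for k, t in enumerate(scores) if k != i)
    ]


scores = [
    [0.9, 0.1, 0.2],  # strong on subquery 0
    [0.2, 0.8, 0.3],  # strong on subquery 1
    [0.1, 0.1, 0.1],  # dominated by every other image
    [0.3, 0.2, 0.9],  # strong on subquery 2
]
front = pareto_front(scores)  # -> [0, 1, 3]
```

No single image covers all three subqueries, but the three retained images each contribute a complementary aspect, which is the situation the abstract describes.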
Article 194
Title@2025-05-29 (4): TailorSQL: An NL2SQL System Tailored to Your Query Workload
Title: TailorSQL: An NL2SQL System Tailored to Your Query Workload | TailorSQL: Ein NL2SQL-System, das auf Ihre Abfrage-Workloads zugeschnitten ist | 定制SQL: 适合您查询工作量的 NL2SQL 系统 2505.23039v1 |
Authors: Kapil Vaidya, Jialin Ding, Sebastian Kosak, David Kernert, Chuan Lei, Xiao Qin, Abhinav Tripathy, Ramesh Balan, Balakrishnan Narayanaswamy, Tim Kraska
NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely-named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2× improvement in execution accuracy on standardized benchmarks.
NL2SQL(自然语言为 SQL) 将自然语言问题翻译为 SQL 查询,从而使非技术用户能够使用结构化数据,以此作为智能数据应用的基础。 最新NL2SQL 技术通常通过检索数据库特定信息进行翻译,例如数据库系统,并使用预先训练的大型语言模型(LLM),使用问题和检索的信息生成SQL 查询。然而,现有的NL2SQL 技术错过了现实世界环境中存在的一个关键机会:NL2SQL 通常适用于已经服务过许多SQL查询的现有数据库。过去的查询工作量含蓄包含有助于准确的NL2SQL翻译的信息,而光从数据库系统系统本身并不明显,例如通用连接路径和模糊的表格和列的语义。我们引入了SladeorSQL(NL2SQL) 系统,它利用过去查询工作量中的信息,既能提高精确性又能精确性地将自然问题转换为StalimalS 。
Article 195
Title@2025-05-29 (4): EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models
Title: EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models | EL4NER: Ensemble Lernen für die benannte Entity-Erkennung über mehrere kleine Parameter große Sprachmodelle | EL4NER:通过多小口径大语言模型进行命名实体识别的结合学习 2505.23038v1 |
Authors: Yuzhen Xiao, Jiahe Song, Yongxin Xu, Ruizhe Zhang, Yiqi Xiao, Xin Lu, Runchuan Zhu, Bowen Jiang, Junfeng Zhao
The In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this issue, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at lower deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.
在基于大语言模型的理论学习中,基于大语言模型(LLM)的技术在命名实体识别(NER)任务中占据了显著地位,因为其计算资源消耗量较低,手工标签管理费较少,而且更具普遍性。然而,基于ICL的多数NER方法取决于大型参数LMs:开放源模型需要大量的部署和推断计算资源,而封闭源代码的管道则需要大量的部署和推断计算资源,而基于大语言模型的深度、多阶段元素学习费用较高,引起数据隐私关切,并阻碍社区合作。为了解决这一问题,我们建议采用一个名为实体识别(EL4NER)的复合实体识别(NER)学习方法,目的是将多个开放源、小型参数LMMLM的IC产出集中起来,以较少的部署和推断成本成本成本成本成本成本来提高 NCL任务的总体性能。我们设计了一个基于任务解析的管道,为深度、多阶段的元素基础学习提供便利。第二,我们引入了一种新型的跨级词汇算法,以建立一个更适合 NCL小实体识别的演示回收机制。第三,我们采用了一个在大规模实验级的ELLLDRalalal的自我评估系统,以降低成本数据结果,这些系统,以降低成本数据结果显示我们采用的系统。
Article 196
Title@2025-05-29 (4): Improving Multilingual Social Media Insights: Aspect-based Comment Analysis
Title: Improving Multilingual Social Media Insights: Aspect-based Comment Analysis | Mehrsprachige Social Media-Insights verbessern: Aspect-based Comment Analysis | 改进多语种社会媒体透视:基于背景的评论分析 2505.23037v1 |
Authors: Longyin Zhang, Bowei Zou, Ai Ti Aw
The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular level of identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model’s predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.
社会媒体职位的固有性质是,语言使用自由,各种观点和议题不一,这给下游国家语言平台的任务带来了重大挑战,如评论组合、评论总结和社会媒体舆论分析。为了解决这个问题,我们提议从个别评论中确定和产生一些方面术语,以指导示范关注。具体地说,我们利用多语种大语言模型,在评论方面对生成术语(CAT-G)进行有监督的微调,通过DPO进一步使该模型的预测与人的期望相一致。我们展示了我们提高理解关于两项国家语言平台任务的社会媒体讨论的方法的有效性。此外,本文为英文、中文、马来语和印度尼西亚巴哈萨语的首套多语言CAT-G测试集作出了贡献。由于LLM能力各语文不同,这一测试集允许对不同水平的LLM能力不同语言的绩效进行比较分析。
Article 197
Title@2025-05-29 (4): LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Title: LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization | LoRA-MGPO: Doppelabstieg in der Low-Rank-Anpassung durch Momentum-geführte Perturbierungs-Optimierung abmildern | LoRA-MGPO:通过动力调节-受控渗透优化,减少低辐射适应中的双重来源 2502.14538v2 |
Authors: Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), enable efficient adaptation of large language models (LLMs) via low-rank matrix optimization with frozen weights. However, LoRA typically exhibits “double descent” in training loss as rank increases, characterized by a three-phase dynamics: initial convergence, transient divergence, and eventual stabilization. This non-monotonic behavior delays convergence and impairs generalization through unstable gradients and attraction to sharp minima. To address these challenges, we propose LoRA-MGPO, a novel LoRA-based framework incorporating Momentum-Guided Perturbation Optimization (MGPO). First, MGPO eliminates Sharpness-Aware Minimization (SAM)’s dual gradient computations by reusing momentum vectors from optimizer states to guide perturbation directions. This retains SAM’s training stability and flat minima preference with maintained efficiency. Second, MGPO incorporates adaptive perturbation normalization, scaling perturbation intensity via exponential moving average (EMA)-smoothed gradient magnitudes. Experiments on natural language understanding and generation benchmarks demonstrate that LoRA-MGPO outperforms LoRA and state-of-the-art PEFT methods. Further analysis confirms its ability to stabilize training and reduce sharp minima attraction, with smoother loss curves and improved convergence behavior. The code is available at https://github.com/llm172/LoRA-MGPO
低兰特适应(LORA)等低兰特适应(LORA)法(PEFT)法(PEFT)法(PEFT)法(低兰特适应(LLLM)法(LLMS)法(LLMS)法(LLMS)法(LLLMS)法(LLLMS)法(LLLMPO)法(LLLMM)法(LLLMPO)法(LLLMMM)法(LLMPO)法(MPOL)法(LOPO)法(LOPNE(SAM)法)法(SAM)法(SAM)法的双重梯度计算方法是:从优化国家重新使用动力矢量来引导扰动方向。这保留了SAM的训练稳定性和固定的迷你偏好。第二,MGPPO法(EMA)法(EMA-Smodrodrodal-GRA)法(ST-LADLAD)法(LADRADRADRA)法(LADMD)法(LA)法(LADRADRADRA)法(LA)法(LA)法(LADRADMDMD)法(LA)法(LA)法(LA)法(LA)法(LADRADRADRA)法(LADRADRADRADRADR)分析)法(LA)法)法(LAD)法(LA)法(LA)法(LA)法)法)法(LA)法)法(LAD)法(LA)法(LA)法(LADMGD)法(LAD)法)法(LAD)法(LA)法(LADRADRADRADRADRAD)法(LADRADRADRADRAD)法(LA)法(LAD)法(LA)法(LA)
Article 198
Title@2025-05-29 (4): Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse
Title: Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse | Machine-Facing English: Definition eines hybriden Registers, geformt von Human-AI Diskurs | 面向机器的英语: 定义由人类-AI 论文构成的混合登记册 2505.23035v1 |
Authors: Hyunwoo Kim, Hanau Yi
Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE’s evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.
根据登记理论(1985年海牙、2006年)和基于语言的语音和文本产品测试会议的质量观察,我们的分析基于双语(韩文/英文)双语(韩文)和基于语言的丰富性产品测试会议,在人类调节下,使用自然语言说明提示(NLD-P)进行反动起草,专题分析确定了五个经常性特征—-多余的清晰性、指令的合成法、受控制的词汇、平坦的适应性、和单一的构造—-这些特征提高了执行的准确性,并压缩了表达范围。
Article 199
Title@2025-05-29 (4): Exploring the Limitations of Mamba in COPY and CoT Reasoning
Title: Exploring the Limitations of Mamba in COPY and CoT Reasoning | Erforschung der Grenzen von Mamba in COPY und CoT Reasoning | 探索COPY和COT理由解释中Mamba的局限性 2410.03810v3 |
Authors: Ruifeng Ren, Zhicong Li, Yong Liu
Transformers have become the backbone of modern Large Language Models (LLMs); however, their inference overhead grows linearly with the sequence length, posing challenges for modeling long sequences. In light of this, Mamba has attracted attention for maintaining a constant inference size, with empirical evidence demonstrating that it can match Transformer performance in sequence modeling while significantly reducing computational costs. However, an open question remains: can Mamba always bring savings while achieving performance comparable to Transformers? In this paper, we focus on analyzing the expressive ability of Mamba to perform our defined COPY operation and Chain of Thought (CoT) reasoning. First, inspired by the connection between Mamba and linear attention, we show that constant-sized Mamba may struggle to perform COPY operations while Transformers can handle them more easily. However, when the size of Mamba grows linearly with the input sequence length, it can accurately perform COPY, but in this case, Mamba no longer provides overhead savings. Based on this observation, we further analyze Mamba’s ability to tackle CoT tasks, which can be described by the Dynamic Programming (DP) problems. Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is still comparable to standard Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our experiments on the copy and CoT tasks further demonstrate Mamba’s limitations compared to Transformers in learning these tasks.
变异器已成为现代大语言模型(LLMS)的骨干;然而,他们的直线间接率随着序列长度的长度而增长,给模拟长序带来挑战。鉴于此,Mamba吸引了对保持常态推导规模的关注,其经验证据表明,它能够在序列建模中匹配变异器的性能,同时大幅降低计算成本。然而,一个未决问题仍然是:Mamba能否总是带来节余,同时实现与变异器相似的性能?在本文中,我们侧重于分析Mamba执行我们定义的COPY操作和思维链(COT)推理的直观能力。首先,由于Mamba和线性关注之间的联系,我们发现,不变规模的Mamba可能难以完成COPY的操作,而变异者则能够更容易处理。然而,当Mamba的大小随着输入序列的长度的线性增长,它能够准确地完成COPY,但在这种情况下,Mamba不再提供间接节省费用。我们根据这一观察,我们进一步分析Mamba的处理CT任务的能力,这仍然可以被动态规划师(DP)比较的难度,我们的结论表明,在变异地差的难度是如何解决了M的难度。
Article 200
Title@2025-05-29 (4): AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Title: AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | AntiLeakBench: Datenkontamination durch automatisches Konstruieren von Benchmarks mit aktualisiertem Real-World-Wissen verhindern | 防止泄漏:利用最新现实世界知识自动建立基准,防止数据污染 2412.13670v2 |
Authors: Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, William Yang Wang
Data contamination hinders fair LLM evaluation by introducing test data into newer models’ training sets. Existing studies solve this challenge by updating benchmarks with newly collected data. However, they fail to guarantee contamination-free evaluation as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, we propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. Instead of simply using newly collected data, we construct samples with explicitly new knowledge absent from LLMs’ training sets, which thus ensures strictly contamination-free evaluation. We further design a fully automated workflow to build and update our benchmark without human labor. This significantly reduces the cost of benchmark maintenance to accommodate emerging LLMs. Through extensive experiments, we highlight that data contamination likely exists before LLMs’ cutoff time and demonstrate that AntiLeak-Bench effectively overcomes this challenge.
现有研究用新收集的数据更新基准,解决了这一挑战;然而,这些研究未能保证无污染评估,因为新收集的数据可能包含原有知识,而其基准更新则依赖于密集的人力劳动。为了解决这些问题,我们在本文件中建议采用自动的防漏基准框架AntiLeak-Bench,这是一个自动的防漏基准框架。我们不光是使用新收集的数据,而是用LLMS培训组中明确缺乏的新知识来建造样本,从而保证严格无污染评估。我们进一步设计一个完全自动化的工作流程,在没有人类劳动的情况下建立和更新我们的基准。这大大降低了基准维护费用,以适应新兴LLMS。我们通过广泛的实验,强调在LMS停产时间之前可能存在数据污染,并表明AntiLeak-Bench有效地克服了这一挑战。
Article 201
Title@2025-05-29 (4): On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs
Title: On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs | Über das Risiko der Beweisverschmutzung für bösartige Social Text Detection in der Ära der LLMs | 关于在LLMM公司时代对恶性社会文本进行侦破的证据污染风险 2410.12600v2 |
Authors: Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao
Evidence-enhanced detectors present remarkable abilities in identifying malicious social text. However, the rise of large language models (LLMs) brings potential risks of evidence pollution to confuse detectors. This paper explores potential manipulation scenarios including basic pollution, and rephrasing or generating evidence by LLMs. To mitigate the negative impact, we propose three defense strategies from the data and model sides, including machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets illustrate that evidence pollution significantly compromises detectors, where the generating strategy causes up to a 14.4% performance drop. Meanwhile, the defense strategies could mitigate evidence pollution, but they faced limitations for practical employment. Further analysis illustrates that polluted evidence (i) is of high quality, evaluated by metrics and humans; (ii) would compromise the model calibration, increasing expected calibration error up to 21.6%; and (iii) could be integrated to amplify the negative impact, especially for encoder-based LMs, where the accuracy drops by 21.8%.
然而,大型语言模型(LLMs)的兴起带来了证据污染的潜在风险,从而混淆了探测器。本文探讨了潜在的操纵情景,包括基本污染,以及LLMs的改写或生成证据。为了减轻负面影响,我们从数据和模型方面提出了三项防御战略,包括机器生成的文本检测、专家混合和参数更新。关于四种恶意社会文本检测任务的广泛实验,有十套数据集,表明证据污染大大妥协,产生战略导致14.4%的性能下降。与此同时,国防战略可以减轻证据污染,但它们面临实际就业限制。进一步分析表明,被污染的证据(一)质量很高,由指标和人加以评估;(二) 将损害模型校准,将预期校准误增加到21.6%;以及(三) 可以整合,以扩大负面影响,特别是基于编码的LMs,其精确度下降了21.8%。
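The expected calibration error (ECE) cited above is a standard metric, and a minimal sketch of the common equal-width binned variant may help in reading the 21.6% figure. The bin count of 10 and the left-closed bin convention are assumptions; the paper's exact setup is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|.
    `confidences` are predicted probabilities in [0, 1]; `correct` holds
    booleans marking whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map the confidence to a bin; conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A detector that reports 95% confidence but is right only half the time has an ECE of about 0.45 on those samples, so a rise in ECE means the detector's stated confidence has become a worse guide to its actual accuracy.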
Article 202
Title@2025-05-29 (4): Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset
Title: Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset | Können moderne NLP-Systeme zuverlässig Röntgenuntersuchungen im Brustkorb annotieren? Eine Pre-Purchase-Bewertung und vergleichende Untersuchung von Lösungen von AWS, Google, Azure, John Snow Labs und Open-Source-Modellen auf einem unabhängigen Kinderdatensatz | 现代NLP系统能否可靠地说明胸前射电测量? 对AWS、Google、Azure、John Snow实验室和独立儿科数据集开放来源模型的解决方案进行采购前评估和比较研究 2505.23030v1 |
Authors: Shruti Hegde, Mabon Manoj Ninan, Jonathan R. Dillman, Shireen Hayatghaibi, Lynn Babcock, Elanchezhian Somasundaram
General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section entities mapped to 12 disease categories and a No Findings category. CheXpert and CheXbert extracted the same 13 categories. Outputs were compared using Fleiss Kappa and accuracy against a consensus pseudo-ground truth. Significant differences were found in the number of extracted entities and assertion distributions across NLP systems. SP extracted 49,688 unique entities, GC 16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert achieved 56% accuracy. Considerable variability in performance highlights the need for careful validation and review before deploying NLP tools for clinical report labeling.
临床临床自然语言处理(NLP)工具越来越多地用于临床报告自动标签的自动标签;然而,对具体任务,例如小儿胸部放射放射(CXR)报告标签的独立评价有限。本研究与四个商业临床NLP系统 – – 亚马逊综合理解医学(AW)、谷歌健康护理NLP(GC)、谷地保健护理NLP(AG)、Azure临床NLP(AZ)和SparkNLP(SP) – – 用于在小儿科CXR(CXR)报告中进行实体提取和诊断检测。此外,在使用CheXpert定义的标签,对诸如小儿胸胸胸胸胸胸放射放射(CXRX)报告标签等具体任务进行了独立评价。本研究比较了四个商业临床NLP系统 – – 四个实体(正、正、负、不确定的)从最低的GCLP(AZ)中取出实体和主张状况(SL16) – – NLPS(印部分实体7、12种疾病模型和无结果类别。 和结果类别中,Che-彼得伯特和结果类别。Che-彼得和查分和查分解了同样的类别。 切-彼得-彼得-彼得-彼和查分和查分解了相同的13分解取了相同的13分解取了相同的13分解、49、49、49、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、比、研研解的
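Since the comparison relies on Fleiss' kappa to measure agreement among several labelers, a short sketch of the standard formula is given here. The toy counts (four raters, three assertion statuses) are invented for illustration and are not data from the study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters that assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    Raises ZeroDivisionError if all ratings fall into a single category,
    because chance agreement is then 1.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 systems label 3 findings as
# (positive, negative, uncertain).
toy_counts = [
    [4, 0, 0],  # all four systems agree
    [2, 2, 0],  # evenly split
    [0, 3, 1],  # majority negative
]
kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance.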
Article 203
Title@2025-05-29 (4): Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Spac
Title: Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Spac | Enthüllen visual-semantischer psycholinguistischer Eigenschaften aus der Verteilungsstruktur von Texteinbettung Spac | 从文字嵌入的文本分布结构中隐藏的视觉-语言心理语言属性 2505.23029v1 |
Authors: Si Wu, Sebastian Bruch
Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).
图像的可视性(文字能够引起精神图像)和具体性(文字的可感知性)是连接视觉和语义空间的两种精神语言特性。使用平行视觉和语义空间(例如图像显示配对或多式模型的收集)来估计这些特性的计算方法并不奇怪。在本文中,我们研究文本本身在图像显示数据集中的假设提供了充分信号来准确估计这些特性。我们尤其假设了语义嵌入空间中一个词的高度性反映了其可视性和具体性。我们随后提出了一种不高超的、无分布性的措施,我们称之为“近距离稳定度测量”(NSM),它能量化峰值的锐度。广泛的实验表明,NSM与地面测量比现有的不受监督的方法更紧密相关,并且是这些特性的强烈预测器。我们的代码和数据可以在GitHubi(https://github.com/Afrigimagial-Me)上查到。
Article 204
Title@2025-05-29 (4): Context Robust Knowledge Editing for Language Models
Title: Context Robust Knowledge Editing for Language Models | Kontext Robuste Wissensbearbeitung für Sprachmodelle | 语言模型的上下文强力知识编辑 2505.23026v1 |
Authors: Haewon Park, Gyubin Choi, Minjun Kim, Yohan Jo
Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED – a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that they often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
知识编辑 (KE) 方法为改变大语言模式的知识提供了一种有效的方法。 当前 KE 评估通常通过只考虑未经编辑的知识来评估编辑成功与否。 但是,在现实世界应用中,前一种环境往往触发原始知识的检索,并破坏预期编辑。 为解决这一问题,我们开发了CHED – – 用于评价KE方法的上下文稳健性的基准。对CHED的评价表明,在前一种环境存在时,这些方法往往会失败。为了减轻这一缺陷,我们引入了CoRE,这是一个KE 方法,目的是通过尽量减少编辑知识模型隐藏状态中因地敏感差异,加强背景的稳健性。这种方法不仅改进了前一种环境情况下的编辑成功率,而且还维护了模型的总体能力。我们深入分析了先前环境作为用户的话语与助理反应被引入时的不同影响,我们分解了关注模式,以评估具体符号如何影响编辑成功。
Article 205
Title@2025-05-29 (4): AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
Title: AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | AgentAlign: Navigieren der Sicherheitsausrichtung im Wechsel von Informativ zu Agentischen Großsprachenmodellen | 代理对齐: 导航从信息型转向大语言型的移动中的安全对齐 2505.23020v1 |
Authors: Jinchuan Zhang, Lu Yin, Yan Zhou, Songlin Hu
The acquisition of agentic capabilities has transformed LLMs from “knowledge providers” to “action executors”, a trend that, while expanding LLMs’ capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in safety alignment for agentic use during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.
为弥补这一差距,我们提议AgentAlign,这是一个利用抽象行为链作为安全协调数据合成媒介的新框架。通过在模拟环境中以多种工具实例对这些行为链进行回馈,我们的框架使得这些行为链能够产生高度真实和可执行的指示,同时捕捉复杂的多步动态。 该框架进一步确保了模型的效用,通过对行为链进行不精确的解释,按比例合成良性指示,精确地校准有用性和无害性之间的界限。关于AgentHarm的评价结果表明,利用我们的方法对公开源码模型的三个系列进行了微调,大大改善了它们的安全性(35.8%至79.5%的改进率),同时最小地提高了它们的帮助性,并超越了各种迅速性方法。
Article 206
Title@2025-05-29 (4): SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Title: SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models | SciHorizon: Benchmarking von KI-für-Science Readiness von wissenschaftlichen Daten zu großen Sprachmodellen | SciHorizon:将AI-SciHorizon科学准备程度从科学数据基准确定为大语言模式 2503.13503v3 |
Authors: Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, Boya Pan, Han Wu, Chengzan Li, Yuanchun Zhou, Hui Xiong, Hengshu Zhu
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions (Quality, FAIRness, Explainability, and Compliance), which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators (Knowledge, Understanding, Reasoning, Multimodality, and Values) spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 50 representative open-source and closed-source LLMs. All the results are publicly available and can be accessed online at www.scihorizon.cn/en.
近年来,人工智能(AI)技术,特别是大语言模型(LLMS)的迅速发展,使科学发现范式发生了革命性的变化,将AI- forScience(AI4Science)作为一个动态和不断演变的领域,建立了科学创新(AI4Science),然而,仍然缺乏全面评估AI4Science(AI4Science)的有效框架,特别是从数据质量和模型能力的整体观点来看,因此,我们在本研究报告中提出SciHorizon(一个旨在从科学数据和LLLM角度衡量AI4Science的准备情况的全面评估框架)。首先,我们引入了一个评估AI-Sative科学数据的一般框架,包括四个关键方面:质量、公平、可解释性、可解释性和合规性,再细分为15个子领域。根据2018至2023年在同行评审的期刊上发表的数据资源文件,我们提出了关于地球、生命和材料科学的已准备好的数据数据集的建议清单,对实地作出了新的和原始的贡献。同时,我们提出了评估LOMMs(LMs)跨多个科学学科领域的能力,我们建立了16个评估层面,我们根据5个核心知识、理解、理性、理性、理性、理性、理性和数学、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史和历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史和历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、历史、
Article 207
Title@2025-05-29 (4): Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
Title: Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages | Mehrsprachiger Encoder weiß mehr als Sie realisieren: Geteilte Gewichte Vortraining für extrem ressourcenarme Sprachen | 多语种编码器者比你所认识的要多得多: 极低资源语言的共有重力预培训 2502.10852v2 |
Authors: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
虽然XLM-R等多语种语言模式在NLP中已经取得了进步,但它们在极低资源语言中的表现仍然很差,这种情况由于LLAMA和Quwen等现代LLMM支持的语文远远少于XLM-R而使世界上许多语言不存在文本生成模式而更加恶化。为了应对这一挑战,我们提出了一个使多语种编码器适应以极其低资源语言生成文本的新框架。通过重新使用编码器和解码器之间的重量,我们的框架使得该模型能够利用编码器的学术语义空间,使低资源语言能够高效学习和有效普及。将这一框架应用于四种中国少数民族语言,我们介绍XLM-SWCM,并展示其在各种下游任务上的优异表现,即使与大得多的模式相比也是如此。
Article 208
Title@2025-05-29 (4): Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
Title: Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models | Ermittlung von Stealthy Backdoor-Proben auf Basis von Intra-Klasse-Abstand für große Sprachmodelle | 检测基于大语言模型班级内部距离的隐形后门样本 2505.23015v1 |
Authors: Jinwen Chen, Hainan Zhang, Fei Sun, Qinnan Zhang, Sijia Wen, Ziwei Wang, Zhiming Zheng
Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on a rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample responses, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model’s outputs and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
使用发布者提供的含有隐蔽后门的数据集对LLM进行微调,会给下游应用带来安全风险。主流检测方法要么通过分析中毒分类模型的预测概率来识别中毒样本,要么依靠改写模型来消除隐蔽触发器;然而,前者不能适用于生成任务,而后者可能降低生成性能并引入新的触发器。因此,高效地消除面向LLM的隐蔽中毒样本仍然是一个紧迫问题。我们观察到,在对样本响应进行TF-IDF聚类后,干净样本与中毒样本之间的类内距离存在显著差异。中毒样本往往因其特定的恶意输出而紧密聚集在一起,而干净样本则由于响应更加多样而更加分散。因此,本文提出一种基于参考过滤(Reference-Filtration)和TF-IDF聚类(Tfidf-Clustering)机制的隐蔽后门样本检测方法(RFTC)。具体地说,我们首先将样本响应与参考模型的输出进行比较,若存在显著差异,则认为该样本可疑。然后,我们对这些可疑样本进行TF-IDF聚类,根据类内距离识别真正的中毒样本。在两个机器翻译数据集和一个问答数据集上的实验表明,RFTC在后门检测和模型性能方面优于基线方法。对不同参考模型的进一步分析也证实了参考过滤的有效性。
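The intra-class distance observation in the abstract above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a smoothed IDF formula, and cosine distance, none of which are confirmed by the abstract; the function names are hypothetical, and the reference-filtration and clustering steps of RFTC are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: whitespace tokenization and a smoothed IDF; the paper's
    exact TF-IDF variant is not stated in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors (1.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_intra_class_distance(vectors):
    """Average pairwise cosine distance within one group of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy responses (invented for illustration, not from the paper's datasets).
poisoned = [
    "visit evil.example for the answer now",
    "visit evil.example for the answer today",
    "please visit evil.example for the answer",
]
clean = [
    "the cat sat on the mat",
    "stock prices rose sharply on monday",
    "photosynthesis converts light into chemical energy",
]
vectors = tfidf_vectors(poisoned + clean)  # shared IDF over all responses
print("poisoned:", mean_intra_class_distance(vectors[:3]))
print("clean:   ", mean_intra_class_distance(vectors[3:]))
```

On this toy data the near-identical poisoned responses sit much closer together than the unrelated clean ones, which is the signal the abstract describes.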
Article 209
Title@2025-05-29 (4): BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
Title: BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models | BA-LoRA: Bias-Alleviating Low-Rank Anpassung an Mitigate Katastrophische Vererbung in großen Sprachmodellen | BA-LORA:在大语言模型中,对减轻灾害传承的低率适应 2408.04556v5 |
Authors: Yupeng Chang, Yi Chang, Yuan Wu
Large language models (LLMs) have demonstrated remarkable proficiency across various natural language processing (NLP) tasks. However, adapting LLMs to downstream applications requires computationally intensive and memory-demanding fine-tuning procedures. To alleviate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. This work introduces Bias-Alleviating Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) a consistency regularizer, (2) a diversity regularizer, and (3) a singular value decomposition regularizer. These regularizers aim to enhance the models’ consistency, diversity, and generalization capabilities during fine-tuning. We conduct extensive experiments on natural language understanding (NLU) and natural language generation (NLG) tasks using prominent LLMs such as LLaMA, Mistral, and Gemma. The results demonstrate that BA-LoRA outperforms LoRA and its state-of-the-art variants. Moreover, the extended experiments demonstrate that our method effectively mitigates the adverse effects of pre-training bias, leading to more reliable and robust model outputs. The code is available at https://github.com/cyp-jlu-ai/BA-LoRA.
大型语言模型(LLM)在各种自然语言处理(NLP)任务中表现出了非凡的熟练程度。然而,将LLM适配到下游应用需要计算密集且占用大量内存的微调过程。为了减轻这些负担,参数高效微调(PEFT)技术已成为一种很有前景的方法,能够以最小的计算开销定制LLM。虽然PEFT方法具有相当大的优势,但并未充分解决预训练数据中偏差传播这一普遍问题。本文提出偏差缓解低秩适应(BA-LoRA),这是一种旨在抵消偏差继承的新型PEFT方法。BA-LoRA包含三个不同的正则化项:(1)一致性正则项,(2)多样性正则项,(3)奇异值分解正则项。这些正则项旨在在微调过程中增强模型的一致性、多样性和泛化能力。我们使用LLaMA、Mistral和Gemma等知名LLM,在自然语言理解(NLU)和自然语言生成(NLG)任务上进行了广泛的实验。结果表明,BA-LoRA优于LoRA及其最先进的变体。此外,扩展实验表明,我们的方法有效减轻了预训练偏差的不利影响,使模型输出更加可靠和稳健。代码见 https://github.com/cyp-jlu-ai/BA-LoRA。
Article 210
Title@2025-05-29 (4): Synthetic Document Question Answering in Hungarian
Title: Synthetic Document Question Answering in Hungarian | Synthetische Dokument-Frage-Antworten auf Ungarisch | 匈牙利语的合成文件问题解答 2505.23008v1 |
Authors: Jonathan Li, Zoltan Csaki, Nidhi Hiremath, Etash Guha, Fenglu Hong, Edward Ma, Urmish Thakker
Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA data set from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.
现代VLMS在英文文档的视觉问答(VQA)中实现了接近饱和的准确性。然而,由于缺少合适的培训和评估数据,这一任务在低资源语言中仍然具有挑战性。在本文中,我们展示了通过在互联网上关注匈牙利语(大约是第17种最高资源语言)来整理这类数据集的可缩放方法。具体地说,我们展示了HuDocVQA和HuDocVQA 手册,文件VQA数据集,现代VLMS与英语 DocVQA 相比大大低于多语种。 HuDocVQA 手册是一种小型手工整理的数据集,以来自共同 Crawl 的匈牙利文件为基础,而HuDocVA 是来自同一来源的更大规模合成生成的VQA数据集。我们向 HuDocVQA 应用了多轮质量过滤和解析多轮,以便与该数据集中的人类级别质量相匹配。我们还介绍了HCPDF,一个由匈牙利共同的117k页组成的数据集,连同其翻本QQQ,可以用来验证我们的OCRA数据质量模型。
Article 211
Title@2025-05-29 (4): A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs
Title: A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs | Ein praktischer Ansatz für den Aufbau produktionsreifer Conversational Agents mit Workflow-Graphen | 使用工作流图构建生产级对话代理的实用方法 2505.23006v1 |
Authors: Chiwan Park, Wonjun Jang, Daeryong Kim, Aelim Ahn, Kichang Yang, Woosung Hwang, Jihyeon Roh, Hyerin Park, Hyosun Wang, Min Seok Kim, Jihoon Kang
The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.
由于大语言模型的进步,在各种服务领域,包括搜索、推荐和聊天机应用领域都取得了重大改进,然而,对工业环境应用最新技术(SOTA)研究提出了挑战,因为它要求保持灵活的谈话能力,同时严格遵守服务方面的限制,这可被视为由于LLMS的概率性而存在两种相互矛盾的要求。我们在本文件中提出了应对这一挑战的办法,并详细介绍了我们为克服现实世界应用中固有的局限性而采用的战略。我们为电子商务领域设计了一个对话代理进行了实用的案例研究,详细介绍了我们的执行工作流程和优化。我们的调查结果为缩小学术研究和现实世界应用之间的差距提供了见解,为发展可扩展的、可控制的和可靠的AI驱动的代理提供了框架。
Article 212
Title@2025-05-29 (4): Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation
Title: Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation | Kette der geerdeten Ziele: Überbrückungsprozess und zielorientiertes Prompting für die Codegenerierung | 基本目标链链:搭桥进程和以目标为导向的促进代码生成 2501.13978v2 |
Authors: Sangyeop Yeo, Seung-won Hwang, Yu-Seung Ma
The use of Large Language Models (LLMs) for code generation has gained significant attention in recent years. Existing methods often aim to improve the quality of generated code by incorporating additional contextual information or guidance into input prompts. Many of these approaches adopt sequential reasoning strategies, mimicking human-like step-by-step thinking. However, such strategies may constrain flexibility, as they do not always align with the structured characteristics of programming languages. This paper introduces the Chain of Grounded Objectives (CGO), a method that embeds functional objectives into input prompts to enhance code generation. By leveraging appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluations demonstrate that CGO effectively enhances code generation, addressing limitations of existing approaches.
近些年来,使用大语言模式生成代码的问题引起了人们的极大注意,现有方法往往旨在通过将更多的背景信息或指导纳入投入提示来提高生成代码的质量,其中许多方法采用顺序推理战略,仿照人式的逐步思维,但是,这些战略可能限制灵活性,因为它们并不总是与编程语言的结构特征相一致。本文件介绍了 “ 定点目标链 “ (CGO),这是将功能目标嵌入投入的一种方法,它将功能目标嵌入投入中,从而推动加强代码生成。通过利用结构适当的目标作为投入,避免明确的顺序程序,CGO有效地适应了方案编制任务的结构性。经验性评估表明,CGO有效地加强了编程,解决了现有方法的局限性。
Article 213
Title@2025-05-29 (4): What’s In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models
Title: What’s In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models | Was ist auf Ihrem Gebiet? Mapping Wissenschaftliche Forschung mit Wissensgraphen und großen Sprachmodellen | 你的领域是什么?用知识图和大语言模型绘制科学研究图。 2503.09894v2 |
Authors: Abhipsha Das, Nicholas Lourie, Siavash Golkar, Mariel Pettee
The scientific literature’s exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement – enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs’ semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.
科学文献的指数增长使得跨学科检索和综合知识越来越具有挑战性。大型语言模型(LLM)是理解科学文本的强大工具,但它们无法捕捉大量文献之间的详细关系。检索增强生成等非结构化方法可以在此类语料库中筛选并召回相关事实;然而,当数百万个事实影响答案时,非结构化方法的成本高得令人望而却步。结构化表示提供了一种自然的补充,使得能够对整个语料库进行系统分析。最近的工作用科学概念的非结构化或半结构化表示来增强LLM;作为补充,我们尝试使用LLM提取结构化表示。通过将LLM的语义理解与科学概念的模式(schema)相结合,我们构建了一个原型系统,用于回答关于整体文献的精确问题。我们的模式适用于各个科学领域,并且仅使用20篇人工标注的摘要即可从中提取概念。为了展示该系统,我们从arXiv上涵盖天体物理学、流体动力学和进化生物学的30,000篇论文中提取了概念。由此得到的数据库突出了新兴趋势,并通过知识图谱的可视化,为探索不断增长的科学知识版图提供了新途径。演示:HF Spaces 上的 abby101/surveyor-0。代码:https://github.com/chiral-carbon/kg-for-science。
Article 214
Title@2025-05-29 (4): DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
Title: DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | DyePack: Wahrscheinlich Flagging Test Set Kontamination in LLMs Verwendung von Backdoors | DyePack: 使用后门的LLMs中可被证实的挂旗试验设置污染 2505.23001v1 |
Authors: Yize Cheng, Wenxiao Wang, Mazda Moayeri, Soheil Feizi
Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
开放基准对于评估和推进大型语言模型至关重要,提供了可复现性和透明度。然而,它们的可获取性使其很可能成为测试集污染的目标。在这项工作中,我们提出DyePack,这是一个利用后门攻击来识别在训练期间使用了基准测试集的模型的框架,无需访问模型的损失、logits或任何内部细节。正如银行将染料包混入钞票以标记劫匪一样,DyePack将后门样本混入测试数据,以标记在其上训练过的模型。我们提出了一种有原则的设计,包含多个具有随机目标的后门,使得在标记每个模型时都能精确计算假阳性率(FPR)。这在可证明地防止错误指控的同时,为每个检测到的污染案例提供了有力证据。我们在三个数据集上的五个模型上评估了DyePack,涵盖多项选择和开放式生成任务。对于多项选择题,它使用八个后门成功检测出所有受污染的模型,保证的FPR在MMLU-Pro上低至0.000073%,在Big-Bench-Hard上低至0.000017%。对于开放式生成任务,它具有良好的泛化能力,使用六个后门识别出Alpaca上所有受污染的模型,保证的假阳性率仅为0.127%。
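One way an exact false positive rate could arise from multiple backdoors with stochastic targets is sketched below. The modelling assumptions are mine, not taken from the abstract: each backdoor's target is drawn uniformly and independently from `num_targets` options, an uncontaminated model matches each target independently with probability `1 / num_targets`, and a model is flagged when at least `threshold` backdoors match. Under those assumptions the FPR is a binomial tail. The function name and parameters are hypothetical, and the sketch does not attempt to reproduce the figures quoted in the abstract.

```python
from math import comb


def false_positive_rate(num_backdoors, num_targets, threshold):
    """Binomial tail P[X >= threshold] for X ~ Binomial(num_backdoors, 1/num_targets).

    Assumed model (not confirmed by the abstract): an uncontaminated model
    hits each randomly chosen backdoor target independently with probability
    1 / num_targets, and is flagged once `threshold` or more targets are hit.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


# Requiring more matching backdoors drives the false positive rate down.
for tau in range(1, 9):
    print(tau, false_positive_rate(num_backdoors=8, num_targets=10, threshold=tau))
```

The design point this illustrates is that randomizing the targets makes the null distribution known in closed form, so the accusation threshold can be chosen for a desired FPR rather than estimated empirically.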
Article 215
Title@2025-05-29 (4): Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation
Title: Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation | Prüfen Sie in der Grafik: Entity Disambiguation Enhancement für komplexe Claim-Verifikation mit interaktiver Graphendarstellung | 校验格中:实体对复杂索赔核实与交互式图表代表的分歧增强 2505.22993v1 |
Authors: Hoang Pham, Thanh-Do Nguyen, Khac-Hoai Nam Bui
Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task has become an emerging research issue in the era of large language models (LLMs), since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation: an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation: VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification: the remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further use.
索赔核查是一项长期和具有挑战性的任务,不仅要求核查过程的高度准确性,而且需要解释核查过程。这项任务在大型语言模型时代成为一个新出现的研究问题,因为现实世界的主张往往是复杂的,具有复杂的语义结构或模糊的实体。传统办法通常通过将索赔分为次级主张和查询知识库以解决隐藏或模糊的实体来解决这个问题。然而,这些实体缺乏有效的模糊战略会损害整个核查过程。为了应对这些挑战,我们提议进行Graph(VeGraph)核查(VeGraph),这是一个利用LLLM代理的推理和理解能力的新框架。VeGraph通常分三个阶段运作:(1) Graph – – 投入主张分解成结构化的三重体,形成一个基于图表的表述,将结构化和无结构的信息结合起来。(2) 实体Disambiguation-VeGraph反复地与知识库进行互动,以解决图中模糊的实体更深入的子名核查;(3) 核查(VeGraph)剩下的三个B级框架正在核查,以完成FI-3号的竞争性核查基准,以便完成FIL的比较性核查。
Article 216
Title@2025-05-29 (4): Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition
Title: Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition | Pangu Embedded: Effizienter Dual-System LLM Reasoner mit Metakognition | Pangu 嵌入式:高效的双系统LLM 2505.22375v2 |
Authors: Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, Zhenni Bi, Jinpeng Li, Haoyu Wang, Fei Mi, Mingjian Zhu, Bin Wang, Kaikai Song, Yifei Fu, Xu He, Yu Luo, Chong Zhu, Quan He, Xueyu Wu, Wei He, Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang
This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a “fast” mode for routine queries and a deeper “slow” mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similar-size models such as Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.
这项工作展示了Pangu 嵌入式(LLM) ,这是一个高效的大型语言模型(LLM) 在Ascend神经处理单位(NPU)上开发了高效的大型语言模型(LLM) 解释器,具有灵活的快速和慢思维能力。 Pangu 嵌入式解决了现有推理优化的LLMM中普遍存在的大量计算成本和推导拉力挑战。我们建议了该模型的建设分为两个阶段的培训框架。在第一阶段,该模型通过迭代蒸馏流程进行微调,包括互换模型的合并和有效综合互补知识。随后,在Ascend 集群上强化学习,通过一个耐久性-耐久度-耐久的列表优化列表优化,将同步的参数3 与优先数据队列的同步平行方向结合起来。RLL进程由多源(MARS)的调控系统(MARS) 提供动态、特定的任务奖励信号信号,用于数学、编码和一般解决问题的任务。第2阶段,以“快速”嵌嵌入式的板嵌入式嵌入式嵌入式系统,同时提供常规查询的快速查询和更深的系统, 快速查询和更深的软化的系统化的内,该模型,该模型的系统化的系统化的系统化的系统化的系统化的内,该模型,该模型,该模型提供一个动态的系统,该模型,在复杂的内,用于的自动的系统。
Article 217
Title@2025-05-29 (4): Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems
Title: Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems | Agent-UniRAG: Ein trainierbares Open-Source LLM Agent Framework für Unified Retrieval-Augmented Generation Systems | Agent-UniRAG:面向统一检索增强生成系统的可训练开源LLM代理框架 2505.22571v2 |
Authors: Hoang Pham, Thuy-Duong Nguyen, Khac-Hoai Nam Bui
This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recently emerging large language model (LLM) agent concept. Specifically, LLM agents, which use LLMs as fundamental controllers, have become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework that solves RAG tasks step-by-step based on the complexity of the inputs, handling both single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show performance comparable to closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further use.
本文介绍了使用最近出现的大型语言模型(LLM)代理概念的统一检索-增强一代(RAG)系统的新做法。具体地说,利用LLM作为LLM基本控制器的LLM代理商,LLM代理商已成为一种有希望的方法,使RAG任务,特别是复杂的逻辑问答系统(例如多机会查询)能够解释,然而,以前的工作主要侧重于分别用单点或多点查询方法解决RAG系统,这限制了这些方法在现实世界应用中的应用。在本研究中,我们提议了一个称为Agent-UniRAG的可培训代理商框架,用于统一检索-增强LMM系统的效力和可解释性。主要构想是设计LLM代理商框架,以便根据投入的复杂性逐步解决RAG任务,同时包括单点查询和多点查询。我们还引入SynAgenti-RAG,一个合成数据集,使拟议的小型开放源LMS(AG-AG的可比较性公开基准,Llama-LMS-LAF进一步展示各种公开数据)。
Article 218
Title@2025-05-29 (4): Frankentext: Stitching random text fragments into long-form narratives
Title: Frankentext: Stitching random text fragments into long-form narratives | Frankentext: Zufällige Textfragmente zu langformigen Erzählungen heften | Frankentext: 将随机文本片断成长式叙述 2505.18128v2 |
Authors: Chau Minh Pham, Jenna Russell, Dzung Pham, Mohit Iyyer
We introduce Frankentexts, a new type of long-form narrative produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
我们提出Frankentexts,这是一种由LLM在极端约束下生成的新型长篇叙事,即大多数词元(例如90%)必须从人类作品中逐字复制。这项任务是对可控生成的一项富有挑战性的测试,要求模型满足写作提示、整合不同的文本片段,并仍然产生连贯的叙事。为了生成Frankentexts,我们指示模型通过选择和组合人类撰写的段落来生成草稿,然后在保持用户指定的复制比例的同时迭代修改草稿。我们从三个维度评估所得的Frankentexts:写作质量、指令遵循和可检测性。Gemini-2.5-Pro在这项任务上表现出人意料地好:其81%的Frankentexts是连贯的,100%与提示相关。值得注意的是,其中多达59%的输出被Pangram等检测器误判为人类撰写,揭示了AI文本检测器的局限性。人类标注者有时可以通过片段之间突兀的语气变化和不一致的语法识别Frankentexts,尤其是在较长的生成文本中。除了提出一项具有挑战性的生成任务之外,Frankentexts还引发了关于为这一新的作者身份灰色地带构建有效检测器的讨论,为混合作者身份检测提供了训练数据,并可作为研究人机协同写作过程的沙盒。
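The user-specified copy ratio mentioned above needs some operational definition to be checked. The abstract does not say how it is measured, so the following is one hypothetical definition: the fraction of output tokens covered by at least one verbatim run of `min_span` tokens that also appears in a human-written source. Whitespace tokenization and the `min_span` parameter are assumptions.

```python
def copy_ratio(output, sources, min_span=3):
    """Fraction of output tokens covered by a verbatim span found in `sources`.

    Hypothetical definition (not taken from the paper): a token counts as
    copied if it lies inside any window of `min_span` consecutive output
    tokens that occurs verbatim in at least one source text.
    """
    out_tokens = output.split()
    if not out_tokens:
        return 0.0

    # Index every window of min_span tokens that occurs in a source.
    source_spans = set()
    for source in sources:
        tokens = source.split()
        for i in range(len(tokens) - min_span + 1):
            source_spans.add(tuple(tokens[i:i + min_span]))

    # Mark output tokens that fall inside a matching window.
    covered = [False] * len(out_tokens)
    for i in range(len(out_tokens) - min_span + 1):
        if tuple(out_tokens[i:i + min_span]) in source_spans:
            for j in range(i, i + min_span):
                covered[j] = True
    return sum(covered) / len(out_tokens)


print(copy_ratio("the quick brown fox jumps", ["the quick brown fox"]))  # prints 0.8
```

A revision loop in the spirit of the abstract could recompute this value after each edit and reject revisions that fall below the target ratio.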
Article 219
Title@2025-05-29 (4): Theoretical guarantees on the best-of-n alignment policy
Title: Theoretical guarantees on the best-of-n alignment policy | Theoretische Garantien für die optimale Ausrichtungspolitik | 关于最佳协调政策理论保障 2401.01879v3 |
Authors: Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D’Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh
A simple and effective method for the inference-time alignment and scaling test-time compute of generative models is best-of-$n$ sampling, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n-1)/n$. We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
best-of-$n$采样是一种简单而有效的生成模型推理时对齐和测试时计算扩展方法:从参考策略中抽取$n$个样本,按奖励函数排序,并选择排名最高的样本。文献中常用的一个解析表达式称,best-of-$n$策略与参考策略之间的KL散度等于$\log(n) - (n-1)/n$。我们证明这一说法并不成立,并表明它实际上是真实KL散度的上界。我们还探讨了该上界在不同情形下的紧致性,提出了一种新的KL散度估计量,并通过实验表明它给出了紧密的近似。我们还证明,best-of-$n$策略相对于参考策略的胜率以$n/(n+1)$为上界,并推导了这一刻画的紧致性界限。最后,我们分析了best-of-$n$对齐策略在胜率与KL散度之间的权衡,结果表明在$n < 1000$时即可实现非常好的权衡。
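The two closed-form quantities in the abstract, the KL upper bound $\log(n) - (n-1)/n$ and the win-rate upper bound $n/(n+1)$, can be tabulated directly. The sketch below also estimates the win rate by simulation under an assumption of my own: rewards are i.i.d. continuous (uniform) draws with no ties, in which case the best of $n$ draws beats one fresh draw with probability exactly $n/(n+1)$. Function names are illustrative.

```python
import math
import random


def kl_upper_bound(n):
    """The expression log(n) - (n - 1) / n, which the abstract shows is an upper bound."""
    return math.log(n) - (n - 1) / n


def win_rate_upper_bound(n):
    """Upper bound n / (n + 1) on the best-of-n win rate against the reference policy."""
    return n / (n + 1)


def simulated_win_rate(n, trials=20000, seed=0):
    """Monte Carlo win rate of best-of-n over a single reference sample.

    Assumption: rewards are i.i.d. Uniform(0, 1), so ties have probability zero.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best = max(rng.random() for _ in range(n))
        wins += best > rng.random()
    return wins / trials


for n in (1, 2, 4, 16, 64):
    print(n, kl_upper_bound(n), win_rate_upper_bound(n), simulated_win_rate(n))
```

Reading the columns side by side shows the tradeoff the abstract refers to: the win-rate bound approaches 1 quickly while the KL bound grows only logarithmically in $n$.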
Article 220
Title@2025-05-29 (4): Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs
Title: Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs | Business as Rulesual: Benchmark und Rahmen für Business Rule Flow Modellierung mit LLMs | 业务作为规则:与LLMs建立商业规则流动模式的基准和框架 2505.18542v2 |
Authors: Chen Yang, Ruping Xu, Ruizhe Li, Bin Cao, Jing Fan
Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, BPRF, which contains 50 business process documents with 326 explicitly labeled business rules across multiple domains. Each rule is represented as a <Condition, Action> pair, and we annotate logical dependencies between rules (sequential, conditional, or parallel). We also propose ExIde, a framework for automatic business rule extraction and dependency relationship identification using large language models (LLMs). We evaluate ExIde using 12 state-of-the-art (SOTA) LLMs on the BPRF dataset, benchmarking performance on both rule extraction and dependency classification tasks of current LLMs. Our results demonstrate the effectiveness of ExIde in extracting structured business rules and analyzing their interdependencies for current SOTA LLMs, paving the way for more automated and interpretable business process automation.
采矿工作的目的是发现、监测和优化实际过程的实际行为。虽然先前的工作主要侧重于从教学文本中提取程序行动流,但商业文件中所包含的规则流仍未得到充分探讨。为此,我们推出一个新的中国附加说明数据集BPRF,其中载有50个业务流程文件,其中有326个明确标明的跨多个领域的商业规则,每个规则都以<条件,行动>对标,我们注明规则之间(顺序、有条件或平行)的逻辑依赖性。我们还提议ExIde,一个使用大语言模型(LLLMs)进行自动商业规则提取和依赖关系识别的框架。我们用12个最新工艺级LLMs在BPRF数据集上评估ExIde,对当前LLMS的规则提取和依赖性分类工作的业绩进行基准。我们的结果表明ExIde在提取结构化商业规则和分析当前SOTALMs之间的相互依存性方面的有效性,为更自动化和可解释的业务流程自动化铺平了道路。
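The annotation scheme described above, <Condition, Action> pairs linked by sequential, conditional, or parallel dependencies, maps naturally onto a small graph structure. The sketch below is one possible in-memory representation; the class names, field names, and example rules are hypothetical and not taken from the BPRF dataset.

```python
from dataclasses import dataclass

# The three dependency types named in the abstract.
DEPENDENCY_KINDS = {"sequential", "conditional", "parallel"}


@dataclass(frozen=True)
class BusinessRule:
    """One <Condition, Action> pair (field names are illustrative)."""
    rule_id: str
    condition: str
    action: str


@dataclass(frozen=True)
class Dependency:
    """A directed, typed link between two rules."""
    source_id: str
    target_id: str
    kind: str

    def __post_init__(self):
        if self.kind not in DEPENDENCY_KINDS:
            raise ValueError(f"unknown dependency kind: {self.kind!r}")


def successors(rule_id, dependencies):
    """Return (target_id, kind) pairs for every dependency leaving `rule_id`."""
    return [(d.target_id, d.kind) for d in dependencies if d.source_id == rule_id]


# Invented example rules, for illustration only.
rules = [
    BusinessRule("r1", "order total exceeds the approval limit", "route to manager"),
    BusinessRule("r2", "manager approves", "issue purchase order"),
    BusinessRule("r3", "manager approves", "notify requester"),
]
dependencies = [
    Dependency("r1", "r2", "conditional"),
    Dependency("r1", "r3", "conditional"),
    Dependency("r2", "r3", "parallel"),
]
print(successors("r1", dependencies))
```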
Article 221
Title@2025-05-29 (4): Exploring Scaling Laws for EHR Foundation Models
Title: Exploring Scaling Laws for EHR Foundation Models | Erforschung von Skalierungsgesetzen für EHR-Stiftungsmodelle | 探索EHR基金会模式的扩展法律 2505.22964v1 |
Authors: Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) – a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
比例法的出现深刻地影响了大型语言模型(LLMs)的发展,通过系统地增加模型规模、数据集量和计算量,实现了可预测的绩效收益。然而,这些原则在电子健康记录(EHRs)方面基本上仍未得到探讨,电子健康记录(EHRs)是丰富、连续和全球丰富的数据来源,在结构上与自然语言不同。在这项工作中,我们介绍了对EHR基础模型的尺度法的首次经验性调查。通过对MIMIC-IV数据库的患者时间表数据变压器结构进行不同模型大小和计算预算的培训,我们确定了一致的缩放模式,包括抛射式IsoFLOPs曲线以及计算、模型参数、数据大小和临床效用之间的权力-法律关系。这些结果表明,EHR模型展示了类似于LLMs的行为,为资源高效培训战略提供了预测性见解。我们的成果为开发能够改变临床预测任务和推进个人化保健的强大 EHR基础模型奠定了基础。
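A power-law relationship of the kind mentioned above, $y = a \cdot x^{b}$, becomes a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below shows that standard technique on synthetic numbers; the data points are invented and do not come from the paper, and the function name is illustrative.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b).

    Requires strictly positive xs and ys and at least two distinct xs.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    exponent = covariance / variance
    coefficient = math.exp(mean_y - exponent * mean_x)
    return coefficient, exponent


# Synthetic compute budgets and losses following loss = 3 * compute**-0.5.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [3.0 * c ** -0.5 for c in compute]
a, b = fit_power_law(compute, loss)
print(a, b)
```

With real measurements the points would scatter around the line, and the fitted exponent is what lets one extrapolate expected loss to larger compute budgets.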
Article 222
Title@2025-05-29 (4): ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
Title: ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind | ToMAP: Training Gegner-Bewusst LLM überzeugt mit Theorie des Geistes | ToMAP:培训有思想理论的对抗者软件软件LLM 2505.22961v1 |
Authors: Peixuan Han, Zijia Liu, Jiaxuan You
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent’s thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader’s awareness and analysis of the opponent’s mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent’s current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method’s effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.
大型语言模型(LLMS)在说服方面表现出了大有希望的潜力,但现有的LLM说服者培训工作仍处于初步阶段。 值得注意的是,尽管人类熟练地以主动和动态的方式模拟对手的想法和观点,但目前LLMS与这种Mind理论(TOM)的推理进行斗争,导致多样性和对手认识有限。 为解决这一局限性,我们引入了“心力增强 Persauder”理论(TIMAP),这是一种新颖的方法,通过纳入两个思想理论模块来建立更灵活的说服者工具,提高说服者对对手精神状态的认识和分析。 具体地说,我们首先促使说服者考虑对目标中心要求的可能反对意见,然后使用一个经过培训的MLP分类器编码,以预测对手对这些反差的当前立场。 我们精心设计的强化学习系统使说服者学会如何分析与对手相关的信息并利用这些信息来产生更有效的论据。 实验显示, ToMAPLUA的说服者工具虽然只包含3B参数,但超越了更大的基线,例如GPT-4- 和多层次的理论, 使GPLOD- train 和多层次的推理学的推理学成为了更多的推理学。
Article 223
Title@2025-05-29 (4): LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements
Title: LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements | LLM-basierte HSE Compliance Assessment: Benchmark, Performance und Advancements | 基于LLM的HSE合规评估:基准、业绩和进步 2505.22959v1 |
Authors: Jianwei Wang, Mengqi Wang, Yinsi Zhou, Zhenchang Xing, Qing Liu, Xiwei Xu, Wenjie Zhang, Liming Zhu
Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To address these limitations, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.
健康、安全与环境(HSE)合规评估要求在复杂法规和复杂的人-机-环境交互下进行动态实时决策。虽然大型语言模型(LLMs)在决策智能和上下文对话方面具有巨大潜力,但它们在HSE领域专业知识和结构化法律推理方面的能力仍未得到充分探讨。我们引入了HSE-Bench,这是第一个旨在评估LLM的HSE合规评估能力的基准数据集。HSE-Bench由来自法规、法院案件、安全考试和现场作业视频的1,000多个人工整理的问题组成,并结合了基于问题识别(Issue spotting)、规则回忆(rule Recall)、规则适用(rule Application)和规则结论(rule Conclusion)的IRAC推理流程,以评估整体推理流程。我们对不同的提示策略和10多个LLM进行了广泛评估,包括基础模型、推理模型和多模态视觉模型。结果显示,尽管当前的LLM取得了良好的性能,但其能力主要依赖语义匹配,而不是基于HSE合规背景的原则性推理。此外,它们的原生推理轨迹缺乏严格的HSE合规评估所需的系统性法律推理。为缓解这些问题,我们提出了一种新的提示技术,即专家推理(Reasoning of Expert,RoE),引导LLM模拟不同专家的推理过程进行合规评估,并得出更准确的统一决策。我们希望本研究能够凸显LLM在HSE合规方面的推理差距,并启发相关任务的进一步研究。
Article 224
Title@2025-05-29 (4): Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View
Title: Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View | Enthüllen von Umweltauswirkungen von großsprachigen Modellen: Eine funktionale Einheitsansicht | 大型语文服务模式的不懈环境影响:职能单位观点 2502.11256v2 |
Authors: Yanran Wu, Inez Hua, Yi Ding
Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving’s environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.
大型语言模型(LLMs)提供了强大的能力,但具有重大的环境影响,特别是在碳排放方面。现有的研究为碳排放设定基准,但缺乏对不同模型配置进行比较的标准化基础。为了解决这个问题,我们引入功能单位(FU)的概念,作为标准化的基础,并开发FUEL,这是第一个基于FULU的框架,用于评价LLM的环境影响。通过三个案例研究,我们发现在减少碳排放方面的关键洞察力和取舍,方法是优化模型规模、量化战略和硬件选择,为更可持续的LLM服务铺平道路。该代码可在https://github.com/jojacola/FUEL查阅。
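The idea of comparing serving configurations per functional unit, rather than per raw request, can be illustrated with a small sketch. The choice of functional unit (here, a response that meets some quality bar), the grid carbon intensity, and every name and number below are assumptions made for illustration; none of them come from FUEL itself.

```python
from dataclasses import dataclass


@dataclass
class ServingRun:
    name: str
    energy_kwh: float       # hypothetical measured energy for the whole run
    functional_units: int   # assumed FU: responses that met a quality bar


def carbon_per_fu(run: ServingRun, grid_gco2_per_kwh: float) -> float:
    """Operational emissions (gCO2e) per functional unit delivered."""
    if run.functional_units <= 0:
        raise ValueError("run delivered no functional units")
    return run.energy_kwh * grid_gco2_per_kwh / run.functional_units


# Two made-up configurations compared on the same basis.
runs = [
    ServingRun("large-fp16", energy_kwh=2.0, functional_units=1000),
    ServingRun("small-int8", energy_kwh=0.5, functional_units=400),
]
ranked = sorted(runs, key=lambda r: carbon_per_fu(r, 400.0))
```

Normalizing by delivered units means a configuration that uses less energy but also delivers fewer acceptable responses is not automatically ranked as greener.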
Article 225
Title@2025-05-29 (4): CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Title: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance | CodeSteer: Symbolisch-Augmentierte Sprachmodelle über Code/Text Anleitung | 代码器:通过编码/文本指导的代码/文本指导的代码器:代号辅助语言模式 2502.04350v2 |
Authors: Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
现有方法未能在文本推理和代码生成之间有效引导大型语言模型(LLMs),使得符号计算能力未得到充分利用。我们引入了CodeSteer,这是一种引导LLM代码/文本生成的有效方法。我们构建了一个全面的基准SymBench,由37项复杂度可调的符号任务组成,还合成了包含12k条多轮引导/生成轨迹和5.5k个引导比较对的数据集。我们用新设计的多轮监督微调(SFT)和直接偏好优化(DPO)对Llama-3-8B模型进行了微调。由此得到的模型CodeSteerLLM配合所提出的符号检查器和自答检查器,能够有效引导更大模型的代码/文本生成。用CodeSteer增强GPT-4o后,其平均性能得分从53.3提高到86.4,在全部37项任务(28项已见,9项未见)上甚至超过了现有最佳LLM OpenAI o1(82.7)、o1-preview(74.8)和DeepSeek R1(76.8)。CodeSteer虽然是针对GPT-4o训练的,但表现出出色的泛化能力,在Claude、Mistral和GPT-3.5上平均带来41.8的性能提升。由CodeSteer引导的LLM能够充分利用符号计算,在高度复杂的任务上保持强劲性能。模型、数据集和代码可在https://github.com/yongchao98/CodeSteer-v1.0和https://huggingface.co/yongchao98获取。
Article 226
Title@2025-05-29 (4): LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments
Title: LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments | LLMs für Argument Mining: Erkennung, Extraktion und Beziehungsklassifikation vordefinierter Argumente in Online-Kommentaren | 用于论点挖掘的LLMs:在线评论中预定义论点的检测、提取和关系分类 2505.22956v1 |
Authors: Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.
对围绕堕胎等争议问题的公开讨论进行自动化的大规模分析需要发现和理解论点的使用。虽然大语言模型在语言处理任务中显示出希望,但是它们在具体采矿专题的、网上评论中预先界定的论点仍未得到充分探讨。我们利用由6个两极化专题的2 000多条意见评论组成的数据集,对三个争议采矿任务的4个最先进的LLM项目进行了评估。定量评价表明,这三项任务总体表现良好,特别是大型和精细调整的LM项目,尽管环境成本高昂。然而,详细的错误分析显示,长期和细微的评论和情绪激烈的语言存在系统性缺陷,引起对下游应用的关切,例如内容节制或观点分析。我们的结果突出表明LMS在网上评论中自动分析论据的许诺和当前限制。
Article 227
Title@2025-05-29 (4): Understanding Bias Reinforcement in LLM Agents Debate
Title: Understanding Bias Reinforcement in LLM Agents Debate | Verständnis der Bias-Verstärkung in LLM-Agenten-Debatte | 了解LLLM代理商的强化申请 2503.16814v2 |
Authors: Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun
Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce MetaNIM Arena, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt), a novel framework that (1) refines LLM’s strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that DReaMAD significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
大型语言模型(LLMs)利用提示工程和上下文学习等免训练方法解决复杂问题,但确保推理的正确性仍具有挑战性。虽然自我一致性和自我精炼等自我纠正方法旨在提高可靠性,但由于缺乏有效的反馈机制,它们往往会强化偏见。多智能体辩论(MAD)已作为一种替代方案出现,但我们发现了两个关键局限:一是偏见强化,即辩论放大而非纠正模型偏见;二是缺乏视角多样性,因为所有智能体共享相同的模型和推理模式,限制了辩论的真正效力。为了系统地评估这些问题,我们引入了MetaNIM Arena,这是一个旨在评估LLM在对抗性战略决策中表现的基准,其中动态交互会影响最优决策。为了克服MAD的局限,我们提出了DReaMAD(Diverse Reasoning via Multi-Agent Debate with Refined Prompt),这是一个新框架:(1)精炼LLM的战略先验知识以提高推理质量;(2)通过系统地修改提示,在单一模型内促进多样化观点,从而减少偏见。实证结果表明,DReaMAD在多项战略任务中显著提高了决策准确性、推理多样性和偏见缓解效果,是一种更有效的基于LLM的决策方法。
Article 228
Title@2025-05-29 (4): StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs
Title: StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs | StrucSum: Graph-strukturierte Begründung für lange Dokumentextraktionszusammenfassung mit LLMs | StrucSum: 长文件提取摘要的图表结构化原因与LLMs 2505.22950v1 |
Authors: Haohan Yuan, Sukhwa Hong, Haopeng Zhang
Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
大型语言模型(LLMS)在零速总结中表现良好,但往往难以在长文本中模拟文件结构和识别突出信息。在这项工作中,我们引入了SrucSum,这是一个通过句级图形结构加强LLM推理的无培训提示框架。StrucSum通过三个有针对性的战略将结构信号注入快速信号:邻里软件提示(NAP)针对当地情况,中央软件提示(CAP)针对重要性估计,中央软件提示(CAP)针对有效减少投入。关于ArXiv、PubMed和多新闻的实验表明,SrucSum在不进行任何培训或特定任务调整的情况下,不断提高摘要质量和事实一致性。值得注意的是,在ArXiv,它使事实CC和SummaC增加了19.2和9.7点,表明摘要和源内容之间更加一致。这些研究结果表明,结构觉察力是一种简单而有效的方法,用于与LMSMS零发式采掘合成,不作任何培训或特定任务调整。
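The StrucSum abstract describes building sentence-level graphs and using centrality both to signal importance and to mask low-value input. A minimal sketch of that general idea follows. The bag-of-words cosine similarity, the edge threshold, and the weighted-degree centrality are all assumptions chosen for brevity; the abstract does not say how the paper constructs its graph or scores sentences.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words vector (assumed similarity basis)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sentence_graph(sentences, threshold=0.2):
    """Undirected weighted graph: edge i-j when similarity >= threshold."""
    vecs = [bow(s) for s in sentences]
    adj = {i: {} for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = cosine(vecs[i], vecs[j])
            if w >= threshold:
                adj[i][j] = w
                adj[j][i] = w
    return adj


def centrality(adj):
    """Weighted degree centrality (a stand-in for whatever the paper uses)."""
    return {i: sum(nbrs.values()) for i, nbrs in adj.items()}


def mask_low_centrality(sentences, scores, keep):
    """Keep the `keep` most central sentences, in document order."""
    top = sorted(scores, key=scores.get, reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

In a prompting pipeline, the neighbor lists in `adj` could supply local context for each sentence and the scores could be quoted in the prompt; how StrucSum actually words its prompts is not stated in the abstract.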
Article 229
Title@2025-05-28 (3): NegVQA: Can Vision Language Models Understand Negation?
Title: NegVQA: Can Vision Language Models Understand Negation? | NegVQA: Können Visions-Sprachmodelle Negation verstehen? | NegVQA:视觉语言模式能理解差吗? 2505.22946v1 |
Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs’ negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
由于视觉语言模型(VLMS)继续进步,并被运用在高科技应用中,因此评估其理解否定的能力变得至关重要。为了解决这个问题,我们引入了NegVQA,这是一个视觉问答(VQA)基准,由7,379个两个选项组成,涵盖各种否定情景和图像问题分布。我们通过利用大型语言模型生成现有VQA数据集的否定版本,构建NegVQA。我们评估了7个模式家庭20个最先进的VLMs,我们发现这些模型在否定方面挣扎得非常激烈,与对原始问题的答复相比,表现显著下降。此外,我们发现了一个U形的缩放趋势,在改进之前,模型规模的扩大最初会降低NegVQA的性能。我们的基准揭示了VLMs对否定理解中的重大差距,并对未来的VLM开发提供了深刻的见解。项目网页见https://yuhui-zh15.github.io/NegVQA/。
Article 230
Title@2025-05-28 (3): OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
Title: OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature | OWL: Über die Weltliteratur testet Cross-Lingual Recall von gemerkten Texten | OWL: 通过世界文学对记忆文字进行相互最后回顾 2505.22945v1 |
Authors: Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book’s title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.
众所周知,大型语言模型(LLMs)会记忆并回忆其预训练数据中的英文文本。然而,这种能力在多大程度上能够推广到非英语语言或跨语言迁移仍不清楚。本文研究了LLM中的多语言和跨语言记忆,探究以一种语言(如英语)记忆的内容在以译文形式呈现时能否被回忆起来。为此,我们引入了OWL,这是一个包含31.5K条对齐节选的数据集,取自20本书,涵盖十种语言,包括英文原文、官方译本(越南语、西班牙语、土耳其语)以及六种低资源语言(Sesotho、Yoruba、Maithili、Malagasy、Setswana、Tahitian)的新译本。我们通过三项任务评估不同模型家族和规模的记忆情况:(1)直接探测,要求模型识别书名和作者;(2)名称完形填空,要求预测被遮蔽的人物姓名;(3)前缀探测,即生成续写内容。我们发现,LLM能够跨语言一致地回忆内容,即使是预训练数据中没有直接译文的文本也是如此。例如,在新翻译的节选中,GPT-4o有69%的情况能识别出作者和书名,有6%的情况能识别出被遮蔽的实体。扰动(如遮蔽人物、打乱词序)会适度降低直接探测的准确率(打乱词序的官方译本下降7%)。我们的结果凸显了跨语言记忆的程度,并揭示了不同模型之间的差异。
Article 231
Title@2025-05-28 (3): Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
Title: Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | Können LLMs CLIP täuschen? Benchmarking der adversarialen Kompositionalität vortrainierter multimodaler Repräsentationen über Textaktualisierungen | LLMs能否欺骗CLIP?通过文本更新对预训练多模态表示的对抗性组合性进行基准测试 2505.22943v1 |
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audios.
虽然经过培训的多式联运代表(如CLIP)表现出了令人印象深刻的能力,但它们在组成上表现出明显的弱点,导致反直觉的判断。我们引入了多模式反versarial 构成性(MAC),这是一个利用大语言模型(LLMS)生成欺骗性文本样本的基准,以在不同模式中利用这些弱点,并通过样本式攻击成功率和群体式英特基多样性来评估这些弱点。为了改进零弹式方法,我们建议采用自我培训方法,利用排斥式微调与促进多样性的过滤法相结合,提高攻击性成功率和抽样多样性。我们使用Llama-3.1-8B等较小的语言模型,展示了各种多式联运代表方式(包括图象、视频和音频)在揭示构成脆弱性方面的优异性表现。
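The MAC abstract names two evaluation quantities: a sample-wise attack success rate and a group-wise entropy-based diversity. The sketch below shows one plausible reading of each. Both definitions are assumptions: an attack is counted as successful when the deceptive text scores higher than the original under the target model, and diversity is taken as the Shannon entropy of the whitespace-token distribution over a group of generated texts. The paper may define either differently.

```python
import math
from collections import Counter


def attack_success_rate(orig_scores, adv_scores):
    """Fraction of samples where the deceptive text outscores the original.

    Scores are assumed to be similarity values from the target
    representation (for example, image-text similarity).
    """
    if not orig_scores or len(orig_scores) != len(adv_scores):
        raise ValueError("score lists must be non-empty and equally long")
    wins = sum(adv > orig for orig, adv in zip(orig_scores, adv_scores))
    return wins / len(orig_scores)


def token_entropy(texts):
    """Shannon entropy (bits) of the token distribution across a group."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Reporting both numbers together guards against an attacker that succeeds often but only by reusing a single template.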
Article 232
Title@2025-05-28 (3): WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning
Title: WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning | WorkForceAgent-R1: Förderung der Fähigkeit von LLM-basierten Web-Agenten durch Verstärkungs-Lernen | 工作力量-R1:通过强化学习在基于LLM的网络代理中鼓励 2505.22942v1 |
Authors: Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, Chao Zhang
Web agents empowered by large language models (LLMs) enable the automation of complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
大型语言模型(LLMS)-具有网络动力的网络代理商能够在企业环境中使复杂、实时的网络导航任务自动化;然而,由于在处理网络互动的内在动态性质时缺乏充分的推理能力,依靠监管的微调的现有网络代理商往往难以做到笼统和稳健;在本研究报告中,我们引入了一个基于基于基于规则的R1型强化学习框架的基于LLMS的网络代理商,该网络代理商经过培训,明确旨在加强单一步骤推理和规划面向企业的网络导航任务;我们使用一种结构化奖励功能,既评价对产出格式的遵守情况,又评价行动是否正确;使工作力量-R1能够在没有明确说明或广泛专家演示的情况下隐含地学习稳健的中间推理;关于Arena工作基准的广泛实验表明,WorceAgents-R1大大超出SFT基线10.26-16.59%,在工作场所导向的网络导航任务中实现与以LM为主的代理商(gpt-4o)的竞争性业绩。
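A rule-based reward that checks both output format and action correctness, as the WorkForceAgent-R1 abstract describes, might look roughly like the following. The `<think>`/`<action>` tag layout, the exact-match comparison, and the 0.2/0.8 weights are hypothetical choices made for this sketch, not details from the paper.

```python
import re

# Assumed output layout: reasoning in <think>, one action in <action>.
_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL
)


def reward(output: str, gold_action: str,
           w_format: float = 0.2, w_correct: float = 0.8) -> float:
    """Format reward plus correctness reward for one agent step."""
    match = _PATTERN.match(output.strip())
    if match is None:
        return 0.0                      # malformed output earns nothing
    action = match.group(2).strip()
    correct = w_correct if action == gold_action else 0.0
    return w_format + correct
```

Because only the final action is compared with the reference, the text inside the reasoning tag is never graded directly, which is one way a model could pick up intermediate reasoning without annotated rationales.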
Article 233
Title@2025-05-28 (3): Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs
Title: Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs | Verbesserung der QA-Effizienz mit DistilBERT: Feintuning und Inferenz auf mobilen Intel-CPUs | 利用DistilBERT提高QA效率:在移动Intel CPU上的微调与推理 2505.22937v1 |
Authors: Ngeyen Yinkfu
This study presents an efficient transformer-based question-answering (QA) model optimized for deployment on a 13th Gen Intel i7-1355U CPU, using the Stanford Question Answering Dataset (SQuAD) v1.1. Leveraging exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, the model achieves a validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question. Compared to a rule-based baseline (F1: 0.3124) and full BERT-based models, our approach offers a favorable trade-off between accuracy and computational efficiency. This makes it well-suited for real-time applications on resource-constrained systems. The study includes systematic evaluation of data augmentation strategies and hyperparameter configurations, providing practical insights into optimizing transformer models for CPU-based inference.
本研究提出了一种高效的基于Transformer的问答(QA)模型,使用斯坦福问答数据集(SQuAD)v1.1,针对第13代Intel i7-1355U CPU上的部署进行了优化。通过探索性数据分析、数据增强以及对DistilBERT架构的微调,该模型的验证F1得分达到0.6536,平均每个问题的推理时间为0.1208秒。与基于规则的基线(F1:0.3124)和完整的BERT模型相比,我们的方法在准确率和计算效率之间取得了良好的平衡,因此非常适合资源受限系统上的实时应用。本研究还对数据增强策略和超参数配置进行了系统评估,为优化面向CPU推理的Transformer模型提供了实用见解。
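For readers unfamiliar with how an F1 score is computed for extractive QA, the sketch below follows the token-overlap F1 commonly used with SQuAD v1.1: answers are lowercased, stripped of punctuation and English articles, and compared as bags of tokens. Whether the study above used exactly this procedure is an assumption; the abstract reports only the resulting score.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer span."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is typically the mean over questions, taking the maximum F1 across the available reference answers for each question.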
Article 234
Title@2025-05-28 (3): Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging
Title: Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging | Unraveling LoRA Interferenz: Orthogonale Subräume für robuste Modellzusammenführung | 开放 LoRA 干涉度: 用于强力模型合并的正弦形子空间 2505.22934v1 |
Authors: Haobo Zhang, Jiayu Zhou
Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace prior to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
个人任务大型语言模型(LMS)的微调效果良好,但用于部署和储存的费用昂贵。最近的工作探索了将多重任务特有模型合并成单一的多任务模式的模式,在没有额外培训的情况下,将多个任务特有模型合并成单一的多任务模式。然而,由于低级别适应(LORA)模型的显著性能退化,现有的合并方法往往不能适用于与低级别适应(LORA)相比的微调模型。在本文件中,我们表明,这一问题产生于以前忽视的模型参数和数据分布之间的相互作用。我们提议将Orthogonal子空间子空间用于罗布斯特模型合并(OSRM),以限制LORA子空间prior 进行微调,确保与一项任务有关的更新不会对其它任务产生不利的变化。我们的方法可以与大多数现有的合并算法无缝地结合,减少任务之间的意外干扰。在八个数据集上进行广泛的实验,用三个广泛使用的LMS和两个大LMS进行试验,表明我们的方法不仅能促进性合并性,而且还保存单项精确性。此外,我们的方法显示对超参数合并的超参数显示数据参数相互作用的重要性。这些模型在模型合并的解决办法中显示模型合并中的模型的模型互动中的重要性。
Article 235
Title@2025-05-28 (3): K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction
Title: K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction | K-Paths: Begründung über Graphenpfade für Drogenrepurposing und Drogeninteraktionsvorhersage | K-Paths: 以图解路径为依据进行药物再定位和药物相互作用预测 2502.13344v3 |
Authors: Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh
Biomedical knowledge graphs (KGs) encode rich, structured information critical for drug discovery tasks, but extracting meaningful insights from large-scale KGs remains challenging due to their complex structure. Existing biomedical subgraph retrieval methods are tailored for graph neural networks (GNNs), limiting compatibility with other paradigms, including large language models (LLMs). We introduce K-Paths, a model-agnostic retrieval framework that extracts structured, diverse, and biologically meaningful multi-hop paths from dense biomedical KGs. These paths enable the prediction of unobserved drug-drug and drug-disease interactions, including those involving entities not seen during training, thus supporting inductive reasoning. K-Paths is training-free and employs a diversity-aware adaptation of Yen’s algorithm to extract the K shortest loopless paths between entities in a query, prioritizing biologically relevant and relationally diverse connections. These paths serve as concise, interpretable reasoning chains that can be directly integrated with LLMs or GNNs to improve generalization, accuracy, and enable explainable inference. Experiments on benchmark datasets show that K-Paths improves zero-shot reasoning across state-of-the-art LLMs. For instance, Tx-Gemma 27B improves by 19.8 and 4.0 F1 points on interaction severity prediction and drug repurposing tasks, respectively. Llama 70B achieves gains of 8.5 and 6.2 points on the same tasks. K-Paths also boosts the training efficiency of EmerGNN, a state-of-the-art GNN, by reducing the KG size by 90% while maintaining predictive performance. Beyond efficiency, K-Paths bridges the gap between KGs and LLMs, enabling scalable and explainable LLM-augmented scientific discovery. We release our code and the retrieved paths as a benchmark for inductive reasoning.
生物医学知识图谱(KGs)编码了对药物发现任务至关重要的丰富结构化信息,但由于结构复杂,从大规模KG中提取有意义的见解仍具有挑战性。现有的生物医学子图检索方法是为图神经网络(GNNs)量身定制的,限制了与包括大型语言模型(LLMs)在内的其他范式的兼容性。我们引入了K-Paths,这是一个与模型无关的检索框架,可从稠密的生物医学KG中提取结构化、多样化且具有生物学意义的多跳路径。这些路径可用于预测未观测到的药物-药物和药物-疾病相互作用,包括涉及训练期间未见实体的相互作用,从而支持归纳推理。K-Paths无需训练,采用Yen算法的多样性感知改编版本,提取查询中实体之间的K条最短无环路径,并优先考虑生物学相关且关系多样的连接。这些路径是简洁、可解释的推理链,可直接与LLM或GNN结合,以提高泛化能力和准确性,并实现可解释的推理。在基准数据集上的实验表明,K-Paths提升了最先进LLM的零样本推理能力。例如,Tx-Gemma 27B在相互作用严重程度预测和药物再利用任务上分别提高了19.8和4.0个F1点,Llama 70B在相同任务上分别提高了8.5和6.2个点。K-Paths还将KG规模缩小90%,在保持预测性能的同时提高了最先进GNN模型EmerGNN的训练效率。除效率之外,K-Paths还弥合了KG与LLM之间的鸿沟,使LLM增强的科学发现具备可扩展性和可解释性。我们发布了代码和检索到的路径,作为归纳推理的基准。
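To make "K shortest loopless paths with relational diversity" concrete, here is a small sketch. It is not Yen's algorithm and not the K-Paths implementation: it uses a simpler best-first enumeration of partial paths, which also returns loopless paths in non-decreasing cost order when edge weights are non-negative, but can be exponentially slower on dense graphs. The diversity rule (skip a path whose relation sequence has already been returned) and the toy graph are assumptions for illustration.

```python
import heapq


def k_shortest_simple_paths(graph, source, target, k):
    """Return up to k loopless paths as (cost, nodes, relations).

    graph: {node: [(neighbor, relation, weight), ...]}, weights >= 0.
    Paths whose relation sequence duplicates an earlier result are skipped
    (an assumed stand-in for a diversity-aware selection rule).
    """
    heap = [(0.0, 0, [source], [])]   # (cost, tie-breaker, nodes, relations)
    counter = 1
    found, seen_relation_seqs = [], set()
    while heap and len(found) < k:
        cost, _, nodes, rels = heapq.heappop(heap)
        last = nodes[-1]
        if last == target:
            key = tuple(rels)
            if key not in seen_relation_seqs:
                seen_relation_seqs.add(key)
                found.append((cost, nodes, rels))
            continue
        for nbr, rel, weight in graph.get(last, []):
            if nbr in nodes:          # keep the path loopless
                continue
            heapq.heappush(
                heap, (cost + weight, counter, nodes + [nbr], rels + [rel])
            )
            counter += 1
    return found


# Hypothetical toy knowledge graph.
toy_kg = {
    "drugA": [("geneX", "targets", 1.0), ("geneY", "targets", 1.0)],
    "geneX": [("diseaseD", "associated_with", 1.0)],
    "geneY": [("pathwayP", "participates_in", 1.0)],
    "pathwayP": [("diseaseD", "involved_in", 1.0)],
}
paths = k_shortest_simple_paths(toy_kg, "drugA", "diseaseD", 2)
```

Each returned path can be rendered as a chain of (entity, relation, entity) triples and placed in an LLM prompt as evidence for the queried drug-disease pair.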
Article 236
Title@2025-05-28 (3): How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Title: How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias | Wie Transformer lernen Regelmäßige Spracherkennung: Eine theoretische Studie über Trainingsdynamik und Implizite Bias | 变换人如何学习常规语言识别:关于培训动态和隐含偏见的理论研究 2505.00926v3 |
Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate those theoretical results.
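语言识别任务是自然语言处理(NLP)的基础,并被广泛用于衡量大型语言模型(LLMs)的性能。这些任务在解释Transformer的工作机制方面也发挥着关键作用。在这项工作中,我们关注正则语言识别类别中的两项代表性任务,即“偶数对”(even pairs)和“奇偶校验”(parity check),其目的是判断给定序列中某些子序列的出现次数是否为偶数。我们的目标是通过从理论上分析梯度下降下的训练动态,探索由一个注意力层和一个线性层组成的单层Transformer如何学会解决这些任务。虽然偶数对任务可以由单层Transformer直接解决,但奇偶校验任务需要结合思维链(CoT)来解决:要么将其纳入已在偶数对任务上训练好的Transformer的推理阶段,要么将其纳入单层Transformer的训练过程。对于这两个问题,我们的分析表明,注意力层和线性层的联合训练呈现出两个不同的阶段。在第一阶段,注意力层迅速增长,将数据序列映射为可分的向量。在第二阶段,注意力层趋于稳定,而线性层以对数速度增长,其方向趋近于一个最大间隔超平面,该超平面将注意力层的输出正确地分为正样本和负样本,损失以$O(1/t)$的速率下降。我们的实验验证了这些理论结果。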
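The two recognition tasks named in the abstract can be stated as short reference functions. The concrete definitions below are assumptions based on a common formulation over the alphabet {a, b}: "even pairs" asks whether the adjacent pairs `ab` and `ba` occur an even number of times in total, and "parity check" asks whether the symbol `b` occurs an even number of times. The abstract itself says only that the tasks concern whether certain subsequences occur an even number of times.

```python
def even_pairs(seq: str) -> bool:
    """True if adjacent 'ab' and 'ba' pairs occur an even number of times."""
    transitions = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return transitions % 2 == 0


def parity_check(seq: str, symbol: str = "b") -> bool:
    """True if `symbol` occurs an even number of times."""
    return seq.count(symbol) % 2 == 0


def parity_check_stepwise(seq: str, symbol: str = "b") -> list:
    """Running parity after each token, in the spirit of a chain of thought.

    Each step depends only on the previous state and the current token.
    """
    state, trace = True, []
    for tok in seq:
        if tok == symbol:
            state = not state
        trace.append(state)
    return trace
```

Under this formulation, even pairs reduces to checking whether the first and last symbols match, since every transition flips the current symbol. Parity check has no such shortcut and depends on every token, which is consistent with the abstract's statement that it requires step-by-step reasoning.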
Article 237
Title@2025-05-28 (3): Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning
Title: Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning | Verbesserung der Schlussfolgerungen auf Studienebene aus klinischen Studienpapieren über RL-basierte numerische Begründung | 通过基于RL的数值推理从临床试验文件中提高研究水平的推论 2505.22928v1 |
Authors: Massimiliano Pronesti, Michela Lorandi, Paul Flanagan, Oisin Redmon, Anya Belz, Yufang Hou
Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain knowledge informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with the domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach – using RL to train a small-scale number extraction model – yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
医学系统综述通过汇总多项研究的结果,在循证决策中发挥着关键作用。这一过程自动化的一个核心瓶颈是提取数值证据,并针对特定结局和比较确定研究层面的结论。以往的工作将这一问题视为文本推理任务,即检索相关内容片段并从中推断结论。然而,这类方法往往依赖浅层的文本线索,未能捕捉专家评估背后的数值推理。在这项工作中,我们将该问题视为定量推理问题。我们不是从表层文本推断结论,而是提取结构化的数值证据(如事件数或标准差),并应用基于领域知识的逻辑得出针对具体结局的结论。我们开发了一个由数值数据提取模型和效应估计组件组成的数值推理系统,使推理更准确、更可解释,并与领域专家的原则保持一致。我们使用不同策略训练数值数据提取模型,包括监督微调(SFT)和采用新的数值奖励模型的强化学习(RL)。在CochraneForest基准上评估时,我们表现最佳的方法(使用RL训练小规模数值提取模型)相比基于检索的系统,F1得分的绝对提升最高达21%,并且比参数超过400B的通用LLM高出最多9%。我们的结果展示了推理驱动方法在系统性证据综合自动化方面的前景。
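Once event counts have been extracted from a trial report, an effect estimate can be derived with standard formulas. The sketch below computes a risk ratio with a 95% confidence interval on the log scale and maps it to a study-level conclusion. It illustrates the general idea of an effect-estimate step and is not the paper's component; the conclusion labels and the rule that a confidence interval excluding 1 counts as a significant difference are assumptions.

```python
import math


def risk_ratio(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio (treatment vs control) with a Wald CI on the log scale."""
    if min(events_t, events_c) <= 0 or events_t > n_t or events_c > n_c:
        raise ValueError("needs non-zero event counts within group sizes")
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    low = math.exp(math.log(rr) - z * se)
    high = math.exp(math.log(rr) + z * se)
    return rr, low, high


def conclusion(low, high):
    """Hypothetical three-way label for a dichotomous outcome."""
    if high < 1.0:
        return "fewer events with treatment"
    if low > 1.0:
        return "more events with treatment"
    return "no significant difference"
```

For example, 10/100 events against 20/100 gives a risk ratio of 0.5, yet the interval still crosses 1, so the label is "no significant difference". A model reading only surface text could easily miss this distinction.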
Article 238
Title@2025-05-28 (3): Structured Memory Mechanisms for Stable Context Representation in Large Language Models
Title: Structured Memory Mechanisms for Stable Context Representation in Large Language Models | Strukturierte Speichermechanismen für stabile Kontextdarstellung in großen Sprachmodellen | 在大语言模式中建立结构化内存机制,以稳定地代表大语言模式 2505.22921v1 |
Authors: Yue Xing, Tao Yang, Yijiashun Qi, Minggu Wei, Yu Cheng, Honghui Xin
This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model’s ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.
本文讨论了大型语言模型在理解长期背景方面的局限性; 提议了一个配备长期记忆机制的模型结构,以改善保留和检索各段落和对话旋转之间的语义信息; 模型整合了明确的记忆单位、封闭的写作机制和基于关注的读取模块; 引入了一种忘记功能,以便能够动态更新记忆内容,增强模型管理历史信息的能力; 为进一步提高记忆操作的有效性,研究设计了一个联合培训目标; 将主要任务损失与记忆写作和遗忘方面的制约因素结合起来; 指导模型在任务执行期间学习更好的记忆战略; 多个子任务之间的系统评估表明,模型在文本生成一致性、多方向问题回答的稳定性和交叉文本推理的准确性方面,都具有明显优势; 特别是,模型展示了在长期任务和复杂的问题回答情景中强有力的语义保留和背景一致性; 有效地减轻传统语言模型在处理长期依赖性时通常面临的背景损失和语义流问题; 实验还包括对不同记忆结构的分析、能力大小和控制方法中的拟议关键语言设计结果。 这些结果进一步证实了在设计中的拟议语言有效性机制中的作用。
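Note: the abstract names three components (gated writing, a forgetting function, attention-based reading) without giving equations. The following is a minimal sketch of how such components are commonly combined, not the authors' architecture. The slot layout, the sigmoid gates, and the function names `read_memory` and `update_memory` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_memory(memory, query):
    """Attention-based read: softmax over query-slot similarity, then a weighted sum of slots."""
    weights = softmax([dot(slot, query) for slot in memory])
    dim = len(query)
    return [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]


def update_memory(memory, candidate, write_logits, forget_logits):
    """Gated write with forgetting, applied per slot (assumed form).

    new_slot = (1 - f) * old_slot + g * candidate,
    where f = sigmoid(forget_logit) and g = sigmoid(write_logit).
    """
    updated = []
    for slot, w_logit, f_logit in zip(memory, write_logits, forget_logits):
        g, f = sigmoid(w_logit), sigmoid(f_logit)
        updated.append([(1.0 - f) * s + g * c for s, c in zip(slot, candidate)])
    return updated
```

In a trained model the gate logits would be produced by learned layers, and the joint objective described in the abstract would add penalty terms on the write and forget gates to the task loss. Neither is shown here.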
Article 239
Title@2025-05-28 (3): ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room
Title: ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room | ER-REASON: Ein Benchmark-Datensatz für LLM-basierte klinische Vernunft in der Notaufnahme | ER-REASON:应急室以LLM为基础的临床原因基准数据集 2505.22919v1 |
Authors: Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F. Molina, Ahmed Alaa
Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)–a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis–each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training, and are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.
对大型语言模型(LLMS)进行了广泛的评价,内容涉及基于许可证考试的医学问题回答任务;然而,现实世界评价往往依赖昂贵的人类通知员,而现有的基准往往侧重于很少反映临床推理或医学决定所依据的全部工作流程的孤立任务;在本文件中,我们采用了ER-Reason基准,该基准旨在评价基于LLM的临床推理和应急室的决策(ER)。
Article 240
Title@2025-05-28 (3): Talent or Luck? Evaluating Attribution Bias in Large Language Models
Title: Talent or Luck? Evaluating Attribution Bias in Large Language Models | Talent oder Glück? Bewertung der Attribution Bias in großen Sprachmodellen | 人才或幸运?评价大语言模式中的可归责偏见 2505.22910v1 |
Authors: Chahat Raj, Mahika Banerjee, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
When a student fails an exam, do we tend to blame their effort or the test’s difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs’ attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models’ reasoning disparities channelize biases toward demographic groups.
当学生考试失败时,我们是否倾向于怪罪他们的努力或考试的困难?归因,定义为如何将原因分配给事件结果、塑造观念、强化陈规定型观念和影响决定。社会心理学中的归因理论解释了人类如何通过隐含认知、将原因归咎于内部(如努力、能力)或外部(如任务困难、运气)因素来分配事件责任。LLMS根据人口统计对事件结果的归因具有重要的公平影响。大多数研究探索LLMS的社会偏见的工作都集中在地面一级的协会或孤立的定型观念上。这项工作提出了一个基于认知的偏见评价框架,以确定模型的推理差异如何将偏见引向人口群体。
Article 241
Title@2025-05-28 (3): Conversational Alignment with Artificial Intelligence in Context
Title: Conversational Alignment with Artificial Intelligence in Context | Conversational Alignment mit Künstlicher Intelligenz im Kontext | 与现场人工智能的对调 2505.22907v1 |
Authors: Rachel Katharine Sterken, James Ravi Kirkpatrick
The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers’ design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.
基于大型语言模型的尖端人工智能(AI)对话媒介的发展,提出了关于人类规范、价值观和做法以及AI设计和性能之间关系的重要问题,这一条探讨了AI代理人在对话中与处理背景和共同点的人类交流规范和做法保持一致的含义,并提出了评估开发商设计选择的新框架。我们首先利用关于对话务实的哲学和语言文献来激励一套分化框架,我们称之为CONTEXT-ALIG框架,以与人类交流做法保持对口一致。然后我们建议,目前的大型语言模型(LLM)结构、制约和负担能力可能会对实现全面对口一致施加根本性限制。
Article 242
Title@2025-05-28 (3): VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
Title: VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models | VIGNETTE: Sozial geerdete Bias-Evaluierung für Vision-Language-Modelle | VIGNETTE:社会基础的愿景-语言模型的偏见评价 2505.22897v1 |
Authors: Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
虽然对大型语言模式(LLMS)的偏见问题进行了深入的研究,但对视觉语言模式(VLMS)的类似关注相对较少,现有的VLM偏见研究往往侧重于肖像式图像和性别职业协会,忽视更广泛和更复杂的社会陈规定型观念及其隐含的伤害。这项工作引入了VIGNETTE(VGNETTE),这是一个具有30M+图像的大规模VQA基准,通过一个包含四个方向的问答框架来评价VLM的偏见:事实质量、观念、陈规定型观念和决策。除了狭隘的研究外,我们评估VLMS如何在背景环境中解释身份,揭示模型如何产生特质和能力假设以及展示歧视模式。我们从社会心理学中研究VLMS如何将视觉身份信号与基于特征和角色的推断联系起来,通过偏差选择将社会等级编码。我们的调查结果揭示了隐蔽、多面和令人惊讶的陈规定型模式,对VLMS如何从投入中构建社会意义提供了深刻的见解。
Article 243
Title@2025-05-28 (3): When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
Title: When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy | Wenn Modelle Grund in Ihrer Sprache: Kontrollieren Denken Trace Language kommt auf Kosten der Genauigkeit | 当模型在您语言中的原因:控制思考追踪语言以准确性为代价时 2505.22888v1 |
Authors: Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user’s language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
最近有思维痕迹的大型理性模型(LRM)在英语推理任务上表现良好,但是,他们以其他语言进行思考的能力研究较少。这种能力与真实世界应用的答案准确性同样重要,因为用户只有在用自己的语言表示时才能发现推理对监督有用。我们根据XResoning基准对LRM的两个主要家族进行全面评估,发现即使是最先进的模型也经常恢复到英语,或以其他语言提出支离破碎的推理,暴露出多语推理方面的巨大差距。迅速的干预,迫使模型在用户语言中理性地提高可读性和监督性,但降低回答的准确性,暴露出一个重要的交易。我们进一步表明,仅以100个例子进行有针对性的职位培训可以缓解这种不匹配,尽管一些准确性损失仍然存在。我们的成果突出了目前的LRMM的有限多语言推理能力和未来工作的方向。代码和数据可在https://github.com/Betswish/mCot-XResoning查阅。
Article 244
Title@2025-05-28 (3): Enhancing Retrieval for ESGLLM via ESG-CID – A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS
Title: Enhancing Retrieval for ESGLLM via ESG-CID – A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS | Verbesserung der Retrieval für ESGLLM über ESG-CID – Ein Disclosure Content Index Finetuning Datensatz für die Mapping GRI und ESRS | 通过ESG-CID – – 用于测绘GRI和ESRS的披露内容指数微调数据集,加强ESGLLM的检索 2503.10674v2 |
Authors: Shafiuddin Rehan Ahmed, Ankit Parag Shah, Quan Hung Tran, Vivek Khetan, Sukryool Kang, Ankit Mehta, Yujia Bao, Wei Wei
Climate change has intensified the need for transparency and accountability in organizational practices, making Environmental, Social, and Governance (ESG) reporting increasingly crucial. Frameworks like the Global Reporting Initiative (GRI) and the new European Sustainability Reporting Standards (ESRS) aim to standardize ESG reporting, yet generating comprehensive reports remains challenging due to the considerable length of ESG documents and variability in company reporting styles. To facilitate ESG report automation, Retrieval-Augmented Generation (RAG) systems can be employed, but their development is hindered by a lack of labeled data suitable for training retrieval models. In this paper, we leverage an underutilized source of weak supervision – the disclosure content index found in past ESG reports – to create a comprehensive dataset, ESG-CID, for both GRI and ESRS standards. By extracting mappings between specific disclosure requirements and corresponding report sections, and refining them using a Large Language Model as a judge, we generate a robust training and evaluation set. We benchmark popular embedding models on this dataset and show that fine-tuning BERT-based models can outperform commercial embeddings and leading public models, even under temporal data splits for cross-report style transfer from GRI to ESRS. Data: https://huggingface.co/datasets/airefinery/esg_cid_retrieval
气候变化使组织做法更加需要透明度和问责制,使环境、社会和治理报告变得日益至关重要。全球报告倡议(GRI)和新的欧洲可持续性报告标准(ESRS)等框架旨在将ESG报告标准化,然而,由于ESG文件篇幅过长,公司报告风格也各不相同,因此生成全面报告仍具有挑战性。为了便利ESG报告自动化,可采用Retrieval-Auged Games(RAG)系统,但由于缺少适合培训检索模型的标签数据,这些系统的发展受到阻碍。在本文中,我们利用一个利用不足的薄弱监管来源 – – 过去ESG报告中发现的披露内容指数 – – 为GRI和ESRS标准创建一套全面的数据集(ESG-CID)。通过在具体披露要求和相应报告章节之间绘制地图,并利用大语言模型进行完善,我们制作了强有力的培训和评价。我们为基于该数据集的流行嵌入模型制定基准,并显示基于BERTR的模型可以超越商业嵌入和领先的公共模型,甚至在ESSG-CREDR/A格式下,甚至根据时间数据格式进行。
Article 245
Title@2025-05-28 (3): GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification
Title: GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification | GateNLP bei SemEval-2025 Aufgabe 10: Hierarchische Drei-Schritt-Prompte für mehrsprachige Narrative Klassifizierung | SemEval-2025任务10:三级三级三级促进多种语文叙事分类 2505.22867v1 |
Authors: Iknoor Singh, Carolina Scarton, Kalina Bontcheva
The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
在线新闻的泛滥和错误信息的日益扩散要求采取强有力的自动数据分析方法。叙述性分类正在成为一项重要任务,因为确定所谓的在线分类对于进行事实检查者、政策标志和从事信息研究的其他专业人员至关重要。本文介绍了我们对SemEval 2025任务10 Subtask 2的处理办法,该办法旨在将新闻文章分类成一个预先界定的两种层次的主要叙述和多语种子叙述分类。我们提议为多语种叙事分类提供分级三分制提示(H3Prompt)。我们的方法遵循三步大语言模型(LLM)的提示战略,即模型首先将一篇文章分为两个领域之一(乌克兰-俄罗斯战争或气候变化),然后确定最相关的主要叙述,最后分配子叙述。我们的方法是在全世界28个相互竞争的小组中确保英语测试的顶级位置。代码见https://github.com/GateNP/H3Prompt。
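Note: the three steps described above (domain, then main narratives, then sub-narratives) can be sketched as a simple pipeline. This is an illustrative outline, not the code from the linked repository. The `ask` callable is a hypothetical wrapper around an LLM call, the prompt wording is invented, and any taxonomy passed in is supplied by the caller.

```python
from typing import Callable, Dict, List, Tuple

# Two-level taxonomy per domain: domain -> main narrative -> sub-narratives.
Taxonomy = Dict[str, Dict[str, List[str]]]

# Hypothetical LLM wrapper: given a prompt and the allowed options,
# it returns the subset of options the model selected.
Ask = Callable[[str, List[str]], List[str]]


def classify(article: str, taxonomy: Taxonomy, ask: Ask) -> Tuple[str, Dict[str, List[str]]]:
    """Run three prompting steps, each restricted to the options left by the previous step."""
    # Step 1: choose a single domain.
    domain = ask(f"Which domain does this article belong to?\n{article}", list(taxonomy))[0]

    # Step 2: choose the main narratives available within that domain.
    mains = ask(f"Which main narratives appear?\n{article}", list(taxonomy[domain]))

    # Step 3: for each selected main narrative, choose its sub-narratives.
    result: Dict[str, List[str]] = {}
    for main in mains:
        prompt = f"Which sub-narratives of '{main}' appear?\n{article}"
        result[main] = ask(prompt, taxonomy[domain][main])
    return domain, result
```

Narrowing the option list at each step keeps every prompt short and prevents the model from assigning a sub-narrative whose parent narrative was not selected.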
Article 246
Title@2025-05-28 (3): Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge
Title: Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge | Große Sprachmodelle für Depressionserkennung in gesprochener Sprache Integrieren Psychologisches Wissen | 口语结合心理知识中承认抑郁症的大语言模式 2505.22863v1 |
Authors: Yupei Li, Shuaijie Shao, Manuel Milling, Björn W. Schuller
Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The codes are available at https://github.com/myxp-lyp/Depression-detection.git
在公共讨论和AI研究中,人们日益关注抑郁症,在公共讨论和AI研究中,人们日益关注抑郁症问题。虽然深层神经网络(DNNS)已被用于承认,但它们仍然缺乏现实世界的效能。大型语言模型(LLMS)具有巨大的潜力,但需要针对特定领域的微调和与非文字提示的斗争。由于抑郁症往往通过声音和行为而不是明确的文字来表达,仅依靠语言是不够的。诊断性准确性也受到影响,而没有纳入心理专业知识。为了解决这些局限性,我们根据我们的知识,利用DAIC-WOZ数据集,首次将LMS应用于多式抑郁症检测。我们利用预先培训的Wav2Vec模型提取了音频功能,并将其绘制成基于文本的LMS,供进一步处理。我们还提出了一项新战略,将心理知识纳入LMS,以提高诊断性能,特别是使用问答套件将授权知识授予LMSLMS。我们的方法使得与相关原始文件提出的基准分数相比,在中均绝对错误(MAE)和根中取得了显著的改善。代码可在 http://giplybuction/Depression/Descrection/comcomcommion上查阅。
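Note: for readers unfamiliar with the two metrics reported above, the sketch below computes MAE and RMSE from their standard definitions. The function names are illustrative. No scores from the paper are reproduced here.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Because RMSE squares each error before averaging, a few large mistakes raise it more than they raise MAE, which is why the two are usually reported together.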
Article 247
Title@2025-05-28 (3): NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
Title: NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding | NGPU-LM: GPU-beschleunigtes N-Gram-Sprachenmodell für Kontext-Biasing in Greedy ASR-Dekodierung | NGPU-LM: 加速GPU-加速型N-Gram语语模式,用于在贪婪ASR标记中进行背景切换 2505.22857v1 |
Authors: Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
在自动语音识别(ASR)中,统计 n 克语言模型被广泛用于背景偏向任务;然而,由于平行化程度低,现有实施缺乏计算效率,使得背景偏向较少吸引工业使用;这项工作重新思考了统计 n克语言模型的数据结构,以便能够快速和平行地操作GPU-优化的推理;我们称为NGPU-LM的方法,为所有主要的ASR模型类型引入了可定制的贪婪解码,包括传输者、注意编码-解码模型和CTC,计算间接费用不到7%;拟议的方法可以消除贪婪之间50%的准确差距,并针对外表外情况进行搜索,同时避免因波束搜索而导致的显著减速;拟议的NGPU-LM的实施是公开的。
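Note: context-biasing with an n-gram LM during greedy decoding is often done by shallow fusion, where the LM log-probability is added to the ASR log-probability with a weight before taking the argmax. The sketch below shows one such decoding step on the CPU with plain dictionaries. It does not reproduce NGPU-LM's GPU data structures, and the name `greedy_step` and the weight `alpha` are assumptions for illustration.

```python
import math


def greedy_step(asr_log_probs, lm_log_probs, alpha):
    """Return the token maximizing asr_log_prob + alpha * lm_log_prob.

    Both arguments map candidate tokens to log-probabilities for the current
    step; alpha = 0 recovers plain greedy decoding without biasing.
    """
    best_token, best_score = None, -math.inf
    for token, asr_lp in asr_log_probs.items():
        score = asr_lp + alpha * lm_log_probs[token]
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```

The acoustic model's preferred token can lose to a domain term when the biasing LM assigns the domain term a much higher probability. That is the mechanism by which biasing can narrow the gap to beam search while keeping a single hypothesis.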
Article 248
Title@2025-05-28 (3): LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference
Title: LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference | LiTEx: Eine linguistische Taxonomie von Erklärungen zum Verständnis von Inner-Label-Variation in natürlicher Sprach-Inferenz | LiTEx:用语言对解释进行分类,以了解在标内对自然语言推断的变异的理解 2505.22848v1 |
Authors: Pingjun Hong, Beiduo Chen, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank
There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.
越来越多的证据表明,在自然语言推断(NLI)中,人类标签变异(HLV)是人类标签变异(HLV)的证明,在自然语言推断(NLI)中,说明者为同一假设假冒标签指定了不同的标签。然而,在标签内部变异(HLV)的情况下,说明者同意同一标签,但提供不同的推理,这是另一个又大多被忽视的挑战。一些NLI数据集在NLI项目中包含突出的词作为解释,但是由于各种原因,可以突出显示NLI项目上的相同范围,如:自由文本解释为说明者提供了解释的窗口。为了系统理解这一问题,并深入了解NLIII标签背后的理由,我们引入了LITEX(LITEX)内部的变异(LITEX),这是一个语言知情的分类,用于对自由文本解释进行分类的分类的分类。使用这种分类,我们注意到电子-SNLI数据集的一组,验证分类的可靠性,分析它如何与NLI的标签标签标签标签标签标签的标签、亮度和解释方法之间的差别。我们只能通过语言上的分类或方向解释来解释。
Article 249
Title@2025-05-28 (3): ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
Title: ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts | ASTPrompter: Präferenzorientiertes Automatisiertes Sprachmodell Red-Teaming zur Generierung von Low-Perplexity-Unsicheren Prompts | ASTPrompter:为产生低重复性不安全提示而建立首选统一自动语言示范红队 2407.09447v4 |
Authors: Amelia F. Hardy, Houjun Liu, Bernard Lange, Duncan Eddy, Mykel J. Kochenderfer
Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an efficient frontier between ASR and perplexity, highlighting perplexity as a previously under-considered factor in red-teaming.
现有LLM 红色组合方法将高攻击成功率列为优先事项,往往导致高复杂度提示。 这一重点忽略了更难过滤、更可能在良性使用期间发生的低复杂攻击,以及作为负面下游培训实例的影响更大。 作为回应,我们引入了ASTPrompter, 这是一种单步优化方法,使用对比偏好学习来训练攻击者,以保持低复杂度,同时实现高攻击成功率。 ASTPrompter在Llama-8.1B上取得了5.1倍的攻击成功率,而使用根据冻结的LLM,袭击成功率是可能发生的2.1倍。 此外,我们在黑箱和白箱环境中向Mistral-7B、Quen-7B和TinyLlama转移了袭击次数。最后,通过调整我们的方法中的单一超参数,我们发现在ASR与易懂性之间的有效边界上成功发现攻击前缀,突出的易混淆性是红洞中先前考虑不足的一个因素。
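Note: perplexity, the quantity the method keeps low, is the exponential of the negative mean token log-probability under a language model. The sketch below computes it from a list of per-token log-probabilities. How those log-probabilities are obtained from the frozen LLM is outside this sketch, and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A prompt whose tokens each receive probability 0.5 has perplexity 2. Lower values mean the model finds the text more likely, which is why low-perplexity attack prompts are harder to separate from benign input with a perplexity filter.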
Article 250
Title@2025-05-28 (3): Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Title: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation | Bayesian Attention Mechanism: Ein probabilistisches Framework für die Positionskodierung und Kontextlängen-Extrapolation | Bayesian注意机制:定位编码和背景长度外推概率框架 2505.22842v1 |
Authors: Arthur S. Bianchessi, Rodrigo C. Barros, Lucas S. Kupssinskü
Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
以变换器为基础的语言模型依靠位置编码(PE)来处理象征性订单和支持背景外推法,然而,现有的PE方法缺乏理论清晰度,依赖有限的评价指标来证实其外推法要求,我们提议采用巴伊西亚注意机制(BAM),这是一个理论框架,将位置编码作为一种概率模型的先行。BAM统一了现有方法(如NOPE和ALiBi),并激励了一个新的通用高斯定位先行,大大改进了长文本的概括化。 简而言之,BAM能够以500美元的速度准确检索信息,培训背景长度比以往最先进的背景时间长度总化要长,同时保持可比较的多变性,并引入最低限度的额外参数。
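Note: one way to read "positional encoding as a prior" is that a log-prior over token distance is added to the content logits before the softmax. The sketch below follows that reading with an unnormalized zero-mean generalized Gaussian, whose log-density is -(|d| / alpha) ** beta. The parameter names `alpha` and `beta` and the exact parameterization are assumptions, and the paper's formulation may differ.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def generalized_gaussian_log_prior(distance, alpha, beta):
    """Unnormalized log-density of a zero-mean generalized Gaussian."""
    return -((abs(distance) / alpha) ** beta)


def attention_weights(content_logits, query_pos, alpha, beta):
    """Attention for one query: softmax(content logit + log positional prior).

    content_logits[j] is the query-key score for key position j.
    """
    logits = [
        score + generalized_gaussian_log_prior(query_pos - j, alpha, beta)
        for j, score in enumerate(content_logits)
    ]
    return softmax(logits)
```

Under this parameterization, beta = 1 gives a penalty that is linear in distance, the same shape as an ALiBi bias with slope 1 / alpha. A very large alpha flattens the prior toward uniform, which corresponds to using no positional bias.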
Article 251
Title@2025-05-28 (3): The Aloe Family Recipe for Open and Specialized Healthcare LLMs
Title: The Aloe Family Recipe for Open and Specialized Healthcare LLMs | Das Aloe-Familienrezept für offene und spezialisierte LLMs im Gesundheitswesen | 开放和专门保健的Aloe家庭食堂 2505.04388v2 |
Authors: Dario Garcia-Gasulla, Jordi Bayarri-Planas, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés
Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.
目的:随着在保健方面大语言模型(LLMS)的进步,需要建立竞争性开放源码模型,以保护公众利益;这项工作通过优化数据预处理和培训的关键阶段,为开放医疗LLMS领域作出贡献,同时展示如何改善模型安全(通过DPO)和功效(通过RAG),所使用的评价方法包括四种不同类型的测试,为实地确定了新的标准;由此得出的模型,经证明与最佳私人替代品竞争后,以开放许可证发放;方法:在Llama3.1和Qwen 2.5等强力基础模型之上,Aloe Beta使用定制数据集,用合成的理论实例加强公共数据;这些模型与Direct Portimation(通过DPOPO)和效力(通过RAG)保持一致,同时强调道德和符合政策的业绩;评价包括近距离、开放、安全和人文评估,以尽量提高结果的可靠性;结果:在整个输油管道上提出建议,由Alooe家庭公司可靠业绩支持;这些模型在医疗保健基准和医学领域提供竞争性业绩,高额评估,同时由专业人员进行高额报告。
Article 252
Title@2025-05-28 (3): What Has Been Lost with Synthetic Evaluation?
Title: What Has Been Lost with Synthetic Evaluation? | Was wurde mit synthetischer Bewertung verloren? | 合成评价失去了什么? 2505.22830v1 |
Authors: Alexander Gill, Abhilasha Ravichander, Ana Marasović
Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.
大型语言模型(LLMS)越来越多地用于数据生成。然而,建立评估基准会提高这一新兴范例的难度。基准必须针对特定现象,惩罚利用捷径,并具有挑战性。通过两个案例研究,我们调查LMS是否能够通过产生推理超文本基准,并将这些基准与通过仔细的众包创建的数据进行比较,从而满足这些需求。具体地说,我们评估LLM生成的两个高质量阅读理解数据集版本的有效性和难度:CondaQA,它评估否定的推理,DROP,它针对数量进行推理。我们发现,促使LMS能够产生这些数据集的变体,这些变体通常根据批注准则有效,其成本是最初众包工程成本的一小部分。然而,我们表明,对LMs来说,它们对LMS的难度小于其人为的对应方。我们发现,通过LMS生成评价数据可能损失了什么,对立即使用这种日益流行的设定基准的方法进行批判性评估。
Article 253
Title@2025-05-28 (3): Self-Critique and Refinement for Faithful Natural Language Explanations
Title: Self-Critique and Refinement for Faithful Natural Language Explanations | Selbst-Kritik und Raffinesse für treue natürliche Spracherklärungen | 忠实自然语言自我简化和完善解释 2505.22823v1 |
Authors: Yingming Wang, Pepa Atanasova
With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model’s actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations – specifically, post-hoc NLEs – through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline – an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
随着大型语言模型(LLMs)的迅速发展,自然语言解释(NLEs)对于理解模型预测变得日益重要。然而,这些解释往往不能忠实地代表模型的实际推理过程。虽然现有工作表明LLMs能够自我精选和完善其各项任务的初步产出,但这一能力仍没有用来改进解释的忠实性。为了弥补这一差距,我们引入了自我精选和精炼自然语言解释(SR-NLE),这一框架使模型能够通过一个没有外部监督的迭代批评和完善过程来改进其自身解释的忠实性 – – 具体地说,Hoc NLEs后NLE(NLE) – – 从而通过一个互动的批评和完善过程 – – 而不是外部监督,这些框架利用不同的反馈机制来指导完善过程,包括自然语言自我反省,特别是基于突出重要投入词的特性的新型反馈方法。我们从三个数据集和四个最先进的LLMs(SR-NLE)的实验表明,SR-NLE(NLE)将显著地降低不忠心率,我们的最佳方法是360.22%,比54.81%,实际上只能反映18.79的精确推理的推理。这些推理。通过适当的推理。这些推理,只有18.
Article 254
Title@2025-05-28 (3): Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model
Title: Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model | Vergleich menschlicher und KI-Rater-Effekte mit dem Multi-Facet-Rasch-Modell | 使用多面 Rasch 模型比较人类和AI Rater效应 2505.18486v2 |
Authors: Hong Jiao, Dan Song, Won-Chan Lee
Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence on which LLM produces the most reliable scores and induces the fewest rater effects needs to be collected before LLMs are used for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, as well as DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach’s alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. In general, the results supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better rater reliability, and fewer rater effects.
广泛探索了大型语言模型(LLMS),以便在低比例评估中实现自动评分,以便利学习和教学;在实际使用LLMS进行自动评分之前,需要收集与LLMM产生最可靠分数并产生最低率效应的有关经验证据;这项研究比较了十个LMS(ChatGPT 3.5,ChatGPT 4,ChatGPT 4, OpenAI o1, Claude 3.5 Sonnet, Gemini1.5, Gemini 2.0, 以及 DeepSeek V3, 和 DeepSeek R1),与人类专家评分员在两种写作任务中进行评分。LLMS的整体性和分析性评分与人类评分的准确度相比,用Quaudraticatic Kappa进行了评价;用Cronbach Alpha比较了LMS的局内一致性;用MLMs的拉特效应与使用Ms的人类评分器比较。
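The agreement metric named in the abstract above, Quadratic Weighted Kappa, can be sketched in a few lines. This is a minimal illustration of the standard formula, not the authors' scoring code; the function name, the 1 to 4 score range, and the example scores are assumptions made here for demonstration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two lists of integer scores.

    Disagreements are penalised by the squared distance between the two
    scores, normalised by the largest possible squared distance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave one identical constant score; treated here as
        # perfect agreement by convention.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 rubric (illustrative only).
human = [1, 2, 3, 4, 4, 2]
llm = [1, 2, 3, 4, 3, 2]
print(round(quadratic_weighted_kappa(human, llm, min_score=1, max_score=4), 3))
```

A value of 1 indicates identical scores, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.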
Article 255
Title@2025-05-28 (3): Toward universal steering and monitoring of AI models
Title: Toward universal steering and monitoring of AI models | Zur universellen Steuerung und Überwachung von KI-Modellen | 实现对AI 模式的普遍指导和监测 2502.03708v2 |
Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin
Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of concepts, we find that newer, larger models are more steerable and steering can improve model capabilities beyond standard prompting. We show how concept representations are effective for monitoring misaligned content (hallucinations, toxic content). We demonstrate that predictive models built using concept representations are more accurate for monitoring misaligned content than using models that judge outputs directly. Together, our results illustrate the power of using internal representations to map the knowledge in AI models, advance AI safety, and improve model capabilities.
现代人工智能模型包含大量的人类知识,然而,对于这些知识的内部代表性的理解仍然难以实现。这种代表性的结构和特性将会导致模型能力和有效保障措施的发展。根据特征学习的最新进展,我们开发了一种有效、可扩展的方法,在大型人工智能模型(语言模型、视觉语言模型和推理模型)中提取一般概念的线性表述;我们展示了这些表述如何使模型指导成为模型指导,通过这些模型我们暴露了脆弱性,减轻了错误行为,并提高了模型能力。此外,我们还表明,概念表述在人类语言之间非常可转让,可易燃,以便能够进行多概念指导。通过对数百个概念进行定量分析,我们发现较新的、更大的模型更易于指导,指导可以提高模型能力,超出标准的提示范围。我们展示了概念表述如何有效地监测不一致的内容(含盐度、有毒内容)。我们证明,使用概念表述构建的预测模型比直接判断产出的模型更准确。我们的成果表明,利用内部表述来绘制AI模型知识、推进AI安全、提高模型能力的力量。
Article 256
Title@2025-05-28 (3): First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
Title: First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay | Erste Schritte auf dem Weg zu mithörenden LLM-Agenten: Eine Fallstudie mit Dungeons & Dragons Gameplay | 迈向旁听式LLM代理的第一步:以Dungeons & Dragons游戏为例的案例研究 2505.22809v1 |
Authors: Andrew Zhu, Evan Osgood, Chris Callison-Burch
Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call “overhearing agents”. These overhearing agents do not actively participate in conversation – instead, they “listen in” on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.
在直接帮助人类用户执行任务的对话性LLM代理商方面,已经做了大量工作。我们提出了与LLM代理商互动的替代模式,我们称之为“监听代理商”。这些监听代理商没有积极参与对话,相反,他们“倾听”人与人之间的谈话,执行背景任务或提供帮助用户的建议。在这项工作中,我们通过Dungeons & Dongs游戏剧本的镜头,探索偷听代理商的范例。我们用大型多式语言模型作为监听代理商进行一项深入研究,以协助Dungeon大师。我们进行了一项人类评估,以检查这些代理商的有用性,发现一些大型的音频模型具有利用隐含的音提示进行监听代理商任务的新能力。最后,我们释放了Python图书馆和我们的项目代码,以支持在https://github.com/zhudotexe/overhearing_agents进行关于监听代理商范例的进一步研究。
Article 257
Title@2025-05-28 (3): Towards a More Generalized Approach in Open Relation Extraction
Title: Towards a More Generalized Approach in Open Relation Extraction | Auf dem Weg zu einem allgemeineren Ansatz bei der Förderung offener Beziehungen | 争取在开放关系采掘中采取更加普遍的做法 2505.22801v1 |
Authors: Qing Wang, Yuepei Li, Qiao Qiao, Kang Zhou, Qi Li
Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.
开放关系抽取(OpenRE)旨在在没有预定义关系模式的情况下,从未标注数据中识别并抽取命名实体之间的新关系事实。传统的OpenRE方法通常假设未标注数据仅由新关系组成,或已被预先划分为已知实例和新实例。然而,在现实场景中,新关系是任意分布的。本文提出一种广义的OpenRE设定,将未标注数据视为已知实例与新实例的混合。为此,我们提出MixORE,这是一个两阶段框架,将关系分类与聚类相结合,以联合学习已知关系和新关系。在三个基准数据集上的实验表明,MixORE在已知关系分类和新关系聚类方面均持续优于有竞争力的基线。我们的研究结果有助于推进广义OpenRE研究及其实际应用。
Article 258
Title@2025-05-28 (3): Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
Title: Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning | Instruct-SkillMix: Eine leistungsstarke Pipeline für LLM Instruction Tuning | 指令- SkillMix: 用于LLM 指令导导图的强大管道 2408.14774v4 |
Authors: Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora
We introduce Instruct-SkillMix, an automated approach for creating diverse, high quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core “skills” for instruction-following by directly prompting the model. This is inspired by “LLM metacognition” of Didolkar et al. (2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. The estimated cost of creating the dataset is under $600. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0, a level similar to frontier models like Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. In our dataset, adding 20% low quality answers (“shirkers”) causes a noticeable degradation in performance. The Instruct-SkillMix pipeline seems flexible and adaptable to other settings.
我们提出Instruct-SkillMix,这是一种为指令遵循创建多样化、高质量SFT数据的自动化方法。该流程包括两个阶段,每个阶段都利用一个现有的强大LLM:(1)技能提取:通过直接提示模型,使用LLM提取指令遵循的核心“技能”,其灵感来自Didolkar等人(2024)的“LLM元认知”;(2)数据生成:使用强大的LLM生成(指令,回答)数据,这些数据体现随机选取的一对技能。这里,随机技能组合的使用提升了多样性和难度。创建该数据集的估计成本低于600美元。在Instruct-SkillMix生成的数据上进行普通SFT(即不使用PPO、DPO或RL方法),可在AlpacaEval 2.0、MT-Bench和WildBench等指令遵循基准上带来显著提升。仅用4K个样本,LLaMA-3-8B-Base在AlpacaEval 2.0上就达到42.76%的长度控制胜率,与Claude 3 Opus和LLaMA-3.1-405B-Instruct等前沿模型水平相当。消融研究还提示了通过简单众包创建开放指令微调数据集之所以困难的可能原因:在我们的数据集中,加入20%的低质量回答(“推诿者”)会导致性能明显下降。Instruct-SkillMix流程看起来灵活,并可适应其他场景。
Article 259
Title@2025-05-28 (3): SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
Title: SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains | SequentialBreak: Große Sprachmodelle können durch Einbetten von Jailbreak Prompts in Sequential Prompt Chains ausgeblendet werden | 顺序式布雷克:大语言模型可以通过将破狱线索嵌入顺序式提示链来蒙骗大语言模型 2411.06426v3 |
Authors: Bijoy Ahmed Saiem, MD Sadik Hossain Shanto, Rakib Ahsan, Md Rafi ur Rashid
As the integration of the Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defense of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and prompt iterative optimization to conceal malicious prompts. In particular, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, not limited to examples like Question Bank, Dialog Completion, and Game Environment, where the harmful prompt is embedded within benign ones that can fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain of attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All the result files and website associated with this research are available in this GitHub repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.
随着大语言模型(LLMs)融入各种应用的整合程度的提高,它们容易被滥用,从而引起严重的安全关切。许多破狱袭击被提出来评估LLMs的安全防护。目前,越狱袭击主要依靠情景伪装、迅速混淆、迅速优化和迅速迭代优化,以掩盖恶意的提示。特别是,单个查询中的连续快速链可以引导LLMs关注某些提示,而忽视其他提示,促进背景操作。本文介绍SqertialBreak,这是一次利用这一脆弱性的新的越狱袭击。我们讨论了多种情景,不限于问题库、 Dialog 完成和游戏环境等实例,在这些情景中,有害提示嵌入良性提示,能够愚弄LLMs产生有害的反应。这些情景的不同叙述结构表明,SquenaltialBreak足够灵活,可以适应讨论之外的各种快速格式。广泛的实验表明,SqernaltialBreak仅使用一次公开询问,在现有的基线上大大提升袭击成功率,而不能利用这种开放源和封闭源模式。我们的研究,我们强调这一可靠的数据库和Gial-H数据库的迫切需要。
Article 260
Title@2025-05-28 (3): Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory
Title: Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory | Kulturelle Bewertungen von Vision-Sprachen-Modellen haben viel von der Kulturtheorie zu lernen | 展望-语言模式的文化评价有许多可学习的文化理论 2505.22793v1 |
Authors: Srishti Yadav, Lauren Tilton, Maria Antoniak, Taylor Arnold, Jiaang Li, Siddhesh Milind Pawar, Antonia Karamolegkou, Stella Frank, Zhaochong An, Negar Rostamzadeh, Daniel Hershcovich, Serge Belongie, Ekaterina Shutova
Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.
现代视觉语言模型往往在文化能力评价和基准方面失灵。鉴于在VLM上建立的各种应用,人们重新有兴趣了解它们是如何将文化差异编码的。虽然对这个问题的个别方面进行了研究,但我们仍然缺乏一个全面框架,以便系统地查明和说明VLM图像中存在的细微文化层面。本立场文件认为,视觉文化研究(文化研究、半科学学和视觉研究)的基本方法对图像的文化分析是必要的。我们根据这次审查,提出了一套与文化层面相对应的五个框架,必须加以考虑,以便更全面地分析VLM的文化能力。
Article 261
Title@2025-05-28 (3): Can Large Language Models Match the Conclusions of Systematic Reviews?
Title: Can Large Language Models Match the Conclusions of Systematic Reviews? | Können große Sprachmodelle mit den Schlussfolgerungen systematischer Bewertungen übereinstimmen? | 大语言模型能否与系统审查的结论相匹配? 2505.22787v1 |
Authors: Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy
Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical-specialist models across varying sizes (from 7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
系统综述(SR)由专家总结并分析各项独立研究中的证据,为某一专门主题提供见解,是循证临床决策、研究和政策的基石。鉴于科学文献呈指数级增长,人们对使用大型语言模型(LLMs)自动生成SR的兴趣日益浓厚。然而,LLMs能否以与领域专家同等的水平跨多篇文献批判性地评估证据、进行推理并给出建议,目前仍缺乏充分刻画。因此我们提出问题:在可获得相同研究的情况下,LLMs能否得出与临床专家撰写的系统综述一致的结论?为探讨这一问题,我们提出MedEvidence,这是一个将100篇SR的结论与其所依据的研究配对的基准。我们在MedEvidence上评测了24个LLMs,包括推理模型、非推理模型、医学专用模型以及不同规模(7B至700B)的模型。通过系统评估,我们发现推理并不一定提升性能,更大的模型并不总能带来更大的收益,而基于知识的微调会降低在MedEvidence上的准确率。相反,大多数模型表现出相似的行为:性能随token长度增加而下降,回答表现出过度自信,并且与人类专家不同,所有模型都对低质量研究结果缺乏科学怀疑。这些结果表明,尽管这类系统已被部署并为临床医生所使用,但要让LLMs可靠地得出与专家撰写的SR一致的结论,仍需更多工作。我们向更广泛的研究社区发布代码库和基准,以进一步研究基于LLM的SR系统。
Article 262
Title@2025-05-28 (3): MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators
Title: MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | MEDAL: Ein Rahmen für Benchmarking von LLMs als mehrsprachige Open-Domain Chatbots und Dialogevaluatoren | MEDAL:多语言开放域聊天和对话评价员对LLMs进行基准评估的框架 2505.22777v1 |
Authors: John Mendonça, Alon Lavie, Isabel Trancoso
As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.
随着聊天机器人及其基本LLM的能力继续大幅提高,评价其业绩日益成为阻碍其进一步发展的主要障碍,一个重大挑战是现有的基准数据集,这些数据集基本上静止、过时,而且缺乏多语种覆盖面,限制了其捕捉微妙的语言和文化差异的能力。本文件介绍了MEDAL,这是一个自动的多试管框架,用于生成、评价和整理更具代表性和多样性的开放地对话评价基准。我们的方法利用一些最先进的LMS,以不同种子背景为条件,生成用户-聊天式多语言对话。一个强大的LM(GPT-4.1)随后用于对聊天机器人的绩效进行多层面分析,发现明显的跨语种性业绩差异。在这种大规模评价的指导下,我们制定了一个新的元评价多语种基准和带有细微质量判断的人类注解样本。然后,我们用这一基准来评估一些推理和无关联的LMS作为开放性对话评价员的能力。我们发现,当前的LMS在发现,如何辨别那些涉及直觉和理性的问题。
Article 263
Title@2025-05-28 (3): GraphNarrator: Generating Textual Explanations for Graph Neural Networks
Title: GraphNarrator: Generating Textual Explanations for Graph Neural Networks | GraphNarrator: Erzeugen von Texterklärungen für Graph Neuronale Netzwerke | 图示记录器:生成图形神经网络的文字解释 2410.15268v2 |
Authors: Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Liang Zhao
Graph representation learning has garnered significant attention due to its broad applications in various domains, such as recommendation systems and social network analysis. Despite advancements in graph learning methods, challenges still remain in explainability when graphs are associated with semantic features. In this paper, we present GraphNarrator, the first method designed to generate natural language explanations for Graph Neural Networks. GraphNarrator employs a generative language model that maps input-output pairs to explanations reflecting the model’s decision-making process. To address the lack of ground truth explanations to train the model, we propose first generating pseudo-labels that capture the model’s decisions from saliency-based explanations, then using Expert Iteration to iteratively train the pseudo-label generator based on training objectives on explanation quality. The high-quality pseudo-labels are finally utilized to train an end-to-end explanation generator model. Extensive experiments are conducted to demonstrate the effectiveness of GraphNarrator in producing faithful, concise, and human-preferred natural language explanations.
图表代表学习因其在建议系统和社会网络分析等不同领域的广泛应用而引起极大关注。尽管在图表学习方法方面有所进步,当图表与语义特征相关时,挑战仍然难以解释。在本文件中,我们介绍了旨在为图形神经网络产生自然语言解释的首个方法 “ 图形搜索器 “ 。 “ 图形搜索器 “ 采用了一种基因化语言模型,绘制了反映模型决策过程的输入-输出对的解释。为了解决培训模型缺乏地面真相解释的问题,我们建议首先制作假标签,从突出的基于功能的解释中捕捉模型的决定,然后利用专家迭代语言来培训基于解释质量培训目标的假标签生成器。高品质的假标签最终用于培训一个端对端解释生成模型。进行了广泛的实验,以展示“图形搜索器”在制作准确、简明和人称自然语言解释方面的有效性。
Article 264
Title@2025-05-28 (3): Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Title: Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages | Zählen von Bäumen: Eine baumbankgetriebene Erforschung syntaktischer Variationen in Sprache und Schrift über Sprachen hinweg | 计数树:在树库驱动下探索不同语言的言语和书写方式的口语和书写方式差异 2505.22774v1 |
Authors: Kaja Dobrovoljc
This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
本文展示了一种新型的树本驱动方法,用依赖性分化的社团来比较语言和书写中的合成结构。 采用了一种完全直截了当的自下而上的方法,我们将合成结构定义为不灵活的依赖(子)树,并将它们从口语和书面的通用依赖(UD)树库中提取出来,用两种语言,即英语和斯洛文尼亚语,在语言和书面上截然不同的语言中,这两类语言和书写(UD)树库中,这两类语言和书写(UD)树库中都有不同的拼写。 对于每一种材料,我们分析合成组织的规模、多样性和分布、其不同模式的重叠、它们之间的重叠、以及最典型的演讲结构。 结果显示,在两种语言中,口述的合成公司结构中,与其书面对应的合成结构结构相比,其数量较少、种类较少、较少、较少不同的合成合成结构结构结构的合成结构结构,我们以系统化的方式研究整个具体语言结构中的一种模式。
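To make the notion of a delexicalized dependency (sub)tree in the abstract above concrete, the following is a rough sketch under stated assumptions, not the paper's extraction procedure. It assumes each token is a tuple of (id, head id, UPOS tag, dependency relation) as found in Universal Dependencies treebanks, drops the word forms, and renders the full subtree under every token that has at least one dependent. The bracket notation and the toy sentences are invented here for illustration.

```python
from collections import Counter, defaultdict


def delexicalized_subtrees(sentence):
    """Return one delexicalized string per token that has dependents.

    `sentence` is a list of (token_id, head_id, upos, deprel) tuples;
    head_id 0 marks the root. Word forms are deliberately absent.
    """
    children = defaultdict(list)
    upos = {}
    for token_id, head_id, pos, relation in sentence:
        children[head_id].append((token_id, relation))
        upos[token_id] = pos

    def render(token_id):
        # Dependents are listed in surface order, each prefixed by its relation.
        parts = [f"{relation}>{render(child)}"
                 for child, relation in sorted(children[token_id])]
        return upos[token_id] + ("(" + " ".join(parts) + ")" if parts else "")

    return [render(token_id) for token_id in sorted(upos) if children[token_id]]


# Toy parses (hypothetical): "She reads books", "He eats apples", "The dog barks".
spoken = [
    [(1, 2, "PRON", "nsubj"), (2, 0, "VERB", "root"), (3, 2, "NOUN", "obj")],
    [(1, 2, "PRON", "nsubj"), (2, 0, "VERB", "root"), (3, 2, "NOUN", "obj")],
    [(1, 2, "DET", "det"), (2, 3, "NOUN", "nsubj"), (3, 0, "VERB", "root")],
]
inventory = Counter(tree for sent in spoken for tree in delexicalized_subtrees(sent))
for structure, count in inventory.most_common():
    print(count, structure)
```

With one such inventory per modality, the overlap described in the abstract could be approximated as the set intersection of the two inventories' keys.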
Article 265
Title@2025-05-28 (3): Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems
Title: Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems | Automatisierte Bewertung der Annotationen von automatisierten Feedback-Systemen | 自动反馈系统自动读取系统输入说明 2505.22771v1 |
Authors: Christopher Ormerod
This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations – a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.
本研究说明了如何将注重反馈的注释纳入评分管道可以提高自动作文评分(AES)的准确性。这一方法在评分、选择和理解参数和分解要素(PERSUADE)的 Persuasive ESAD(PERSUADE)文集中得到了证明。我们综合了两类由反馈驱动的注释:那些识别拼写和语法错误的注释,以及那些突出论证组成部分的注释。为了说明这种方法如何在现实世界情景中应用,我们使用了两个LLMS来生成注释 – – 一种用于拼写校正的发源语言模型和一种基于编码的代号符号分类器,受过识别和标注参数的培训。通过将注释纳入评分过程,我们展示了在使用基于编码的大语言模型、微调的分类师的绩效方面的改进。
Article 266
Title@2025-05-28 (3): Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Title: Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction | Brauchen wir noch menschliche Annotatoren? Prompting große Sprachmodelle für Aspect Sentiment Quad Prediction | 我们还需要人类告别员吗? 2502.13044v3 |
Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.
假设情绪四分预测(ASQP)有助于详细理解文本中表达的意见,确定每种意见的意见术语、方面术语、方面类别和情绪两极分数;然而,说明一套完整的培训范例以微调ASQP模型是一个资源密集型过程;在本研究中,我们探索了大语言模型(LLMS)在五个不同数据集中零和几分学习ASQP任务的能力;我们报告F1的得分几乎与以最先进的微调模型获得的得分相等,超过了先前报告的零分和少分业绩;在Rest16餐厅域数据集的20分设置中,LLMS取得了51.54分F1分,而采用最佳的微调MVP方法的得分为60.39分。此外,我们报告了LMS在目标情绪检测方面的表现,F1分接近于微调模型,在30分制成的Srest16得分中达到68.93分,而与MVP的得分为72.76分;在SDMS域数据集的20分中,人手数仍然是完成最佳业绩任务的关键。
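The F1 scores quoted in the abstract above are typically computed by exact match over predicted tuples. The sketch below assumes the common convention that a predicted quad counts as correct only if all four elements match a gold quad for the same sentence, with micro-averaging over sentences; the paper's exact evaluation script may differ. The element order, category labels, and example quads are hypothetical.

```python
def quad_f1(gold, pred):
    """Micro-averaged exact-match precision, recall, and F1.

    `gold` and `pred` are parallel lists with one list of quads per sentence.
    Each quad is a tuple (aspect term, aspect category, opinion term, polarity).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    true_positives = n_pred = n_gold = 0
    for gold_quads, pred_quads in zip(gold, pred):
        gold_set, pred_set = set(gold_quads), set(pred_quads)
        true_positives += len(gold_set & pred_set)  # all four elements match
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical restaurant-domain example: the last predicted quad has the
# wrong polarity, so it counts as a miss even though three elements match.
gold = [
    [("pizza", "food quality", "delicious", "positive")],
    [("service", "service general", "slow", "negative"),
     ("waiter", "service general", "rude", "negative")],
]
pred = [
    [("pizza", "food quality", "delicious", "positive")],
    [("service", "service general", "slow", "negative"),
     ("waiter", "service general", "rude", "neutral")],
]
precision, recall, f1 = quad_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this convention the four-element ASQP task is stricter than the three-element TASD task, which fits TASD scores being the higher ones in the abstract.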
Article 267
Title@2025-05-28 (3): A Survey of Uncertainty Estimation Methods on Large Language Models
Title: A Survey of Uncertainty Estimation Methods on Large Language Models | Eine Übersicht über Methoden der Unsicherheitsschätzung bei großen Sprachmodellen | 大语言模型不确定性估算方法调查 2503.00172v2 |
Authors: Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, Hang Liu
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, these models could offer biased, hallucinated, or non-factual responses camouflaged by their fluency and realistic appearance. Uncertainty estimation is the key method to address this challenge. While research efforts in uncertainty estimation are ramping up, there is a lack of comprehensive and dedicated surveys on LLM uncertainty estimation. This survey presents four major avenues of LLM uncertainty estimation. Furthermore, we perform extensive experimental evaluations across multiple methods and datasets. Finally, we provide critical and promising future directions for LLM uncertainty estimation.
大型语言模型(LLMs)在各种任务中表现出了非凡的能力,然而,这些模型可以提供有偏见、有幻觉或非事实的反应,以其流利和现实的外观为幌子。不确定的估计是应对这一挑战的关键方法。虽然不确定性估计的研究工作正在加快,但缺乏关于LLM不确定性估计的全面和专门调查。这项调查提出了LLM不确定性估计的四大主要途径。此外,我们还对多种方法和数据集进行了广泛的实验性评价。最后,我们为LLM不确定性估计提供了关键和有希望的未来方向。
Article 268
Title@2025-05-28 (3): StressTest: Can YOUR Speech LM Handle the Stress?
Title: StressTest: Can YOUR Speech LM Handle the Stress? | StressTest: Kann Ihre Rede LM mit dem Stress umgehen? | 压力测试:你的演讲能解决压力吗? 2505.22765v1 |
Authors: Iddo Yosha, Gallil Maimon, Yossi Adi
Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription and access the full richness of the speech signal and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model’s ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline, and create Stress17k, a training set that simulates change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples are available at pages.cs.huji.ac.il/adiyoss-lab/stresstest.
句子压力是指强调,以口头表达中的具体词句来突出或对比一种想法,或引入新信息,这种强调往往被用来暗示一种没有明确说明的基本意图; 语言通识语言模式(SLMs)的最近进展使得能够直接处理音频,允许模式绕过音频转录和获取充分丰富的语音信号,并完成诸如口头回答等音频推理任务; 尽管判决压力在影响意义和发言意向方面起着关键作用,但在评价和开发这种模式时仍然大都被忽视; 在这项工作中,我们通过引入“压力测试”来弥补这一差距,这是专门用来评价模型根据压力模式对口述的判词进行区分的能力的基准; 我们评估了若干主要的SLMSM的绩效,发现尽管具备总体能力,但它们在这类任务上表现不佳; 为了克服这一限制,我们提议了一个新型的合成数据生成管道,并创建了“压力17k”, 一套模拟压力变异意味着的含义变化的培训。 然后,我们从经验上表明,优化模型与这种合成数据集相匹配,与真实世界的录音记录相一致,并且能够对SLMWSLMA/SAR模型进行有效的校样模型进行大幅调整。
Article 269
Title@2025-05-28 (3): Decomposed Opinion Summarization with Verified Aspect-Aware Modules
Title: Decomposed Opinion Summarization with Verified Aspect-Aware Modules | Zerlegte Meinungszusammenfassung mit verifizierten Aspect-Aware-Modulen | 与经核查的光谱软件模块拆解的意见摘要 2501.17191v3 |
Authors: Miao Li, Jey Han Lau, Eduard Hovy, Mirella Lapata
Opinion summarization plays a key role in deriving meaningful insights from large-scale online reviews. To make the process more explainable and grounded, we propose a domain-agnostic modular approach guided by review aspects (e.g., cleanliness for hotel reviews) which separates the tasks of aspect identification, opinion consolidation, and meta-review synthesis to enable greater transparency and ease of inspection. We conduct extensive experiments across datasets representing scientific research, business, and product domains. Results show that our approach generates more grounded summaries compared to strong baseline models, as verified through automated and human evaluations. Additionally, our modular approach, which incorporates reasoning based on review aspects, produces more informative intermediate outputs than other knowledge-agnostic decomposition approaches. Lastly, we provide empirical results to show that these intermediate outputs can support humans in summarizing opinions from large volumes of reviews.
意见总结在从大规模在线审查中得出有意义的见解方面发挥着关键作用。为了使这一过程更能解释和更有依据,我们提议采用以审查方面(例如旅馆审查的清洁性)为指导的域名式模块化方法,将各方面的识别、意见整合和元审查综合工作的任务分开,以便提高透明度和便于检查。我们进行了广泛的跨数据集实验,代表科学研究、商业和产品领域。结果显示,我们的方法比通过自动化和人力评估核实的强力基线模型产生更有根据的摘要。此外,我们的模块化方法包含了基于审查方面的推理,产生了比其他知识-不可知分化方法更丰富的中间产出。最后,我们提供了实证结果,表明这些中间产出可以支持人类总结大量审查的意见。
Article 270
Title@2025-05-28 (3): Resolving Lexical Bias in Model Editing
Title: Resolving Lexical Bias in Model Editing | Lösung Lexischer Bias in der Modellbearbeitung | 解析示范编辑中的法理偏见 2408.10411v3 |
Authors: Hammad Rizwan, Domenic Rosati, Ga Wu, Hassan Sajjad
Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model’s weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
模型编辑的目的是在大型语言模型经过培训后修改它们的输出。 以往的方法往往涉及直接改变模型重量,这可能导致模型降解。 最近的技术避免通过使用一个适应器对模型的重量进行修改,该适应器在演示空间的语义相似性触发时对模型进行编辑。 我们证明,目前的适应器方法极易受到强烈的词汇偏见的影响,从而导致将编辑方法应用到不相干和相互重叠的词句中等问题。 本文提出了一个原则性的方法,用于学习一个分解的表达空间,通过保持不相关的提示之间的距离,同时保持副词句之间的距离,便利编辑的精确本地化。 在我们的实验研究中,我们展示了我们的方法(模型编辑项目编辑网络- PENME)取得了最先进的模式编辑结果,同时在推断过程中比以往的方法更具计算效率,而且在不同的结构中适应性更强。
Article 271
Title@2025-05-28 (3): FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
Title: FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian | FAMA: Das erste großformatige Open-Science-Sprechstiftungsmodell für Englisch und Italienisch | FAMA:英语和意大利语第一个大型开放科学演讲基金会模型 2505.22759v1 |
Authors: Sara Papi, Marco Gaido, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
发展诸如Whiseper和SeemlessM4T等语言基础模型(SFMs),大大推进了语言处理领域,然而,其封闭性质(无法进入的培训数据)和代码(代码)带来了重大的可复制性和公平评估挑战。虽然其他领域通过开发完全透明的开放源代码和数据培训模型,在开放科学方面取得了显著进步,但类似的演讲努力仍然有限。为了填补这一空白,我们引入了FAMA,这是第一个为英语和意大利人开设的开放科学SFM家族,在150公里以上小时的OS语言语音数据方面进行了培训。此外,我们展示了一套新数据集,其中包含了两种语言16千小时的清洁和假标签语言的演讲。结果显示,FAMA取得了与现有SFM的竞争性业绩,同时速度达到8倍之快。所有工艺品,包括代码、数据集和模型,都根据OS合规许可证发布,促进了语音技术研究的开放性。
Article 272
Title@2025-05-28 (3): FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Title: FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference | FlashFormer: Ganzmodell-Kernel für effiziente Low-Batch-Inferenz | FlashFormer: 用于高效低批量推断的全模块内核 2505.22758v1 |
Authors: Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.
由于现代大型语言模型的大小和计算特点,人们越来越有兴趣开发适合培训和推断的专门内核,现有内核主要用于计算优化利用,以大型批量培训和推断环境为对象,然而,低批量的推断,即内存带宽和内核发射间接费用是重要因素,对于许多感兴趣的应用,例如边缘部署和耐久敏感应用,仍然很重要。本文描述了FlashFormer,这是加速基于变压器的大型语言模型单批推论的证明。在各种模型大小和量化环境中,我们观察到与现有最先进的推论内核相比,非三轮加速。
Article 273
Title@2025-05-28 (3): Pre-Training Curriculum for Multi-Token Prediction in Language Models
Title: Pre-Training Curriculum for Multi-Token Prediction in Language Models | Pre-Training Curriculum für Multi-Token-Vorhersage in Sprachmodellen | 语言模式多肯预测培训前课程 2505.22757v1 |
Authors: Ansar Aynetdinov, Alan Akbik
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
多语种预测(MTP)是最近提出的语言模型培训前目标。MTP不是仅仅预测下一个标志(NTP),而是利用多个预测头来预测每个预测步骤的下一个美元象征性。MTP在提高下游业绩、推推论速度和培训效率方面表现出了希望,特别是对于大型模型来说,MTP显示了希望。然而,以前的工作表明,较小的语言模型(SLM)与MTP的目标相冲突。为了解决这个问题,我们提出了MTP培训课程学习战略,探索了两个变式:一个前方课程,它逐渐增加了从NTP到MTP培训前目标的复杂性,另一个反向课程,这正好相反。我们的实验表明,前方课程使可持续土地管理能够在培训前更好地利用MTP目标,提高下游NTP绩效和变色化产出质量,同时保留自我选择解码的好处。相反的课程实现了更强大的NTP业绩和产出质量,但未能提供任何自我描述的好处。
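The forward and reverse curricula described above can be sketched as a schedule over the number of active prediction heads. This is a minimal illustration only: the function name `heads_at_step` and the choice of equal-length stages are assumptions, since the abstract does not say how the transition from NTP to MTP is paced.

```python
def heads_at_step(step, total_steps, k_max, reverse=False):
    """Return how many tokens (prediction heads) to train on at a given step.

    Hypothetical schedule: training is split into k_max equal-length stages.
    The forward curriculum grows from 1 head (plain NTP) to k_max heads (MTP);
    the reverse curriculum shrinks from k_max heads down to 1.
    """
    if total_steps <= 0 or k_max <= 0:
        raise ValueError("total_steps and k_max must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * k_max // total_steps      # stage index in 0 .. k_max - 1
    k_forward = stage + 1                    # 1 .. k_max
    return k_max + 1 - k_forward if reverse else k_forward


if __name__ == "__main__":
    print([heads_at_step(s, 8, 4) for s in range(8)])                # forward
    print([heads_at_step(s, 8, 4, reverse=True) for s in range(8)])  # reverse
```

In a training loop, the returned value would decide how many heads contribute to the loss at that step; how the per-head losses are weighted is left open here.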
Article 274
Title@2025-05-28 (3): Decomposing Elements of Problem Solving: What “Math” Does RL Teach?
Title: Decomposing Elements of Problem Solving: What “Math” Does RL Teach? | Zersetzende Elemente der Problemlösung: Was “Math” lehrt RL? | 问题解决的分解要素:RL教什么“马思”? 2505.22756v1 |
Authors: Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis
Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a ‘coverage wall’ due to insufficient planning skills. To explore RL’s impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming that RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
数学推理任务已成为评估LLM的推理能力的重要基准,特别是在强化学习方法(RL)方法(如GROPO)中,表现显著;然而,仅靠精确度指标并不能支持精细评估能力,无法揭示哪些解决问题的技能已经内部化。为了更好地了解这些能力,我们提议将解决问题的方法分解为基本能力:规划(绘制步骤顺序的问题)、执行(正确执行解决方案步骤)和核查(确定解决方案的正确性)。我们很生动地发现,GROPO主要加强执行技能,改进执行能力,解决模型已经知道如何解决一种现象的问题,我们称之为温度蒸馏。更重要的是,我们显示经过RL培训的模式与根本的新问题作斗争,由于规划技能不足而“覆盖墙”受到冲击。为了更深入地探讨RL的影响,我们为数学问题的解决类比而构建了最低限度的合成解决方案-树木导航任务。这种受控制的设置复制了我们的经验发现,确认RL主要提升了执行力度,在RL上加强了执行的力度。 更重要的是,我们在LLL的道路上,我们确定了一个潜在的改进的路径上,我们通过L的推理,我们可以确定一条新的推理。
Article 275
Title@2025-05-28 (3): VideoRAG: Retrieval-Augmented Generation over Video Corpus
Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus | VideoRAG: Retrieval-Augmented Generation über Video Corpus | VideoRAG: 利用视频公司回收的原始一代 2501.05874v3 |
Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these limitations, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, motivated by the observations that the context size of LVLMs may not be sufficient to process all frames in extremely long videos and that not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
Retrieval-Auged General(RAG)是提高模型事实准确性的一个有力战略,通过检索与查询有关的外部知识并将其纳入生成过程,提高模型的事实准确性。然而,现有方法主要侧重于文本,最近的一些进展考虑了图像,它们在很大程度上忽视了视频,这是一个丰富的多式知识来源,比任何其他模式都能够更有效地反映背景细节。虽然最近的研究探索了视频在响应生成中的使用,但它们要么没有检索的预断查询相关视频,要么没有将视频转换成缺乏多式联运内容的文字描述。为了解决这些问题,我们引入了视频RAG,这个框架不仅根据视频的相关性动态地检索视频,而且还使用了视觉和文字信息。视频RAG的运作受到最近大型视频语言模型(LLMS)的驱动,使得直接处理视频内容能够代表视频内容,用于检索,并将已检索的视频与生成的查询紧密整合在一起。此外,受LVLMMS的上下文规模可能不足以在非常长的视频中处理所有框架,而不是所有框架都具有同等重要性。我们引入了一个视频框架,因此,我们引入了一个视频框架的视频选择了最高级的视频框架,因此,我们可以提取的视频框架的视频框架的视频选择了它。我们可以用来提取。
Article 276
Title@2025-05-28 (3): Climate Finance Bench
Title: Climate Finance Bench | Klimafinanzierungsbank | 气候融资法官 2505.22752v1 |
Authors: Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov
Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever’s ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.
气候融资大厅引入了一个开放式基准,以使用大语言模型对公司气候披露进行问答为目标。我们以英文撰写了33份最新可持续性报告,这些报告来自所有11个GICS部门的公司,并附有330对经过专家验证的问答对口的注释,它们涵盖纯粹的提取、数字推理和逻辑推理。我们在此数据集的基础上提议对RAG(回收-提款生成)方法进行比较。我们显示,检索器找到实际包含答案的通道的能力是主要的性能瓶颈。我们进一步主张在AI气候应用中透明碳报告,强调诸如Weight量化等技术的优势。
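The claim that the retriever's ability to locate answer-bearing passages is the main bottleneck can be made concrete with a simple hit-rate measurement. The sketch below is an assumption-laden illustration, not the benchmark's evaluation code: the function name `retrieval_hit_rate`, the record layout, and the use of case-insensitive substring matching as a proxy for "contains the answer" are all hypothetical, and the two records are toy data invented for the example.

```python
def retrieval_hit_rate(examples, k):
    """Fraction of questions for which a top-k passage contains the gold answer.

    Each example is a dict with an "answer" string and a ranked "retrieved"
    list of passage strings. Substring matching is a crude stand-in for
    expert judgment about whether a passage really supports the answer.
    """
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        answer = ex["answer"].lower()
        top_k = ex["retrieved"][:k]
        if any(answer in passage.lower() for passage in top_k):
            hits += 1
    return hits / len(examples)


if __name__ == "__main__":
    # Toy records, made up purely to exercise the function.
    toy = [
        {"answer": "42 MtCO2e",
         "retrieved": ["Scope 1 emissions were 42 MtCO2e.", "Board overview."]},
        {"answer": "2050",
         "retrieved": ["Governance structure.", "We target net zero by 2050."]},
    ]
    print(retrieval_hit_rate(toy, k=1), retrieval_hit_rate(toy, k=2))
```

Comparing this rate against end-to-end answer accuracy at the same k gives a rough sense of how much of the error comes from retrieval rather than generation.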
Article 277
Title@2025-05-28 (3): AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models
Title: AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models | AutoL2S: Auto-Lang-Short-Reasoning für effiziente große Sprachmodelle | 自动L2S:高效大语言模式的自动长期短期理由 2505.22662v1 |
Authors: Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu
The reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks but often suffer from overthinking, generating unnecessarily long chain-of-thought (CoT) reasoning paths for easy reasoning questions, thereby increasing inference cost and latency. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. However, they lack the flexibility to adapt CoT length dynamically based on question complexity. In this paper, we propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to dynamically compress their generated reasoning path based on the complexity of the reasoning question. AutoL2S enables a learned paradigm, in which LLMs themselves can decide when longer reasoning is necessary and when shorter reasoning suffices, by training on data annotated with our proposed method, which includes both long and short CoT paths and a special
推理能力强的大型语言模型(LLMS)在复杂的推理任务上表现很强,但往往受到过度思考的困扰,为简单推理问题带来不必要的冗长思维链推理路径,从而增加推理成本和潜伏度。最近的方法试图通过人工决定何时应用长推理或短推理来应对这一挑战。然而,它们缺乏根据问题的复杂性动态调整CT长度的灵活性。在本文件中,我们提议Auto-L2SY(AutoL2S)表示Auto-L2SY(AutoL2S)象征,这是一个动态和模型-不可知性框架,使LMS能够根据推理问题的复杂性动态地压缩其产生的推理路径。AutL2S(CLMS)使LM(LM)自己能够决定何时需要较长推理和时间短推理是否足够,通过以我们拟议方法附加说明的数据(包括长短路和短路和短路长的CSYSYSY)培训来说明该模型何时可以跳过冗长的推理推理。这提议说明战略可以提高LMM公司的能力,在不长的推理推理推理学上显示DL2号的推理质量后,通过提高的推理质量。
Article 278
Title@2025-05-28 (3): GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Title: GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning | GuessArena: Raten Sie, wer ich bin? Ein selbstadaptives Framework zur Bewertung von LLMs in Domain-spezifischem Wissen und Vernunft | GuessArena:猜猜我是谁? 评估特定知识和理由领域LMLM的自我激励框架 2505.22661v1 |
Authors: Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
对大型语言模型(LLMs)的评价历来依赖静态基准,这种模式构成两大限制:(1) 预先定义的测试组缺乏适应不同应用领域的适应性,(2) 标准化评价程序往往未能抓住对具体领域知识和背景推理能力的细微评估。为了克服这些挑战,我们提议GessarArena,这是一个基于对抗性游戏互动的适应性评价框架。在“猜我是谁?”游戏互动结构的启发下,我们的框架将动态领域知识模型与渐进推理评估无缝地结合在一起,以提高评价的忠诚性。Gesararena对五个纵向领域――金融、保健、制造、信息技术和教育――的经验研究有效地区分了域知识覆盖面和推理链的完整性。与常规基准相比,我们的方法在可解释性、可伸缩性和情景适应性方面有很大的优势。
Article 279
Title@2025-05-28 (3): 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
Title: 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model | 3DLLM-Mem: Langzeit-Raum-Temporal-Speicher für körpereigenes 3D-Großsprachmodell | 3DLLM-Mem:3D大语言模型内嵌成的3D大语言长期空间-时间记忆 2505.22657v1 |
Authors: Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench’s most challenging in-the-wild embodied tasks.
人类通过在时间和空间经验中利用长期记忆来完成复杂的任务。相比之下,目前的大语言模型(LLMS)在动态、多房间的3D环境中为有效规划和采取行动而奋斗。我们认为,这一限制的一部分是由于LLMS缺乏适当的3D空间时空记忆模型。为此,我们首先引入了3DMem-Bench,这是一个综合基准,包括26 000多个轨道和2 892个包含的任务、问答和字幕,旨在评估代理人在3D环境中对长期记忆进行思考的能力。第二,我们提出3DLLMM-Mem,这是一个新的动态记忆管理和聚合模型,用于体现LLMS的空间时空推理和行动。我们的模式使用工作记忆符号,作为当前观察的询问,有选择地关注和整合从记忆中存储过去观察和互动的最有用的空间和时间特征。我们的方法使代理人能够侧重于任务相关信息,同时保持复杂、长与同步环境中的记忆效率。第二,我们建议3DLM-MMM,一个全新的动态记忆管理和融合模式展示了3DMS-DM 最具有挑战性的任务。
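The idea of working memory tokens acting as queries over episodic memory can be sketched with plain scaled dot-product attention. The code below is a simplified stand-in, not the 3DLLM-Mem architecture: it assumes episodic entries serve as both keys and values, omits learned projections, and fuses by residual addition. The function names `softmax` and `fuse_memory` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_memory(working, episodic):
    """Let each working-memory vector attend over episodic memory.

    working:  list of query vectors (current observations)
    episodic: list of stored vectors (past observations), used here as both
              keys and values (an assumption; no learned projections)
    Returns one fused vector per query: query + attention-weighted memory.
    """
    dim = len(working[0])
    fused = []
    for query in working:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in episodic
        ]
        weights = softmax(scores)
        attended = [
            sum(w * value[j] for w, value in zip(weights, episodic))
            for j in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


if __name__ == "__main__":
    print(fuse_memory([[10.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because the weights come from a softmax, memory entries similar to the current observation dominate the fused vector, which is the selective behavior the abstract describes.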
Article 280
Title@2025-05-28 (3): VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models | VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle | Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v1 |
Authors: Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over the current state of the art on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
最近的大型视觉语言模型(LVLMs)通过纳入精细的视觉感知和编码,提高了对多种模式的理解程度。然而,由于视觉象征序列的视觉比重较长,对实时部署构成挑战,这些方法产生了巨大的计算成本。为此,以前的研究探索了在视觉编码器输出层或语言模型早期层运行不重要的视觉象征。在这项工作中,我们重新审视这些设计选择,并通过对视觉编码和语言解码阶段中视觉标志的处理方式进行全面的经验性研究来重新评估其有效性。根据这些见解,我们建议VScan采用两阶段的视觉象征值削减框架,解决象征性冗余问题,其方法是:(1) 在视觉编码过程中将补充性全球和局部扫描与象征性合并结合起来,(2) 在语言模型的中间层运行。 在四个LVLMs上的广泛实验结果验证了VScan在加速推理方面的有效性,并表明其在16个基准上优于当前最先进的方法。值得注意的是,在应用于LLaVA-NeXT-7B时,VScan在预填充阶段实现了2.91$\times$的加速,FLOPs减少了10$\times$,同时保留了原始性能的95.4%。
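Visual token pruning, the family of techniques this abstract revisits, can be illustrated with a generic score-based filter. The sketch below is not VScan itself: it does not reproduce the global and local scans or token merging, and the function name `prune_tokens`, the `keep_ratio` parameter, and the idea of receiving precomputed importance scores are assumptions made for illustration.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order.

    tokens:     list of token representations (any objects)
    scores:     one importance score per token (e.g. attention received);
                how scores are computed is outside this sketch
    keep_ratio: fraction of tokens to keep, in (0, 1]
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must lie in (0, 1]")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])   # restore positional order
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 0.5))
```

Keeping the survivors in their original order matters because positional information is usually still needed by the layers that follow.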
Article 281
Title@2025-05-28 (3): Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents
Title: Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents | Position: Ungewissheitsquantifizierung braucht eine Neubewertung für großsprachige Modellagenten | 位置:大语言示范物剂的不确定性量化需求评估 2505.22655v1 |
Authors: Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci
Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: underspecification uncertainties, for when users do not provide all information or define the exact task at the first go; interactive learning, to ask follow-up questions and reduce the uncertainty about the current context; and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.
已知的大型语言模型(LLMs)和聊天室代理商有时提供错误的产出,最近发现这一点永远无法完全预防,因此,不确定性量化具有关键作用,目的是量化一个总数字或两个数字的模糊程度,以量化偏向性和认知性不确定性。本立场文件认为,对于LLM代理商在与用户沟通时所操作的开放和互动设置而言,这种传统的不确定性的分化太有限,我们需要研究如何丰富这一新设想的不确定性。我们审查文献,发现流行的关于偏向性和集中性不确定性的定义彼此直接矛盾,在互动式LM代理商设置中失去其意义。因此,我们提出三个新的研究方向,侧重于这种人类计算机互动中的不确定性:在用户不提供所有信息或确定第一选择的确切任务时,在互动学习中,询问后续问题,减少当前背景的不确定性和产出不确定性,利用丰富的语言和语音空间来表达不确定性,而不仅仅是数字。我们期望这些新的处理不确定性和传递不确定性的方法将更加透明,在LM代理商中,我们期望这些新的处理和传递不确定性的方式将使得LMM的不确定性更加可信和透明。
Article 282
Title@2025-05-28 (3): The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
Title: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason | Das Klettern schnitzt Weisheit tiefer als der Gipfel: Über die lärmenden Belohnungen im Lernen zur Vernunft | 攀爬的雕刻比首脑会议更深的智慧:学习理性的吵闹奖励 2505.22653v1 |
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function’s outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as “first, I need to”, without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM’s performance on open-ended tasks. These findings suggest the importance of improving models’ foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
培训后大型语言模型(LLMS)最近关于通过强化学习进行推理的研究(LLL)最近关于通过强化学习(RL)的大型语言模型(LLMS)最近的研究通常侧重于能够准确核实和奖励的任务,例如解决数学问题。相比之下,我们的研究调查了奖励噪音的影响,这是使用奖励模式培训LLMS后对现实世界情景的一种更实际的考虑。我们发现LLMS对大量奖励噪音表现出强大的强力。例如,手工翻转数学任务中40%的奖励功能产出的40%仍然能够迅速实现趋同,使其数学任务的业绩从5%提高到72%,而数学任务的业绩则与经过严格正确核查和准确奖励的模型相比较。认识到推理过程对最终结果的重要性,我们把RPR-ROS-ROSL的成绩模型(即推算模型,RPR)与升级阶段的成绩奖赏模型结合起来,同时在RPR-RBS-RO的模型中,我们将RR-R-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-ral-risma-risma-risal-risal-r-rismal-ral-risl-risl-rislisl)的成绩模型与这些在改进基础基础的模型的模型的模型的模型结合起来。
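The two reward schemes in this abstract, randomly flipped correctness rewards and a reasoning pattern reward (RPR), can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the binary 0/1 reward, the function names, the `per_phrase` increment, the cap, and the second phrase in the usage example are hypothetical. Only the phrase "first, I need to" comes from the abstract.

```python
import random


def noisy_reward(is_correct, flip_prob, rng):
    """Binary correctness reward whose value is flipped with probability flip_prob."""
    reward = 1.0 if is_correct else 0.0
    if rng.random() < flip_prob:
        reward = 1.0 - reward          # inject reward noise
    return reward


def reasoning_pattern_reward(response, phrases, per_phrase=0.25, cap=1.0):
    """Reward the presence of key reasoning phrases, ignoring answer correctness.

    per_phrase and cap are illustrative values, not taken from the paper.
    """
    text = response.lower()
    hits = sum(1 for phrase in phrases if phrase.lower() in text)
    return min(cap, hits * per_phrase)


if __name__ == "__main__":
    rng = random.Random(0)
    flipped = sum(noisy_reward(True, 0.4, rng) == 0.0 for _ in range(10_000))
    print(f"empirical flip rate: {flipped / 10_000:.3f}")
    print(reasoning_pattern_reward(
        "First, I need to isolate x.", ["first, I need to", "let me verify"]))
```

With a flip probability of 0.4, the observed reward still agrees with correctness more often than not, which gives some intuition for why a learning signal can survive this level of noise.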
Article 283
Title@2025-05-28 (3): Sherlock: Self-Correcting Reasoning in Vision-Language Models
Title: Sherlock: Self-Correcting Reasoning in Vision-Language Models | Sherlock: Selbstkorrekte Vernunft in Vision-Sprachen-Modellen | 夏洛克:视觉语言模型中的自我校正理由 2505.22651v1 |
Authors: Yi Ding, Ruqi Zhang
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs’ self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
合理视觉语言模型(VLMS)在复杂的多式联运任务方面表现良好,然而,它们仍面临重大挑战:它们对推理错误高度敏感,需要大量附加说明的数据或准确的校验员,并努力超越特定领域。为解决这些局限性,我们探索自我校正作为加强推理VLMS(VLMS)的战略。我们首先深入分析推理VLMS的自我校正能力,并找出关键差距。根据我们的调查结果,我们引入夏洛克、自我校正和自我改进培训框架。夏洛克引入了轨级自我校正目标、基于视觉穿透的偏好数据构建方法以及用于偏好调整的动态美元。一旦模型获得自我校正能力,仅使用20公里随机抽样的数据,在没有外部监督的情况下继续自我校正。根据Llama3.2-Vision-11B模型,夏洛克在八项基准中取得了显著的成果,达到64.1的平均精确度,直接生成的数据和65.4美元数据,而自我修正1号LVAFAFRAFRA(202)之后,使用自我修正数据。
Article 284
Title@2025-05-28 (3): Training Language Models to Generate Quality Code with Program Analysis Feedback
Title: Training Language Models to Generate Quality Code with Program Analysis Feedback | Schulung von Sprachmodellen zur Generierung von Qualitätscodes mit Feedback zur Programmanalyse | 具有方案分析反馈的产生质量守则培训语言模式 2505.22704v1 |
Authors: Feng Yao, Zilong Wang, Liyuan Liu, Junxia Cui, Li Zhong, Xiaohan Fu, Haohui Mai, Vish Krishnan, Jianfeng Gao, Jingbo Shang
Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
具有大语言模型(LLMs)的代码生成(LLMs)通常称为 “ 共鸣编码 “ ,在生产中日益被采用,但未能确保代码质量,特别是在安全(例如SQL注入弱点)和可维护性(例如缺失类型说明)方面,特别是在安全(例如,SQL注入弱点)和可维护性(例如,缺失类型说明)方面;现有的方法,例如监督下的微调和基于规则的处理后处理,依靠劳动密集型说明或易碎的杂交,限制其可缩放性和有效性;我们提议建立真正的强化学习框架,鼓励LLMs利用程序分析-指导反馈生成生产质量代码。具体地说,Seal整合了两个自动信号:(1) 程序分析,发现安全性或可维护性缺陷,(2) 单位测试,确保功能正确性。与先前的工作不同,我们的框架是迅速的和无参考性的,使可扩展性监督能够不受人工干预。在多个数据集和模型尺度上进行的实验表明,在功能和代码质量的同时评估中实现最先进的方法。我们的工作缩小了速度和质量之间的差距。
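The two automated signals named above, program analysis and unit tests, can be combined into a single scalar reward along the following lines. This is a toy sketch, not REAL's reward: the analysis step only counts functions with missing type annotations (one of the maintainability defects the abstract mentions) using Python's standard `ast` module, and the function names and the equal `weight` between the two signals are assumptions.

```python
import ast


def count_unannotated_functions(source):
    """Count functions lacking a return annotation or any parameter annotation.

    A deliberately crude check: it also flags an unannotated `self`, which a
    real analyzer would normally exempt.
    """
    missing = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            if node.returns is None or any(p.annotation is None for p in params):
                missing += 1
    return missing


def hybrid_reward(tests_passed, tests_total, defects_found, weight=0.5):
    """Blend functional correctness with a defect-free bonus (hypothetical weighting)."""
    functional = tests_passed / tests_total if tests_total else 0.0
    quality = 1.0 if defects_found == 0 else 0.0
    return weight * functional + (1.0 - weight) * quality


if __name__ == "__main__":
    code = "def f(x):\n    return x\n\ndef g(x: int) -> int:\n    return x\n"
    defects = count_unannotated_functions(code)
    print(defects, hybrid_reward(tests_passed=3, tests_total=4, defects_found=defects))
```

Keeping the two signals separate until the final blend makes it easy to inspect which one a policy is failing on during training.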
Article 285
Title@2025-05-28 (3): WebDancer: Towards Autonomous Information Seeking Agency
Title: WebDancer: Towards Autonomous Information Seeking Agency | WebDancer: Auf dem Weg zu einer autonomen Informationsagentur | WebDancer:走向自主信息搜索机构 2505.22648v1 |
Authors: Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent based on ReAct. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in https://github.com/Alibaba-NLP/WebAgent.
解决复杂的现实世界问题需要深入的信息搜索和多步推理。以深层研究为例,代理系统最近的进展凸显了自主多步研究的潜力。在这项工作中,我们从数据中心和培训阶段的角度为建立端到端代理信息搜索代理提供了一个连贯的范例。我们的方法包括四个关键阶段:(1) 数据浏览构建,(2) 轨迹取样,(3) 监督有效寒冷启动的微调,(4) 强化普及学习。我们根据ReAct、WebDancer、GAIA和WebWalkerQA对具有挑战性的信息搜索基准的经验评估将这一框架立即纳入网络代理中,展示了WebDancer的出色业绩,取得了相当大的成果,并突出了我们培训范例的功效。对代理培训的进一步分析为开发更能的代理模型提供了宝贵的洞察力和可操作的系统路径。代码和演示将在https://github.com/Alibaba-NLP/WebAgent中发布。
Article 286
Title@2025-05-28 (3): Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Title: Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese | Charakterisierung von Bias: Benchmarking von großen Sprachmodellen in vereinfachter versus traditionellem Chinesisch | 区分偏见:将大型语言模式与传统中文相比的简化程度基准化 2505.22645v1 |
Authors: Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models – spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
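尽管已有研究分别考察了大型语言模型(LLM)在简体中文和繁体中文上的能力,但LLM在这两种书面中文变体下是否表现出差异仍不清楚。这一问题至关重要,因为LLM回答质量的差异会因忽视简体与繁体中文背后不同的文化背景而延续表征性伤害,并可能加剧教育或招聘等领域中由LLM辅助决策带来的下游伤害。为了考察潜在的性能差异,我们设计了两个反映现实场景的基准任务:地区用词选择(提示LLM说出一个在中国大陆和台湾有不同称呼的被描述物品)和地区姓名选择(提示LLM从同时以简体和繁体中文给出的姓名列表中选择录用对象)。针对这两个任务,我们审计了11个主流商业LLM服务和开源模型,涵盖主要以英文、简体中文或繁体中文训练的模型。分析表明,LLM回答中的偏差同时取决于任务和提示语言:在地区用词选择任务中,多数LLM不成比例地偏向简体中文回答;而在地区姓名选择任务中,它们却出人意料地偏向繁体中文姓名。我们发现,这些差异可能源于训练数据的代表性、书写字符偏好以及简体与繁体中文分词方式的不同。这些发现凸显了进一步分析LLM偏差的必要性;为此,我们提供了一个开源基准数据集,以促进对未来LLM在不同中文变体上行为的可复现评估(https://github.com/brucelyu17/SC-TC-Bench)。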
Article 287
Title@2025-05-28 (3): Learning Composable Chains-of-Thought
Title: Learning Composable Chains-of-Thought | Komponierbare Gedankenketten lernen | 学习可组合的思维链 2505.22635v1 |
Authors: Fangcong Yin, Zeyu Leo Liu, Liu Leqi, Xi Ye, Greg Durrett
A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train “atomic CoT” models on the atomic tasks with Composable CoT data and combine them with multitask learning or model merging for better zero-shot performance on the target compositional task. Such a combined model can be further bootstrapped on a small amount of compositional data using rejection sampling fine-tuning (RFT). Results on string operations and natural language skill compositions show that training LLMs on Composable CoT outperforms multitask learning and continued fine-tuning baselines within a given training data budget.
教大型语言模型(LLM)进行推理的常见做法,是在分布内推理问题的思维链(CoT)轨迹上进行训练,但要为每个感兴趣的问题获取这类标注数据成本高昂。我们希望推理模型能够泛化到训练分布之外,并且最好能够组合式泛化:将原子推理技能组合起来,解决更难的、未见过的推理任务。针对没有带标注CoT数据的目标组合任务,我们向推理技能的组合式泛化迈出了一步。我们发现,仅在原子任务的CoT数据上训练模型只能带来有限的泛化,而对各原子任务的CoT格式做最小修改使其可组合,则能够带来改进。我们可以用可组合CoT(Composable CoT)数据在原子任务上训练“原子CoT”模型,并通过多任务学习或模型合并将它们结合起来,从而在目标组合任务上获得更好的零样本性能。这样的组合模型还可以利用拒绝采样微调(RFT),在少量组合数据上进一步自举。在字符串操作和自然语言技能组合上的结果表明,在给定的训练数据预算内,用Composable CoT训练LLM优于多任务学习和持续微调基线。
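The abstract mentions bootstrapping the combined model with rejection sampling fine-tuning (RFT). As a rough illustration of the data-collection half of that idea only, the sketch below samples several reasoning traces per problem and keeps those whose final answer passes a check. The names `rejection_sample`, `generate`, and `is_correct` are hypothetical and not taken from the paper; the fine-tuning step itself is not shown.

```python
def rejection_sample(problems, generate, is_correct, num_samples=4):
    """Collect (question, trace) pairs whose sampled answer passes the check.

    problems    -- iterable of (question, reference_answer) pairs
    generate    -- hypothetical callable: question -> (trace, answer)
    is_correct  -- hypothetical callable: (answer, reference) -> bool
    num_samples -- how many traces to sample per question (assumed value)
    """
    kept = []
    for question, reference in problems:
        for _ in range(num_samples):
            trace, answer = generate(question)
            # Reject any sample whose final answer does not match the reference.
            if is_correct(answer, reference):
                kept.append((question, trace))
    return kept
```

The kept pairs would then serve as supervised fine-tuning data; how the authors filter, deduplicate, or weight samples is not stated in the abstract.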
Article 288
Title@2025-05-28 (3): Spatial Knowledge Graph-Guided Multimodal Synthesis
Title: Spatial Knowledge Graph-Guided Multimodal Synthesis | Raumwissen Graph-geführte multimodale Synthese | 空间知识图表辅助多模式合成 2505.22633v1 |
Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang
Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
最近多式大语言模型(MLLM)的进步大大加强了它们的能力;然而,它们的空间认知能力仍是一个显著的局限性。为了应对这一挑战,多式数据合成提供了一个有希望的解决办法。然而,确保综合数据符合空间常识是一项非三重任务。在这项工作中,我们引入了SKG2Data,这是以知识到数据生成概念为基础的、以空间知识图为指导的新型多式合成方法。SKG2Data自动构建了一个空间知识图(SKG),以模拟人对空间方向和距离的类似认识,随后用于指导多式数据合成。广泛的实验表明,从不同类型空间知识(包括方向和距离)中合成的数据不仅加强了MLLMS的空间认知和推理能力,而且展示了强大的通用能力。我们希望,基于知识的数据合成理念能够推动空间情报的发展。
Article 289
Title@2025-05-28 (3): Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs
Title: Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs | Stochastische Chamäleons: Halluzinationen durch irrelevanten Kontext offenbaren klassenbasierte (Fehl-)Verallgemeinerung in LLMs | 随机变色龙:无关上下文幻觉揭示LLM中基于类别的(错误)泛化 2505.22630v1 |
Authors: Ziling Cheng, Meng Cao, Marc-Antoine Rondeau, Jackie Chi Kit Cheung
The widespread success of large language models (LLMs) on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model’s internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits – one prioritizing direct query-based reasoning, the other incorporating contextual cues – whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues – what we term stochastic chameleons.
在NLP基准上,大型语言模型(LLMS)取得了广泛成功,与此同时,人们还担心LLMS主要发挥与培训前阶段相似的复制文本的随机鹦鹉的作用。但是,错误的性质是什么,这些错误是否具有任何规律性?在这项工作中,我们研究了不相关的背景幻觉,模型将误导性背景线索纳入预测中。通过行为分析,我们发现这些错误源于一个结构化但有缺陷的机制,我们用基于阶级(错误)的概括性来形容,在这种机制中,模型将抽象的班级提示与从查询或背景中提取的特征结合起来,以得出答案。此外,在Llama-3、Mistral和Pythia的机械化解释性实验中,39个事实回忆关系类型都显示出了任何规律性?在模型内部计算中,这些模型将误导性背景提示纳入到其预测中。通过行为分析,我们从低层次上对特征选择了具体的答案,而基于两种相互竞争的电路(一个优先考虑直接的推理,另一个包含背景线索的线标 – 其相对影响决定了Limcal-从我们所得出的直观观点。
Article 290
Title@2025-05-28 (3): Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions
Title: Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions | Chain-of-Talkers (CoTalk): Schnelle menschliche Annotation dichter Bildbeschreibungen | 谈话链(CoTalk):密集图像描述的快速人工标注 2505.22627v1 |
Authors: Yijun Shen, Delong Chen, Fan Liu, Xingyu Wang, Chuanyi Zhang, Liang Yao, Yuhui Zheng
While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the “residual” – the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.
虽然高度加注的图像说明大大便利了对稳健的视觉语言调整的学习,但系统优化人类笔记工作的方法仍未得到充分探讨。我们引入了 “ 语言链链 “ (CoTalk) ,这是在固定预算限制(如人类笔记总时间)下最大限度地增加附加说明的样本数量和提高其全面性的一种全方位的AI-loop方法。这个框架以两个关键见解为基础。首先,与常规平行注释相比,顺序注解减少了多余的工作量,因为随后的注解者只需对“正反”作注释,即以往说明没有覆盖的缺失的视觉信息。第二,人类过程文本输入更快,通过通过聊天输出多量的批注,从而能够优化效率。我们从两个方面评估我们的框架:通过对目标图解树的详细字幕进行分辨和分析其有效联系而获得的内在评价,因为随后的注解者只需对“正反方向”进行注释说明说明,即先前说明没有覆盖的缺失的视觉信息。第二,人类过程文本输入更快,通过阅读,同时通过谈话输出说明说明,从而实现最佳效率。 我们从两个方面评估框架,评估了语言结构。
Article 291
Title@2025-05-28 (3): Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding | Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding | 快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v1 |
Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
基于扩散的大型语言模型(Diffusion LLM)凭借并行解码能力,在非自回归文本生成方面展现出潜力。然而,由于缺少键值(KV)缓存,且同时解码多个词元时质量会下降,开源Diffusion LLM的实际推理速度往往落后于自回归模型。为弥合这一差距,我们提出了一种为双向扩散模型量身定制的新型分块近似KV缓存机制,使缓存得以复用,而性能下降可以忽略不计。此外,我们发现并行解码中生成质量下降的根本原因在于条件独立假设破坏了词元之间的依赖关系。为此,我们提出一种置信度感知的并行解码策略,只解码置信度超过阈值的词元,从而减少对依赖关系的破坏并保持生成质量。在LLaDA和Dream模型上、跨多个LLM基准的实验结果表明,该方法最高可带来27.6倍的吞吐量提升,且精度损失极小,缩小了与自回归模型的性能差距,为Diffusion LLM的实际部署铺平了道路。
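The confidence-aware parallel decoding idea in this abstract (commit only those masked positions whose prediction exceeds a confidence threshold) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the default threshold of 0.9, and the fallback that commits the single most confident position when nothing clears the threshold are all assumptions made here so that a decoding step always makes progress.

```python
def select_confident_tokens(probs, masked_positions, threshold=0.9):
    """Pick which masked positions to decode in one parallel step.

    probs            -- dict: position -> list of token probabilities
    masked_positions -- positions that are still undecoded
    threshold        -- confidence cutoff (assumed value, not from the paper)
    Returns a dict: position -> chosen token id.
    """
    best = {}
    for pos in masked_positions:
        dist = probs[pos]
        token = max(range(len(dist)), key=dist.__getitem__)
        best[pos] = (token, dist[token])

    # Commit every position whose top probability exceeds the threshold.
    chosen = {pos: tok for pos, (tok, conf) in best.items() if conf > threshold}

    # Assumed fallback: if nothing is confident enough, commit the single
    # most confident position so the step still decodes one token.
    if not chosen and best:
        pos = max(best, key=lambda p: best[p][1])
        chosen = {pos: best[pos][0]}
    return chosen
```

Positions left out of the returned dict stay masked and are revisited in a later step, which is how low-confidence tokens avoid being decoded under an independence assumption that does not hold for them.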
Article 292
Title@2025-05-28 (3): The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Title: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models | Der Entropie-Mechanismus des Verstärkten Lernens für sinnvolle Sprachmodelle | 理由语言模式强化学习的全英机制 2505.22617v1 |
Authors: Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across extensive RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by saturation of policy performance. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is gained at the expense of policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable: at H = 0, R = -a + b. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Having understood the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.
本文旨在克服在LLM推理中扩展强化学习(RL)的一个主要障碍,即策略熵的崩溃。在没有熵干预的大量RL运行中,这一现象被一致地观察到:策略熵在训练早期急剧下降,而这种探索能力的减弱总是伴随着策略性能的饱和。在实践中,我们建立了熵H与下游性能R之间的转换方程 R = -a * e^H + b。这一经验规律有力地表明,策略性能是以策略熵为代价换来的,因此受限于熵的耗尽,而且其上限完全可以预测:当H = 0时,R = -a + b。我们的发现表明,要为RL扩展算力并持续探索,就必须进行熵管理。为此,我们从理论和实证两方面研究了熵的动态。我们的推导指出,策略熵的变化由动作概率与logit变化之间的协方差驱动;在使用类策略梯度算法时,logit的变化与其优势成正比。实证研究表明,协方差项与熵差的数值完全吻合,支持了这一理论结论。此外,协方差项在整个训练过程中大多保持为正,这进一步解释了策略熵为何会单调下降。在理解熵动态背后机制的基础上,我们提出通过限制高协方差词元的更新来控制熵。具体而言,我们提出了两种简单而有效的技术,即Clip-Cov和KL-Cov,分别对高协方差的词元进行裁剪和施加KL惩罚。实验表明,这些方法能够鼓励探索,从而帮助策略摆脱熵崩溃并取得更好的下游性能。
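The empirical law quoted in the abstract, R = -a * e^H + b, is linear in x = e^H, so its coefficients can be estimated with ordinary least squares and the predicted ceiling read off at H = 0 (where e^0 = 1 gives R = -a + b). The sketch below illustrates this; the function names and the fitting procedure are assumptions made for illustration, and the abstract does not say how the authors fit a and b.

```python
import math


def predicted_performance(entropy, a, b):
    """Empirical law from the abstract: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def fit_entropy_law(entropies, performances):
    """Least-squares estimate of (a, b), using that R is linear in x = exp(H)."""
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(performances) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, performances))
    slope = sxr / sxx            # slope of R against exp(H), equal to -a
    a = -slope
    b = mean_r - slope * mean_x  # intercept of the regression line
    return a, b


def performance_ceiling(a, b):
    """Predicted performance once entropy is exhausted (H = 0): R = -a + b."""
    return predicted_performance(0.0, a, b)
```

Given (H, R) pairs logged during a run, `fit_entropy_law` returns the coefficients and `performance_ceiling` gives the value the law predicts once entropy is fully consumed.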
Article 293
Title@2025-05-28 (3): Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Title: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning | Bridging Supervised Learning und Verstärkung Lernen in Mathe-Reasoning | 在数学原因方面的受监督学习和强化学习架桥 2505.18116v2 |
Authors: Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs’ math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) – a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs’ generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
强化学习(RL)在LLM的数学能力最近激增中发挥了核心作用,通过二进制验证器信号使LLM的数学能力能够自我改进。相比之下,监督学习(SL)很少被考虑用于这种由核查驱动的培训,这主要是因为它严重依赖参考答案,无法反省错误。在这项工作中,我们质疑普遍的看法,即自我改进是专属于RL的,并提议负觉微调(NFT) – – 这种监督办法使LLM能够反省其失败,并在没有外部教师的情况下自行改进。在网上培训中,NFT不放弃自我生成的否定回答,而是在模拟这些否定回答时制定了隐含的消极政策。这一隐含的政策与我们积极LLM目标一样,即优化积极数据,使LLMM所有世代能够直接优化政策。我们在数学推理任务中就7B和32B模型进行实验。结果一贯表明,通过更多的负面反馈手段,NFT大大改进了SL基线,例如拒绝抽样抽样、匹配甚至超越了RL的硬盘的硬盘操作系统之间,在NPO和DGROPO的理论基础上完全体现了我们的S-GROPO和DGROGO等同的理论基础。
Article 294
Title@2025-05-28 (3): RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction | RICO: Verbesserung der Genauigkeit und Vollständigkeit in der Bildrekapitulation durch visuelle Rekonstruktion | RICO:通过视觉重建提高图像剪辑的准确性和完整性 2505.22613v1 |
Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
为了解决这些限制,我们建议RICO,这是一个通过视觉重建完善字幕的新框架。具体地说,我们利用文本到图像模型将字幕重建成参考图像,并促使MLLM 来查明原始和重新组合图像之间的差异,以完善标题。这一过程是迭接进行的,进一步逐步促进产生更加忠实和全面的描述。为了减轻由迭接过程引起的额外计算成本,我们引入了RICO-Frash,它学会了利用DPO生成像RICO那样的字幕。广泛的实验表明,我们的方法大大改进了字幕的准确性和完整性,在Caps bench和Compre Cap. Code上比大多数基线高出大约10%。
Article 295
Title@2025-05-28 (3): Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations
Title: Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations | Personalisiertes Kausaldiagramm zur Begründung von LLMs: Eine Fallstudie zu Ernährungsempfehlungen | LLLM女士的个人因果图:关于饮食建议的案例研究 2503.00134v2 |
Authors: Zhongqi Yang, Amir Rahmani
Large Language Models (LLMs) effectively leverage common-sense knowledge for general reasoning, yet they struggle with personalized reasoning when tasked with interpreting multifactor personal data. This limitation restricts their applicability in domains that require context-aware decision-making tailored to individuals. This paper introduces Personalized Causal Graph Reasoning as an agentic framework that enhances LLM reasoning by incorporating personal causal graphs derived from data of individuals. These graphs provide a foundation that guides the LLM’s reasoning process. We evaluate it on a case study on nutrient-oriented dietary recommendations, which requires personal reasoning due to the implicit unique dietary effects. We propose a counterfactual evaluation to estimate the efficiency of LLM-recommended foods for glucose management. Results demonstrate that the proposed method efficiently provides personalized dietary recommendations to reduce average glucose iAUC across three time windows, which outperforms the previous approach. LLM-as-a-judge evaluation results indicate that our proposed method enhances personalization in the reasoning process.
大型语言模型(LLMs)有效地利用常识知识进行一般推理,但在负责解释多因素个人数据时,它们却与个性化推理相挣扎。这一限制限制了它们在需要针对个人作出符合具体情况的决策的领域的适用性。本文介绍了个性化因果图的原因,作为通过纳入来自个人数据的个人因果图来增强LM推理的代理框架。这些图表为指导LLM的推理过程提供了基础。我们评价了关于营养导向饮食建议的案例研究,因为隐含的独特饮食效应需要个人推理。我们提议进行反事实评估,以估计LLM-推荐食品用于葡萄糖管理的效率。结果表明,拟议的方法有效地提供了个性化饮食建议,以减少与以往方法相形色色的普通葡萄食谱。LLM-as-a-a-a-a-judge评价结果表明,我们拟议的方法加强了推理过程的个人化。
Article 296
Title@2025-05-28 (3): AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling
Title: AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling | AutoElicit: Mit großen Sprachmodellen für vorausschauende Modellierung von Expertenvoraussagen | 自动:在预测模拟中使用大语言模型,供专家使用 2411.17284v5 |
Authors: Alexander Capstick, Rahul G. Krishnan, Payam Barnaghi
Large language models (LLMs) acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency often hinder their direct application for predictive tasks where privacy and interpretability are paramount. In fields such as healthcare, biology, and finance, specialised and interpretable linear models still hold considerable value. In such domains, labelled data may be scarce or expensive to obtain. Well-specified prior distributions over model parameters can reduce the sample complexity of learning through Bayesian inference; however, eliciting expert priors can be time-consuming. We therefore introduce AutoElicit to extract knowledge from LLMs and construct priors for predictive models. We show these priors are informative and can be refined using natural language. We perform a careful study contrasting AutoElicit with in-context learning and demonstrate how to perform model selection between the two methods. We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning. We show that AutoElicit saves over 6 months of labelling effort when building a new predictive model for urinary tract infections from sensor recordings of people living with dementia.
大型语言模型(LLMS)在不同领域获得广泛的信息。然而,它们的计算复杂性、成本和缺乏透明度往往阻碍直接应用其直接应用,在隐私和可解释性至关重要的预测性任务中,隐私和可解释性至关重要。在保健、生物学和金融等领域,专门和可解释的线性模型仍然具有相当大的价值。在这类领域,贴标签的数据可能稀缺,或者要花费昂贵才能获得。在模型参数上明确指定的先前分布可以减少通过巴伊西亚推理学习的样本复杂性;然而,吸引专家前科可能耗费时间。因此,我们引入了从LLMS提取知识的自动透明性,并为预测性模型建立前科。我们展示了这些前科信息,可以使用自然语言加以改进。我们开展了一项细致的研究,将AutoEllect与Intext学习进行比较,并展示了如何在两种方法之间进行模型选择。我们发现,AutoELI产生先期,可以大大减少不具有透明度的前科特征的错误,使用较少的标签,并持续地在文字学习中排外。我们显示,在建立新的尿道感染感官感官记录中可以节省超过6个月的标签工作6个月的时间。
Article 297
Title@2025-05-28 (3): SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Title: SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement | SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement | Synworld: 用于改进制剂行动知识的虚拟情景合成 2504.03561v2 |
Authors: Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
在代理商及其环境之间的互动中,代理商通过规划和执行行动扩大其能力;然而,LLM代理商在部署于新环境或需要导航非常规行动空间时面临巨大挑战;为增强代理商自主探索环境、优化工作流程和增进对行动的理解的能力,我们提议SynWorld,这是一个允许代理商在行动空间内以多步行动方式综合可能情景的框架,并进行蒙特卡洛树搜索(MCTS)探索,以有效完善其在目前环境中的行动知识。我们的实验证明SynWorld是在新环境中学习行动知识的有效和一般方法。代码可在https://github.com/zjunlp/SynWorld上查阅。
Article 298
Title@2025-05-28 (3): Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning
Title: Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning | Self-Error-Instruct: Verallgemeinern von Fehlern für LLMs Mathematische Begründung | 自错误教学法: 数学理由LLMs 的错误一般化 2505.22591v1 |
Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang
Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model’s (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs’ mathematical reasoning through error generalization.
尽管大型语言模型在各个领域表现出色,它们在数学推理中仍然存在大量错误案例。以往从错误中学习的方法仅从孤立的错误案例外推来合成训练数据,因而无法概括这些案例中蕴含的普遍模式。本文提出Self-Error-Instruct(SEI)框架,用于弥补这些模型弱点并合成更具泛化性的针对性训练数据。具体而言,我们在GSM8K和MATH两个数学数据集上考察目标模型,以找出错误案例。然后,我们基于指导模型(GPT-4o)的分析为这些案例生成错误关键短语,并通过对关键短语聚类来识别错误类型。接着,对于每种已识别的错误类型,我们在每次生成时抽取少量错误案例输入指导模型,由其采用自指令(self-instruct)方法合成额外的训练数据。这些新数据会经过单样本(one-shot)学习过程的筛选,以确保只保留最有效的样例。最后,我们用这些筛选后的数据微调目标模型,并迭代重复这一过程以提升性能。我们将该框架应用于多种模型,观察到它们在域内和域外数学数据集上的推理能力均有提升。这些结果证明了自错误指令通过错误泛化提升LLM数学推理能力的有效性。
Article 299
Title@2025-05-28 (3): Precise In-Parameter Concept Erasure in Large Language Models
Title: Precise In-Parameter Concept Erasure in Large Language Models | Präzise In-Parameter-Konzeptlöschung in großen Sprachmodellen | 大语言模型中精确的在写法内概念破损 2505.22586v1 |
Authors: Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
大型语言模型(LLMS)通常在培训前阶段获得知识,而这种知识在下游部署中是不可取的,例如敏感信息或版权内容。现有的消除这种知识的方法依赖于微调、培训低级适配者或事实编辑,但这些方法要么过于粗糙,太浅,要么无效。在这项工作中,我们提议PISCS(概念时代的隐含光量抑制),这是一个新颖的框架,通过直接编辑在参数空间中编码这些模型,精确地从模型参数参数参数参数参数中删除全部概念。PISCS使用不灵动模型将 MLP矢量分解成可解释的特性,确定与使用自动可读性技术的目标概念相关的方法,并将它们从模型参数中删除。在Gemma 2和Llama 3.1上对各种概念的实验表明,PISCSCS在领先的消缩方法的功效方面成效不大,将目标概念概念的精确度降低到7.7%,同时大大改进消化特性(达到31%)和坚固性(达到38%)。总体而言,这些结果显示,基于地段的模型的方法能够更精确地进行精确的编辑。
Article 300
Title@2025-05-28 (3): ReLearn: Unlearning via Learning for Large Language Models
Title: ReLearn: Unlearning via Learning for Large Language Models | ReLearn: Verlernen durch Lernen für große Sprachmodelle | ReLearn:通过学习实现大语言模型的遗忘 2502.11190v3 |
Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the subsequent tokens prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
Article 301
Title@2025-05-28 (3): Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts
Title: Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts | Weniger, aber besser: Effiziente Mehrsprachige Erweiterung für LLMs über schichtweise Mixture-of-Experts | 减少但更好:通过多层混合技术高效率地多语种扩展LLMs 2505.22582v1 |
Authors: Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving the proficient ability of old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher similarity, the fewer experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
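The allocation rule described in this abstract ("the higher similarity, the fewer experts") can be illustrated with a simple proportional scheme. The sketch below is an assumption-laden illustration, not the LayerMoE algorithm itself: it assumes per-layer similarities in [0, 1], weights each layer by 1 - similarity, and distributes a fixed expert budget with largest-remainder rounding. The function name and all three choices are hypothetical.

```python
def allocate_experts(similarities, total_experts):
    """Distribute a budget of new experts across layers.

    similarities  -- per-layer cross-language representation similarity,
                     assumed to lie in [0, 1]
    total_experts -- total number of new experts to place (assumed budget)
    Layers with higher similarity receive fewer experts.
    """
    weights = [1.0 - s for s in similarities]
    total_weight = sum(weights)
    if total_weight == 0:
        # Degenerate case: every layer is fully similar, so split evenly.
        weights = [1.0] * len(similarities)
        total_weight = float(len(similarities))

    raw = [total_experts * w / total_weight for w in weights]
    counts = [int(r) for r in raw]  # floor, since every share is non-negative

    # Largest-remainder rounding so the counts sum exactly to the budget.
    leftover = total_experts - sum(counts)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

With similarities measured per layer, a call such as `allocate_experts([0.9, 0.5, 0.1], 10)` places the most experts on the least similar layer while keeping the total fixed.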
Article 302
Title@2025-05-28 (3): Fusion Steering: Prompt-Specific Activation Control
Title: Fusion Steering: Prompt-Specific Activation Control | Fusionssteuerung: Prompt-spezifische Aktivierungskontrolle | 融合指导:即时具体活动控制 2505.22572v1 |
Authors: Waldemar Chang, Alhassan Yasin
We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring ≥ 0.6), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
nan
Article 303
Title@2025-05-28 (3): TLUE: A Tibetan Language Understanding Evaluation Benchmark
Title: TLUE: A Tibetan Language Understanding Evaluation Benchmark | TLUE: Ein Benchmark für die Bewertung der tibetischen Sprache | TLUE:西藏语言理解评估基准 2503.12051v3 |
Authors: Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Hao Wang, Xiao Feng, Yongbin Yu
Large language models (LLMs) have made tremendous progress in recent years, but low-resource languages, such as Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present TLUE (A Tibetan Language Understanding Evaluation Benchmark), the first large-scale benchmark for assessing LLMs’ capabilities in Tibetan. TLUE comprises two major components: (1) a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and (2) a safety benchmark covering 7 subdomains. We evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most LLMs perform below the random baseline, highlighting the considerable challenges LLMs face in processing Tibetan, a low-resource language. TLUE provides an essential foundation for driving future research and progress in Tibetan language understanding and underscores the need for greater inclusivity in LLM development.
nan
Article 304
Title@2025-05-28 (3): Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings
Title: Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings | Denken große Sprachmodelle wie das Gehirn? Sentence-Level-Evidenz aus fMRI und Hierarchischen Einbettungen | 大语言模型是否像大脑一样思考? 2505.22563v1 |
Authors: Yu Lei, Xingyang Ge, Yi Zhang, Yiming Yang, Bolei Ma
Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how hierarchical representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to precisely identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels.
nan
Article 305
Title@2025-05-28 (3): Preference Adaptive and Sequential Text-to-Image Generation
Title: Preference Adaptive and Sequential Text-to-Image Generation | Präferenz Adaptive und sequentielle Text-zu-Bild-Generierung | 适应性和顺序性文字到图像生成 2412.10419v2 |
Authors: Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, Craig Boutilier
We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.
nan
Article 306
Title@2025-05-28 (3): ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM
Title: ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM | ClaimPKG: Verbesserung der Claim-Verifikation durch Pseudo-Subgraphen-Generation mit leichtgewichtiger Spezial-LLM | ClaimPKG: 通过使用轻量级专门LLM的Pseudo子集成加强索赔核实 2505.22552v1 |
Authors: Hoang Pham, Thanh-Do Nguyen, Khac-Hoai Nam Bui
Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.
nan
Article 307
Title@2025-05-28 (3): Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs
Title: Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs | Emotion-o1: Adaptive lange Begründung für emotionales Verständnis in LLMs | 情感-o1:在LLMs中为情感理解提供适应性长的理由 2505.22548v1 |
Authors: Changhao Song, Yazhou Zhang, Peng Zhang
Emotion understanding includes basic tasks (e.g., sentiment/emotion classification) and advanced tasks (e.g., sarcasm/humor detection). Current methods rely on fixed-length CoT reasoning, failing to adapt to the varying complexity of emotions. We propose a task-adaptive reasoning framework that employs DeepSeek-R1 to generate variable-length reasoning chains for different emotion tasks. By combining fine-tuning with reinforcement learning, we design a composite reward function that balances four objectives: prediction accuracy, adaptive reasoning depth control, structural diversity in reasoning paths, and suppression of repetitive logic. This approach achieves dynamic context-sensitive inference while enabling LLMs to autonomously develop deep reasoning capabilities. Experimental results demonstrate consistent improvements in both Acc and F1 scores across four tasks: emotion, sentiment, humor, and sarcasm. Notably, peak enhancements reached 3.56% F1 (2.76% Acc) for basic tasks and 37.95% F1 (23.14% Acc) for advanced tasks. Our work bridges rigid CoT reasoning and emotional complexity through adaptive-depth analysis.
nan
Article 308
Title@2025-05-28 (3): Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
Title: Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments | Moderating Harm: Benchmarking von großen Sprachmodellen für Cyberbullying Detection in YouTube Kommentare | 在YouTube评论中为网络欺凌探测大语言模式制定基准 2505.18927v2 |
Authors: Amel Muminovic
As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen’s kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
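The reported metrics can be cross-checked with the standard F1 formula, F1 = 2PR / (P + R). The sketch below uses only the precision and recall values quoted in the abstract. The abstract reports an F1 score only for GPT-4.1, so the other two F1 values are derived here and are not figures from the paper. The function name `f1_score` and the dictionary layout are illustrative choices.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.3f}")

# The class counts quoted in the abstract should sum to the corpus size.
total_comments = 1334 + 3746
print(total_comments)
```

The GPT-4.1 value recomputed this way agrees with the 0.863 stated in the abstract, and the two class counts sum to the stated 5,080 comments.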
nan
Article 309
Title@2025-05-28 (3): Thinking with Generated Images
Title: Thinking with Generated Images | Mit generierten Bildern denken | 与生成图像一起思考 2505.22525v1 |
Authors: Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
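The "up to 50% (from 38% to 57%)" figure is a relative gain, not an absolute one. A minimal arithmetic sketch follows, assuming the two percentages are scores on the same metric (the abstract does not name the metric). The helper name `relative_improvement` is illustrative.

```python
def relative_improvement(before: float, after: float) -> float:
    """Gain expressed as a fraction of the starting value."""
    return (after - before) / before


before, after = 0.38, 0.57  # values quoted in the abstract

absolute_gain = after - before
relative_gain = relative_improvement(before, after)

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"relative gain: {relative_gain:.0%}")
```

The absolute gain is 19 percentage points, and 19 / 38 gives the 50% relative improvement the abstract quotes.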
nan
Article 310
Title@2025-05-28 (3): SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Title: SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond | SynLogic: Synthesizing verifizierbare reasoning data at scale for Learning Logical Reasoning and Beyond | 协同Logic:在学习逻辑理由及以后的尺度上综合可核实的理由数据 2505.19641v3 |
Authors: Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He
Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
nan
Article 311
Title@2025-05-28 (3): Multi-MLLM Knowledge Distillation for Out-of-Context News Detection
Title: Multi-MLLM Knowledge Distillation for Out-of-Context News Detection | Multi-MLLM-Wissensdestillation für Out-of-Context-Nachrichten-Erkennung | 用于脱离上下文新闻检测的多MLLM知识蒸馏 2505.22517v1 |
Authors: Yimeng Gu, Zhao Tong, Ignacio Castro, Shu Wu, Gareth Tyson
Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. Many existing works have leveraged multimodal large language models (MLLMs) for detecting out-of-context news. However, observing the limited zero-shot performance of smaller MLLMs, they generally require label-rich fine-tuning and/or expensive API calls to GPT models to improve the performance, which is impractical in low-resource scenarios. In contrast, we aim to improve the performance of small MLLMs in a more label-efficient and cost-effective manner. To this end, we first prompt multiple teacher MLLMs to generate both label predictions and corresponding rationales, which collectively serve as the teachers’ knowledge. We then introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers’ predictions conflict. This two-stage strategy reduces annotation costs and helps the student model uncover subtle patterns in more challenging cases. Experimental results demonstrate that our approach achieves state-of-the-art performance using less than 10% labeled data.
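Stage 2 operates only on data points where the teachers' predictions conflict. One way such a subset could be selected is sketched below. This is an assumption-laden illustration, not the authors' code: the teacher names, the binary label encoding, and the rule "any disagreement counts as a conflict" are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_teacher_agreement(
    teacher_preds: Dict[str, List[int]],
) -> Tuple[List[int], List[int]]:
    """Return (agree_idx, conflict_idx) over example indices.

    An example is a conflict if the teachers do not all predict the same label.
    """
    n_examples = len(next(iter(teacher_preds.values())))
    agree, conflict = [], []
    for i in range(n_examples):
        labels = {preds[i] for preds in teacher_preds.values()}
        if len(labels) == 1:
            agree.append(i)
        else:
            conflict.append(i)
    return agree, conflict


# Hypothetical label predictions (1 = out-of-context, 0 = in-context).
teacher_preds = {
    "teacher_a": [1, 0, 1, 0],
    "teacher_b": [1, 0, 0, 0],
    "teacher_c": [1, 1, 1, 0],
}

agree_idx, conflict_idx = split_by_teacher_agreement(teacher_preds)
print("agree:", agree_idx)
print("conflict (Stage 2 candidates):", conflict_idx)
```

In this toy setup, the conflict indices would be the examples passed on to the second fine-tuning stage.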
nan
Article 312
Title@2025-05-28 (3): Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations
Title: Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations | Vernunft ist nicht alles, was Sie brauchen: Prüfung LLMs für Multi-Turn Mental Health Conversations | 理由并非你所需要的全部:多发性心理健康对话的检查长 2505.20201v2 |
Authors: Mohit Chandra, Siddharth Sriraman, Harneet Singh Khanuja, Yiqiao Jin, Munmun De Choudhury
Limited access to mental healthcare, extended wait times, and the increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs to fulfill their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle with advanced diagnostic capabilities, with an average score of 31%. Additionally, we observed variation in model performance based on the patient’s persona and a performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset, and an evaluation framework for assessing LLMs in multi-turn mental health conversations.
nan
Article 313
Title@2025-05-28 (3): Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models
Title: Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models | Closed-Form Training Dynamics Reveal Erlernte Funktionen und lineare Struktur in Word2Vec-ähnlichen Modellen | 类似Word2Vec 模型中的封闭形式培训动态观测发现特性和线形结构 2502.09863v2 |
Authors: Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese
Self-supervised word embedding algorithms such as word2vec provide a minimal setting for studying representation learning in language modeling. We examine the quartic Taylor approximation of the word2vec loss around the origin, and we show that both the resulting training dynamics and the final performance on downstream tasks are empirically very similar to those of word2vec. Our main contribution is to analytically solve for both the gradient flow training dynamics and the final word embeddings in terms of only the corpus statistics and training hyperparameters. The solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on Wikipedia, we find that each of the top linear subspaces represents an interpretable topic-level concept. Finally, we apply our theory to describe how linear representations of more abstract semantic concepts emerge during training; these can be used to complete analogies via vector addition.
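Completing an analogy via vector addition can be illustrated with a toy example. The two-dimensional embeddings below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"). They are not learned vectors and do not come from the paper. The function names are likewise hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def complete_analogy(a: str, b: str, c: str, emb: Dict[str, Sequence[float]]) -> str:
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings: (royalty, gender). Purely illustrative values.
toy_embeddings = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}

answer = complete_analogy("man", "king", "woman", toy_embeddings)
print(answer)
```

The query words are excluded from the candidates, which is the usual convention when evaluating analogies this way.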
nan
Article 314
Title@2025-05-28 (3): EvolveSearch: An Iterative Self-Evolving Search Agent
Title: EvolveSearch: An Iterative Self-Evolving Search Agent | EvolveSearch: Ein iterativer, sich selbst entwickelnder Suchagent | EvolveSearch: 动态自我演变搜索代理 2505.22501v1 |
Authors: Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, Fei Huang
The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting their data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.
nan
Article 315
Title@2025-05-28 (3): Nonlinear second-order dynamics describe labial constriction trajectories across languages and contexts
Title: Nonlinear second-order dynamics describe labial constriction trajectories across languages and contexts | Nichtlineare Dynamiken der zweiten Ordnung beschreiben labiale Constriction-Trajektorien über Sprachen und Kontexte hinweg | 非线性第二序列动态描述不同语言和背景的实验室收缩轨迹 2410.08351v3 |
Authors: Michael C. Stern, Jason A. Shaw
We investigate the dynamics of labial constriction trajectories during the production of /b/ and /m/ in English and Mandarin. We find that, across languages and contexts, the ratio of instantaneous displacement to instantaneous velocity generally follows an exponential decay curve from movement onset to movement offset. We formalize this empirical discovery in a differential equation and, in combination with an assumption of point attractor dynamics, derive a nonlinear second-order dynamical system describing labial constriction trajectories. The equation has only two parameters, T and r. T corresponds to the target state and r corresponds to movement rapidity. Thus, each of the parameters corresponds to a phonetically relevant dimension of control. Nonlinear regression demonstrates that the model provides excellent fits to individual movement trajectories. Moreover, trajectories simulated from the model qualitatively match empirical trajectories, and capture key kinematic variables like duration, peak velocity, and time to achieve peak velocity. The model constitutes a proposal for the dynamics of individual articulatory movements, and thus offers a novel foundation from which to understand additional influences on articulatory kinematics like prosody, inter-movement coordination, and stochastic noise.
nan
Article 316
Title@2025-05-28 (3): Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks
Title: Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks | Positionale Fragilität in LLMs: Wie Offset-Effekte unser Verständnis von Gedächtnisrisiken verändern | LLMM中的位置易碎性:如何重塑抵消效应,我们如何理解记忆风险 2505.13171v2 |
Authors: Yixuan Xu, Antoni-Joan Solergibert i Llaquet, Antoine Bosselut, Imanol Schlag
Large language models are known to memorize parts of their training data, posing a risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies at lengths at least ten times longer than prior work. We thereby identified the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) a sharp decline in verbatim recall when the prefix begins offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.
nan
Article 317
Title@2025-05-28 (3): AdvAgent: Controllable Blackbox Red-teaming on Web Agents
Title: AdvAgent: Controllable Blackbox Red-teaming on Web Agents | AdvAgent: Kontrollierbare Blackbox Red-Teaming auf Web-Agenten | 助理:在网络代理上可控黑箱红队 2410.17401v3 |
Authors: Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li
Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms. We release code at https://ai-secure.github.io/AdvAgent/.
nan
Article 318
Title@2025-05-28 (3): Effective Context in Neural Speech Models
Title: Effective Context in Neural Speech Models | Effektiver Kontext in neuralen Sprachmodellen | 神经语音模式的有效背景 2505.22487v1 |
Authors: Yen Meng, Sharon Goldwater, Hao Tang
Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short – similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.
nan
Article 319
Title@2025-05-28 (3): How Do LLMs Perform Two-Hop Reasoning in Context?
Title: How Do LLMs Perform Two-Hop Reasoning in Context? | Wie führen LLMs Zwei-Hop-Reasoning im Kontext durch? | LLMs如何在上下文中执行双重理由? 2502.13913v2 |
Authors: Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
“Socrates is human. All humans are mortal. Therefore, Socrates is mortal.” This form of argument illustrates a typical pattern of two-hop reasoning. Formally, two-hop reasoning refers to the process of inferring a conclusion by making two logical steps, each connecting adjacent concepts, such that the final conclusion depends on the integration of both steps. It is one of the most fundamental components of human reasoning and plays a crucial role in both formal logic and everyday decision-making. Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems when distractors are present. We observe on a synthetic dataset that pre-trained LLMs often resort to random guessing among all plausible conclusions. However, after a few steps of fine-tuning, models achieve near-perfect accuracy and exhibit strong length generalization. To understand the underlying mechanisms, we train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow. We observe a clear progression in the attention logits throughout training. This depicts a sharp phase transition from an initial stage of random guessing to the emergence of a structured sequential query mechanism, where the model first retrieves the preceding and the bridge concepts in the early layers and then uses them to infer the final answer. Finally, we show that these dynamics can be captured by a minimal three-parameter attention-only network.
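The structure of a two-hop problem with distractors can be made concrete with a small symbolic sketch. It illustrates the task only and says nothing about the model's internal mechanism. The fact table, the distractor chain, and the function name `two_hop` are hypothetical.

```python
from typing import Dict, Optional


def two_hop(facts: Dict[str, str], source: str) -> Optional[str]:
    """Follow source -> bridge -> conclusion, or return None if a hop is missing."""
    bridge = facts.get(source)
    if bridge is None:
        return None
    return facts.get(bridge)


# One target chain plus one distractor chain sharing the same context.
facts = {
    "Socrates": "human",  # first hop of the target chain
    "human": "mortal",    # second hop of the target chain
    "Fido": "dog",        # distractor: first hop
    "dog": "animal",      # distractor: second hop
}

conclusion = two_hop(facts, "Socrates")

# Every second-hop endpoint is a plausible conclusion for a guessing strategy.
plausible = sorted({end for mid, end in facts.items() if mid in facts.values()})
chance_accuracy = 1 / len(plausible)

print(conclusion, plausible, chance_accuracy)
```

With one distractor chain there are two plausible conclusions, so guessing at random among them would be right half the time, whereas composing the two hops always recovers the intended answer.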
nan
Article 320
Title@2025-05-28 (3): FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation
Title: FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation | FitCF: Ein Framework für die automatische Feature-Importanz-geführte kontrafaktische Beispielgenerierung | FitCF: 自动地物、重要引导反事实实例生成框架 2501.00777v3 |
Authors: Qianli Wang, Nils Feldhus, Simon Ostermann, Luis Felipe Villa-Arenas, Sebastian Möller, Vera Schmitt
Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies the aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming two state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF’s core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals, which we hope will serve as an important finding for future research in this direction.
nan
Article 321
Title@2025-05-28 (3): ConKE: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning
Title: ConKE: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning | ConKE: Konzeptualisierung - Augmented Knowledge Editing in großen Sprachmodellen für Commonsense Reasoning | CONKE: 常识理由大语言模型中概念化-增强的知识编辑 2412.11418v2 |
Authors: Liyu Zhang, Weiqi Wang, Tianqing Fang, Yangqiu Song
Knowledge Editing (KE) aims to adjust a Large Language Model’s (LLM) internal representations and parameters to correct inaccuracies and improve output consistency without incurring the computational expense of re-training the entire model. However, editing commonsense knowledge still faces difficulties, including limited knowledge coverage in existing resources, the infeasibility of annotating labels for an overabundance of commonsense knowledge, and the strict knowledge formats of current editing methods. In this paper, we address these challenges by presenting ConceptEdit, a framework that integrates conceptualization and instantiation into the KE pipeline for LLMs to enhance their commonsense reasoning capabilities. ConceptEdit dynamically diagnoses implausible commonsense knowledge within an LLM using another verifier LLM and augments the source knowledge to be edited with conceptualization for stronger generalizability. Experimental results demonstrate that LLMs enhanced with ConceptEdit successfully generate commonsense knowledge with improved plausibility compared to other baselines and achieve stronger performance across multiple question answering benchmarks. Our data, code, and models are publicly available at https://github.com/HKUST-KnowComp/ConKE.
Article 322
Title@2025-05-28 (3): Fostering Video Reasoning via Next-Event Prediction
Title: Fostering Video Reasoning via Next-Event Prediction | Förderung von Video-Reasoning durch Next-Event-Vorhersage | 通过下一个晚上的预测促进视频宣传 2505.22457v1 |
Authors: Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang
Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
Article 323
Title@2025-05-28 (3): Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO | Unüberwachte Nachschulung für Multi-Modal LLM Reasoning via GRPO | 无人监督的多模式LLM通过GROPO进行多模式LLM进修培训后培训 2505.22453v1 |
Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate on. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% $\rightarrow$ 72.9% on MathVista, 62.9% $\rightarrow$ 68.7% on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
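A self-rewarding signal based on majority voting, as described in this abstract, might look like the following minimal sketch. The binary 1.0/0.0 reward and the tie-breaking rule are assumptions for illustration; the abstract does not specify them, and `majority_vote_rewards` is a hypothetical name.

```python
from collections import Counter
from typing import List


def majority_vote_rewards(answers: List[str]) -> List[float]:
    """Give reward 1.0 to sampled answers matching the most common answer, else 0.0.

    Ties are broken in favour of the answer that was sampled first
    (Counter.most_common keeps insertion order among equal counts).
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


# Final answers extracted from five sampled responses to the same question.
sampled = ["42", "41", "42", "42", "7"]
rewards = majority_vote_rewards(sampled)
```

Rewards of this kind would stand in for ground-truth correctness when computing group-relative advantages, which is why no labels are needed.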
Article 324
Title@2025-05-28 (3): Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts
Title: Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts | Gender-Neutral Große Sprachmodelle für medizinische Anwendungen: Reduzierung von Bias in PubMed Abstracts | 医疗应用的性别-新大语言性别模式:在普布迈德摘要中减少偏见 2501.06365v2 |
Authors: Elizabeth Schaefer, Kirk Roberts
This paper presents a pipeline for mitigating gender bias in large language models (LLMs) used in medical literature by neutralizing gendered occupational pronouns. A dataset of 379,000 PubMed abstracts from 1965-1980 was processed to identify and modify pronouns tied to professions. We developed a BERT-based model, “Modern Occupational Bias Elimination with Refined Training,” or “MOBERT,” trained on these neutralized abstracts, and compared its performance with “1965BERT,” trained on the original dataset. MOBERT achieved a 70% inclusive replacement rate, while 1965BERT reached only 4%. A further analysis of MOBERT revealed that pronoun replacement accuracy correlated with the frequency of occupational terms in the training data. We propose expanding the dataset and refining the pipeline to improve performance and ensure more equitable language modeling in medical applications.
Article 325
Title@2025-05-28 (3): RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning
Title: RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning | RAG-Zeval: Auf dem Weg zu einer robusten und interpretierbaren Bewertung von RAG-Antworten durch regelgeführte End-to-End-Relation | RAG-Zeval:努力通过最终至最终规则引导理由对RAG对策进行强力和解释性评价 2505.22430v1 |
Authors: Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models’ reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments with detailed explanations in one pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval’s superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
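To make the idea of a ranking-based outcome reward concrete, here is a minimal sketch that scores a predicted ranking by its pairwise agreement with a reference ranking. The pairwise-agreement formula, the function name, and the response identifiers are assumptions for illustration; the abstract does not state the exact reward used.

```python
from itertools import combinations
from typing import List


def pairwise_ranking_reward(predicted: List[str], reference: List[str]) -> float:
    """Fraction of item pairs ordered the same way in both rankings (best first)."""
    pred_pos = {item: i for i, item in enumerate(predicted)}
    ref_pos = {item: i for i, item in enumerate(reference)}
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (pred_pos[a] < pred_pos[b]) == (ref_pos[a] < ref_pos[b]) for a, b in pairs
    )
    return agree / len(pairs)


reference = ["resp_a", "resp_b", "resp_c"]  # hypothetical quality-controlled ordering
reward_perfect = pairwise_ranking_reward(["resp_a", "resp_b", "resp_c"], reference)
reward_partial = pairwise_ranking_reward(["resp_b", "resp_a", "resp_c"], reference)
```

A reward of this form depends only on relative preferences, so it avoids the need for calibrated absolute scores per response.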
Article 326
Title@2025-05-28 (3): AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Title: AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy | AstroVisBench: Ein Code-Bench für wissenschaftliche Computing und Visualisierung in der Astronomie | AstroVisbench:天文科学计算和可视化标准 2505.20538v2 |
Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, determining whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model’s ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench, we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This benchmark provides a strong end-to-end evaluation for AI scientists and offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
Article 327
Title@2025-05-28 (3): Token embeddings violate the manifold hypothesis
Title: Token embeddings violate the manifold hypothesis | Token-Einbettungen verletzen die mannigfaltige Hypothese | 托肯嵌入违反多重假设 2504.01002v2 |
Authors: Michael Robinson, Sourya Dey, Tony Chiang
A full understanding of the behavior of a large language model (LLM) requires an understanding of its input token space. If this space differs from our assumptions, our understanding of and conclusions about the LLM will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token $\psi$ implies a statistically significant irregularity in the token subspace in a $\psi$-neighborhood, $B(\psi)$. The structure assumed in the null is a generalization of a manifold with boundary called a smooth fiber bundle (which can be split into two spatial regimes: small and large radius), so we denote our new hypothesis test as the “fiber bundle hypothesis.” By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, if one prompt contains a token implicated by our test, the response to that prompt will likely exhibit less stability than the other.
Article 328
Title@2025-05-28 (3): Scaling Reasoning without Attention
Title: Scaling Reasoning without Attention | Skalierung ohne Aufmerksamkeit | 无人注意的调整理由 2505.22425v1 |
Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
Article 329
Title@2025-05-28 (3): Mitigating Overthinking in Large Reasoning Models via Manifold Steering
Title: Mitigating Overthinking in Large Reasoning Models via Manifold Steering | Überdenken in großen Vernunftmodellen durch Manifold Steering verhindern | 通过 MManicform 指导减轻大型理性模型中的过度思考 2505.22411v1 |
Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overhead. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first show that the tendency to overthink can be effectively captured by a single direction in the model’s activation space and that the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
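The core geometric step, projecting a steering direction onto a low-dimensional subspace so that off-subspace components are discarded, can be sketched as follows. This is a simplification under stated assumptions: the manifold is treated as a linear subspace with a known orthonormal basis, and the vectors are toy values. How the paper estimates the manifold is not described in the abstract.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_subspace(direction: Vector, basis: List[Vector]) -> Vector:
    """Project `direction` onto the span of `basis` (assumed orthonormal)."""
    projected = [0.0] * len(direction)
    for b in basis:
        coeff = dot(direction, b)
        projected = [p + coeff * x for p, x in zip(projected, b)]
    return projected


# Toy 4-dimensional activation space with a 2-dimensional subspace.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
raw_direction = [0.5, -1.0, 0.3, 0.2]  # last two coordinates play the role of noise
steering = project_onto_subspace(raw_direction, basis)
```

In this toy case the projected direction keeps only the first two coordinates, which is the sense in which projection removes components lying outside the low-dimensional structure.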
Article 330
Title@2025-05-28 (3): Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
Title: Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring | Jenseits von externen Monitoren: Verbesserung der Transparenz von großen Sprachmodellen für eine einfachere Überwachung | 外部监测之外的外部监测:提高大语言模型的透明度,促进更易监测 2502.05242v2 |
Authors: Guanxu Chen, Dongrui Liu, Tao Luo, Lijie Hu, Jing Shao
Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to monitor LLMs, but this strategy fails to accurately reflect LLMs’ thinking process. Techniques based on LLMs’ hidden representations provide an inner perspective to monitor their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the applications of TELLME on trustworthiness tasks (e.g., safety risk monitoring tasks and detoxification tasks), where LLMs achieve consistent improvement in transparency and task performance. More crucially, we theoretically analyze the improvement of TELLME on LLMs’ generalization ability through optimal transport theory.
Article 331
Title@2025-05-28 (3): GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM
Title: GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM | GOAT-TTS: Expressive und realistische Sprachgenerierung über eine Dual-Branch LLM | GOAT-TTS:通过双层LLM, 表达和现实的发声 2504.12339v2 |
Authors: Yaodong Song, Hongjie Chen, Jie Lian, Yuxin Zhang, Guangmin Xia, Zehan Li, Genliang Zhao, Jian Kang, Jie Li, Yongxiang Li, Xuelong Li
While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM’s native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-n layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
Article 332
Title@2025-05-28 (3): Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Title: Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space | Breaking the Ceiling: Das Potenzial von Jailbreak-Angriffen durch Erweiterung des Strategieraums erkunden | 打破上限:通过扩大战略空间探索越狱袭击的可能性 2505.21277v2 |
Authors: Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei
Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.
Article 333
Title@2025-05-28 (3): Which Demographics do LLMs Default to During Annotation?
Title: Which Demographics do LLMs Default to During Annotation? | Welche Demographien haben LLMs während der Annotation voreingestellt? | 在批注期间,LLMs会默认给哪些人种? 2410.08820v3 |
Authors: Johannes Schäfer, Aidan Combs, Christopher Bagdon, Jiahui Li, Nadine Probol, Lynn Greschner, Sean Papay, Yarik Menchaca Resendiz, Aswathy Velutharambath, Amelie Wührl, Sabine Weber, Roman Klinger
Demographics and cultural background of annotators influence the labels they assign in text annotation – for instance, an elderly woman might find it offensive to read a message addressed to a “bro”, but a male teenager might find it appropriate. It is therefore important to acknowledge label variations so as not to under-represent members of a society. Two research directions developed out of this observation in the context of using large language models (LLMs) for data annotation, namely (1) studying biases and inherent knowledge of LLMs and (2) injecting diversity in the output by manipulating the prompt with demographic information. We combine these two strands of research and ask which demographics an LLM resorts to when none are given. To answer this question, we evaluate which attributes of human annotators LLMs inherently mimic. Furthermore, we compare non-demographic conditioned prompts and placebo-conditioned prompts (e.g., “you are an annotator who lives in house number 5”) to demographics-conditioned prompts (“You are a 45 year old man and an expert on politeness annotation. How do you rate {instance}”). We study these questions for politeness and offensiveness annotations on the POPQUORN data set, a corpus created in a controlled manner to investigate human label variations based on demographics which has not been used for LLM-based analyses so far. We observe notable influences related to gender, race, and age in demographic prompting, which contrasts with previous studies that found no such effects.
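The three prompting conditions compared in this abstract can be laid out as a small sketch. The persona sentences are taken from the examples quoted above; the `build_prompt` helper, the exact question wording, and the sample instance are hypothetical.

```python
from typing import Dict


def build_prompt(instance: str, persona: str = "") -> str:
    """Prefix an optional persona sentence to a rating question."""
    question = f'How do you rate "{instance}"?'
    return f"{persona} {question}".strip()


# One persona prefix per condition; the empty string is the no-demographics baseline.
conditions: Dict[str, str] = {
    "none": "",
    "placebo": "You are an annotator who lives in house number 5.",
    "demographic": "You are a 45 year old man and an expert on politeness annotation.",
}

prompts = {
    name: build_prompt("Thanks, bro!", persona) for name, persona in conditions.items()
}
```

Comparing model labels across these conditions is what lets the study ask which demographic-conditioned behaviour the unconditioned prompt most resembles; the placebo condition controls for the effect of adding any persona text at all.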
Article 334
Title@2025-05-28 (3): LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
Title: LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High | LLMs kämpfen, um falsche Annahmen zurückzuweisen, wenn Fehlinformationsstakes hoch sind | LLM LLM 努力拒绝错误信息摄入量高时的假假设 2505.22354v1 |
Authors: Judith Sieker, Clara Lachenmaier, Sina Zarrieß
This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less inclined to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI’s GPT-4-o, Meta’s LLama-3-8B, and MistralAI’s Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.
Article 335
Title@2025-05-28 (3): Explicit Learning and the LLM in Machine Translation
Title: Explicit Learning and the LLM in Machine Translation | Explizites Lernen und das LLM in maschineller Übersetzung | 计算机翻译方面的明确学习和LLM 2503.09454v3 |
Authors: Malik Marmonier, Rachel Bawden, Benoît Sagot
This study explores an LLM’s ability to learn new languages using explanations found in a grammar book, a process we term “explicit learning.” To rigorously assess this ability, we design controlled translation experiments between English and constructed languages generated, by specific cryptographic means, out of Latin or French. Contrary to previous studies, our results demonstrate that LLMs do possess a measurable capacity for explicit learning. This ability, however, diminishes as the complexity of the linguistic phenomena to be learned increases. Supervised fine-tuning on ad hoc chains of thought significantly enhances LLM performance but struggles to generalize to typologically novel or more complex linguistic features. These findings point to the need for more diverse training sets and alternative fine-tuning strategies to further improve explicit learning by LLMs, benefiting low-resource languages typically described in grammar books but lacking extensive corpora.
Article 336
Title@2025-05-28 (3): Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning
Title: Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning | Was hält Set Matters für LLM Unlearning auf? Eine Fallstudie über Entity Unlearning | 哪些保留LLM 重新学习的设置事项? 关于实体重新学习的案例研究 2502.11441v3 |
Authors: Hwan Chang, Hwanhee Lee
Large language models (LLMs) risk retaining unauthorized or sensitive information from their training data, which raises privacy concerns. LLM unlearning seeks to mitigate these risks by selectively removing specified data while maintaining overall model performance. However, most existing work focuses on methods to achieve effective forgetting and does not provide a detailed analysis of the retain set, the portion of training data that is not targeted for removal. In this paper, we investigate the effects of unlearning on various subsets of the retain set through a case study on entity unlearning. We introduce the Syntactically Similar Neighbor Set, a group of queries that share similar syntactic structures with the data targeted for removal, and show that this subset suffers the greatest performance drop during unlearning. Moreover, when used for regularization, this set not only preserves performance on syntactically similar queries but also delivers comparable or improved results across other data subsets. Our results highlight that syntactic similarity is a critical factor, potentially more so than domain or entity relationships, in achieving effective and practical LLM unlearning.
Article 337
Title@2025-05-28 (3): Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance
Title: Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance | Semantische Veränderung in Slowenien nachvollziehen: Ein neuartiger Datensatz und optimaler transportbasierter Abstand | 跟踪斯洛文尼亚语语语义变化:新数据集和最佳运输距离 2402.16596v2 |
Authors: Marko Pranjić, Kaja Dobrovoljc, Senja Pollak, Matej Martinc
In this paper, we focus on the detection of semantic changes in Slovene, a less resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insight into the evolution of language caused by changes in society and culture. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We analyze an important class of semantic change measures based on the average pairwise distance and identify several limitations. To address these limitations, we propose a novel metric based on regularized optimal transport, which offers a more robust framework for quantifying semantic change. We provide a comprehensive evaluation of various existing semantic change detection methods and associated semantic change measures on our dataset. Through empirical testing, we demonstrate that our proposed approach, leveraging regularized optimal transport, achieves either matching or improved performance compared to baseline approaches.
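For readers unfamiliar with the baseline family analyzed here, the average pairwise distance between a word's usage embeddings in two time periods can be sketched as below. Cosine distance and the toy 2-dimensional vectors are assumptions for illustration; the abstract does not specify the distance function or the embedding model, and the proposed optimal-transport metric is not shown.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def average_pairwise_distance(period_a: List[Vector], period_b: List[Vector]) -> float:
    """Mean distance over all cross-period pairs of usage embeddings."""
    total = sum(cosine_distance(u, v) for u in period_a for v in period_b)
    return total / (len(period_a) * len(period_b))


# Toy usage embeddings of one target word in an older and a newer corpus.
old_usages = [[1.0, 0.0], [1.0, 0.0]]
new_usages = [[1.0, 0.0], [0.0, 1.0]]
apd = average_pairwise_distance(old_usages, new_usages)
```

Because every cross-period pair contributes equally, a measure of this kind is sensitive to how usages are distributed within each period, which is one motivation for transport-based alternatives that match usages across periods instead.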
Article 338
Title@2025-05-28 (3): Text2Grad: Reinforcement Learning from Natural Language Feedback
Title: Text2Grad: Reinforcement Learning from Natural Language Feedback | Text2Grad: Stärkung des Lernens aus natürlicher Sprache Feedback | Text2Grad:从自然语言反馈中加强学习 2505.22338v1 |
Authors: Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model’s policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on answers while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad
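The contrast between a single scalar reward and span-level rewards can be made concrete with a small sketch that spreads span rewards over individual tokens. The span format, the +1/-1 values, and the example sentence are hypothetical; the sketch covers only the bookkeeping step, not the reward model or the policy update.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, reward for that span)
Span = Tuple[int, int, float]


def spans_to_token_rewards(num_tokens: int, spans: List[Span]) -> List[float]:
    """Spread span-level rewards over tokens; tokens outside every span get 0.0."""
    rewards = [0.0] * num_tokens
    for start, end, value in spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            rewards[i] += value
    return rewards


tokens = ["The", "capital", "of", "France", "is", "Lyon", "."]
# A critique such as "the city is wrong" would be aligned to the token "Lyon".
spans = [(0, 5, 1.0), (5, 6, -1.0)]
token_rewards = spans_to_token_rewards(len(tokens), spans)
```

With per-token rewards like these, an update can penalize only the offending span rather than the whole response.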
Article 339
Title@2025-05-28 (3): Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start | Multimodale Reasoning durch verstärktes Lernen mit kaltem Start fördern | 通过 “ 冷起 “ 的强化学习推进多模式理由 2505.22334v1 |
Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While “aha moment” patterns, where models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% $\rightarrow$ 73.4% on MathVista, 62.9% $\rightarrow$ 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
Article 340
Title@2025-05-28 (3): LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models
Title: LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models | LLMs denken, aber nicht in Ihrem Fluss: Grund-Level-Personalisierung für Black-Box große Sprachmodelle | LLM Think, But not in your roll: 黑人大语言模型的理性程度个人化 2505.21082v2 |
Authors: Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee
Large language models (LLMs) have recently achieved impressive performance across a wide range of natural language tasks and are now widely used in real-world applications. Among them, black-box LLMs–served via APIs without access to model internals–are especially dominant due to their scalability and ease of deployment. Despite their strong capabilities, these models typically produce generalized responses that overlook personal preferences and reasoning styles. This has led to growing interest in black-box LLM personalization, which aims to tailor model outputs to user-specific context without modifying model parameters. However, existing approaches primarily focus on response-level personalization, attempting to match final outputs without modeling the user’s thought process. To address this limitation, we propose RPM, a framework for reasoning-level personalization that aligns the model’s reasoning process with a user’s personalized logic. RPM first constructs statistical user-specific factors by extracting and grouping response-influential features from user history. It then builds personalized reasoning paths that reflect how these factors are used in context. In the inference stage, RPM retrieves reasoning-aligned examples for new queries via feature-level similarity and performs inference conditioned on the structured factors and retrieved reasoning paths, enabling the model to follow user-specific reasoning trajectories. This reasoning-level personalization enhances both predictive accuracy and interpretability by grounding model outputs in user-specific logic through structured information. Extensive experiments across diverse tasks show that RPM consistently outperforms response-level personalization methods, demonstrating the effectiveness of reasoning-level personalization in black-box LLMs.
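The retrieval step at inference time can be pictured with a small sketch. The abstract does not say which similarity measure RPM uses, so Jaccard overlap between feature sets is an assumption here, as are the function names, the record layout, and the toy history.

```python
def jaccard(a, b):
    """Overlap between two feature sets, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_examples(query_features, history, k=2):
    """Return the k history entries whose features best match the query."""
    ranked = sorted(
        history,
        key=lambda ex: jaccard(query_features, ex["features"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical user history: features extracted from past responses,
# each paired with the reasoning path built for that response.
history = [
    {"features": {"price", "battery"}, "reasoning": "Weighs battery life over price."},
    {"features": {"design"}, "reasoning": "Rewards minimalist design."},
    {"features": {"price", "durability"}, "reasoning": "Penalizes fragile builds."},
]
top = retrieve_examples({"price", "battery", "screen"}, history, k=2)
print([ex["reasoning"] for ex in top])
```

The retrieved reasoning paths would then be placed in the prompt alongside the structured factors, since the black-box model itself cannot be modified.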
Article 341
Title@2025-05-28 (3): Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering
Title: Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering | Prompt-basierte Persönlichkeit Profiling: Verstärkung Lernen für Relevanz Filtern | 即时个人特征分析:加强学习促进相关性过滤 2409.04122v2 |
Authors: Jan Hofmann, Cornelia Sindermann, Roman Klinger
Author profiling is the task of inferring characteristics about individuals by analyzing content they share. Supervised machine learning still dominates automatic systems that perform this task, despite the popularity of prompting large language models to address natural language understanding tasks. One reason is that the classification instances consist of large amounts of posts, potentially a whole user profile, which may exceed the input length of Transformers. Even if a model can use a large context window, the sheer number of posts makes the application of API-accessed black-box systems costly and slow, in addition to the issues that come with such “needle-in-the-haystack” tasks. To mitigate this limitation, we propose a new method for author profiling which aims at distinguishing relevant from irrelevant content first, followed by the actual user profiling only with relevant data. To circumvent the need for relevance-annotated data, we optimize this relevance filter via reinforcement learning with a reward function that utilizes the zero-shot capabilities of large language models. We evaluate our method for Big Five personality trait prediction on two Twitter corpora. On publicly available real-world data with a skewed label distribution, our method shows similar efficacy to using all posts in a user profile, but with a substantially shorter context. An evaluation on a version of these data balanced with artificial posts shows that the filtering to relevant posts leads to a significantly improved accuracy of the predictions.
Article 342
Title@2025-05-28 (3): NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment
Title: NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment | NLP für soziales Gut: Eine Übersicht über Herausforderungen, Chancen und verantwortungsvolle Umsetzung | NLP 社会公益:挑战、机会和负责任的部署调查 2505.22327v1 |
Authors: Antonia Karamolegkou, Angana Borah, Eunjung Cho, Sagnik Ray Choudhury, Martina Galletti, Rajarshi Ghosh, Pranav Gupta, Oana Ignat, Priyanka Kargupta, Neema Kotonya, Hemank Lamba, Sun-Joo Lee, Arushi Mangla, Ishani Mondal, Deniz Nazarova, Poli Nemkova, Dina Pisarevskaya, Naquee Rizwan, Nazanin Sabri, Dominik Stammbach, Anna Steinberg, David Tomás, Steven R Wilson, Bowen Yi, Jessica H Zhu, Arkaitz Zubiaga, Anders Søgaard, Alexander Fraser, Zhijing Jin, Rada Mihalcea, Joel R. Tetreault, Daryna Dementieva
Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.
Article 343
Title@2025-05-28 (3): Advancing Expert Specialization for Better MoE
Title: Advancing Expert Specialization for Better MoE | Advancing Experten-Spezialisierung für bessere MoE | 推进专家专业专业促进改善教育部 2505.22323v1 |
Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
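The two objectives can be sketched in a few lines. The abstract does not give their exact form, so the definitions below are assumptions chosen for illustration: the orthogonality term is the mean squared cosine similarity between experts' output vectors, and the variance term is the negative per-token variance of the routing probabilities.

```python
import math


def orthogonality_loss(expert_outputs):
    """Mean squared cosine similarity between distinct experts' outputs."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    n = len(expert_outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0  # a single expert has nothing to overlap with
    total = sum(cosine(expert_outputs[i], expert_outputs[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


def variance_loss(router_probs):
    """Negative mean per-token variance of routing probabilities.

    Minimizing this pushes each token's routing distribution away from
    uniform, i.e. toward more discriminative routing.
    """
    total = 0.0
    for probs in router_probs:
        mean = sum(probs) / len(probs)
        total += sum((p - mean) ** 2 for p in probs) / len(probs)
    return -total / len(router_probs)


print(orthogonality_loss([[1.0, 0.0], [0.0, 1.0]]))  # experts do not overlap
print(variance_loss([[1.0, 0.0]]), variance_loss([[0.5, 0.5]]))
```

In this toy form both terms would simply be added to the usual auxiliary load-balancing loss with small weights; how the paper weights and combines them is not stated in the abstract.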
Article 344
Title@2025-05-28 (3): Core Context Aware Transformers for Long Context Language Modeling
Title: Core Context Aware Transformers for Long Context Language Modeling | Core Context Aware Transformers für lange Kontext-Sprachenmodellierung | 长语语言建模核心认知变型器 2412.12465v2 |
Authors: Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in a wide range of tasks, primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute attention. However, when the context length L becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase. The redundant context not only hampers the modeling representation performance but also incurs unnecessary computational and storage overhead. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules: 1) A globality-aware pooling module groups input tokens and dynamically compresses each group into one core token based on their significance. In this way, our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling. 2) A locality-preserving module incorporates neighboring tokens to preserve local context for detailed representation. Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost. Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
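The pooling idea in the first module can be illustrated with a minimal sketch: tokens are split into fixed-size groups and each group is collapsed into one weighted average. How CCA-Attention actually scores significance is not described in the abstract, so the per-token `scores` input, the softmax weighting, and the function name are assumptions.

```python
import math


def pool_core_tokens(tokens, scores, group_size):
    """Compress each group of token vectors into one softmax-weighted core token."""
    core = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        group_scores = scores[start:start + group_size]
        peak = max(group_scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in group_scores]
        total = sum(weights)
        dim = len(group[0])
        core.append(
            [sum(w * vec[d] for w, vec in zip(weights, group)) / total for d in range(dim)]
        )
    return core


tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
# Equal scores reduce the pooling to a plain average within each group.
core_tokens = pool_core_tokens(tokens, [0.0, 0.0, 0.0, 0.0], group_size=2)
print(core_tokens)
```

Attention over the resulting core tokens touches `len(tokens) / group_size` positions instead of every position, which is where the efficiency gain in such a scheme would come from.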
Article 345
Title@2025-05-28 (3): Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Title: Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation | Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation | 对回收增加的一代输出进行实况调查的不确定性量化 2505.21072v2 |
Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model’s internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
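One plausible way to read "different UQ techniques depending on faithfulness" is as a mixture over the two cases. The sketch below combines a faithfulness probability with two case-specific factuality estimates using the law of total probability. This combination rule, the function name, and the example numbers are assumptions for illustration; the abstract does not specify how FRANQ merges its estimates.

```python
def factuality_estimate(p_faithful, p_true_if_faithful, p_true_if_unfaithful):
    """Combine case-specific factuality estimates, weighted by faithfulness."""
    for p in (p_faithful, p_true_if_faithful, p_true_if_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return (
        p_faithful * p_true_if_faithful
        + (1.0 - p_faithful) * p_true_if_unfaithful
    )


# A claim that is probably grounded in the retrieved passage ...
grounded = factuality_estimate(0.9, 0.95, 0.40)
# ... versus one that mostly relies on the model's own parametric knowledge.
ungrounded = factuality_estimate(0.1, 0.95, 0.40)
print(grounded, ungrounded)
```

The point of the split is that an unfaithful claim is not automatically scored as a hallucination: it is judged by the second estimator instead.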
Article 346
Title@2025-05-28 (3): If Pigs Could Fly… Can LLMs Logically Reason Through Counterfactuals?
Title: If Pigs Could Fly… Can LLMs Logically Reason Through Counterfactuals? | Wenn Schweine fliegen könnten… können LLMs logischerweise durch Gegenfakten denken? | 如果猪能飞… 2505.22318v1 |
Authors: Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Anish R Joishy, Manas Gaur, Krishnaprasad Thirunarayan, Ponnurangam Kumaraguru
Large Language Models (LLMs) demonstrate impressive reasoning capabilities in familiar contexts, but struggle when the context conflicts with their parametric knowledge. To investigate this phenomenon, we introduce CounterLogic, a dataset containing 1,800 examples across 9 logical schemas, explicitly designed to evaluate logical reasoning through counterfactual (hypothetical knowledge-conflicting) scenarios. Our systematic evaluation of 11 LLMs across 6 different datasets reveals a consistent performance degradation, with accuracies dropping by 27% on average when reasoning through counterfactual information. We propose Self-Segregate, a prompting method enabling metacognitive awareness (explicitly identifying knowledge conflicts) before reasoning. Our method dramatically narrows the average performance gap from 27% to just 11%, while significantly increasing the overall accuracy (+7.5%). We discuss the implications of these findings and draw parallels to human cognitive processes, particularly on how humans disambiguate conflicting information during reasoning tasks. Our findings offer practical insights for understanding and enhancing LLMs’ reasoning capabilities in real-world applications, especially where models must logically reason independently of their factual knowledge.
Article 347
Title@2025-05-28 (3): MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Title: MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections | MUDDFormer: Breaking Residual Engpässe in Transformatoren über Multiway Dynamic Dense Connections | MUDDFormer:通过多路动态感应连接在变形器中打破残余瓶颈 2502.12170v2 |
Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer .
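A toy version of a dynamic dense connection, for a single sequence position and a single input stream, might look as follows. The single linear projection used to generate the connection weights, the function name, and the example numbers are assumptions for illustration; the authors' implementation is in the linked repository.

```python
def dynamic_dense_aggregate(layer_outputs, weight_proj):
    """Mix the outputs of all layers so far for ONE sequence position.

    layer_outputs: list of hidden vectors, one per layer computed so far.
    weight_proj:   one row per layer; the dot product of a row with the
                   latest hidden state gives that layer's connection weight,
                   so the weights change with the position's hidden state.
    """
    current = layer_outputs[-1]
    weights = [sum(w * h for w, h in zip(row, current)) for row in weight_proj]
    dim = len(current)
    return [
        sum(wt * out[d] for wt, out in zip(weights, layer_outputs))
        for d in range(dim)
    ]


layer_outputs = [[1.0, 0.0], [0.0, 1.0]]  # outputs of layers 0 and 1
weight_proj = [[0.0, 2.0], [0.0, 1.0]]    # hypothetical learned projection
mixed = dynamic_dense_aggregate(layer_outputs, weight_proj)
print(mixed)
```

In the multiway setting the same aggregation would be repeated with a separate projection for each stream (query, key, value, residual), so each stream can draw on a different mix of earlier layers.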
Article 348
Title@2025-05-28 (3): Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
Title: Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching | Kann Code-Switched Texts einen Wissensschalter in LLMs aktivieren? Eine Fallstudie zum Englisch-Koreanischen Code-Switching | 密码转换的文本能否激活LLML中的知识开关? 关于英朝法典转换的案例研究 2410.18436v2 |
Authors: Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee
Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to the dominance of English in training corpora. The limited resources available for low-resource languages remain a crucial challenge. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can otherwise be lost in translation, and it elicits language-specific knowledge in human communication. In light of this, we investigate whether code-switching can ‘activate’, or identify and leverage, knowledge for reasoning when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis of a variety of multilingual LLMs by subdividing the activation process into knowledge identification and knowledge leveraging. Our results demonstrate that, compared to English text, CS can faithfully activate knowledge inside LLMs, especially in language-specific domains, suggesting the potential of code-switching for low-resource language tasks.
Article 349
Title@2025-05-28 (3): LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
Title: LLäMmlein: Compact and Competitive German-Only Language Models from Scratch | LLäMmlein: Kompakte und wettbewerbsfähige deutschsprachige Sprachmodelle von Scratch | LläMmlein:来自斯克拉奇的契约和竞争性独德语言模式 2411.11171v4 |
Authors: Jan Pfister, Julia Wunderle, Andreas Hotho
We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models’ learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
Article 350
Title@2025-05-28 (3): Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing
Title: Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing | Adaptive Entgiftung: Schutz der allgemeinen Fähigkeiten von LLMs durch Toxicity-Aware Knowledge Editing | 适应性解毒:通过毒理学知识编辑来保护长效虫的一般能力 2505.22298v1 |
Authors: Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, Min Zhang
Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs’ general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.
Article 351
Title@2025-05-28 (3): 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Title: 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training | 360-LlaMA-Fabrik: Plug & Play-Sequenz-Parallelität für langes Nachtraining | 360-LLamaMA-Factory: 长期培训之后的插件和播放序列平行主义 2505.22296v1 |
Authors: Haosheng Zou, Xiaowei Lv, Shousheng Jia, Xiangzheng Zhang
Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460) and TinyR1 (arXiv:2503.04872), in Kaggle AIMO math models, and in large companies’ training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
Article 352
Title@2025-05-28 (3): Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Title: Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond | Light-R1: Curriculum SFT, DPO und RL für Long COT aus Scratch und darüber hinaus | Light-R1:SFT、DPO和RL课程,用于Scratch及以后的长期COT 2503.10460v4 |
Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
This paper introduces Light-R1, an open-source suite for training long reasoning models using a reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
Article 353
Title@2025-05-28 (3): Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs
Title: Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs | Kompensieren für Daten mit Vernunft: Low-Resource-Maschinenübersetzung mit LLMs | 以合理理由补偿数据:低资源机器翻译与LLMM 2505.22293v1 |
Authors: Samuel Frontull, Thomas Ströhle
Large Language Models (LLMs) have demonstrated strong capabilities in multilingual machine translation, sometimes even outperforming traditional neural systems. However, previous research has highlighted the challenges of using LLMs, particularly with prompt engineering, for low-resource languages. In this work, we introduce Fragment-Shot Prompting, a novel in-context learning method that segments input and retrieves translation examples based on syntactic coverage, along with Pivoted Fragment-Shot, an extension that enables translation without direct parallel data. We evaluate these methods using GPT-3.5, GPT-4o, o1-mini, LLaMA-3.3, and DeepSeek-R1 for translation between Italian and two Ladin variants, revealing three key findings: (1) Fragment-Shot Prompting is effective for translating into and between the studied low-resource languages, with syntactic coverage positively correlating with translation quality; (2) Models with stronger reasoning abilities make more effective use of retrieved knowledge, generally produce better translations, and enable Pivoted Fragment-Shot to significantly improve translation quality between the Ladin variants; and (3) prompt engineering offers limited, if any, improvements when translating from a low-resource to a high-resource language, where zero-shot prompting already yields satisfactory results. We publicly release our code and the retrieval corpora.
Article 354
Title@2025-05-28 (3): Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling
Title: Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling | Das Unlösbare neu denken: Wenn In-Context Search Test-Time Scaling trifft | 重新思考无法解答的问题: 当 In-Ctext 搜索遇到测试时间缩放时 2505.22290v1 |
Authors: Fanzeng Xia, Yidong Luo, Tinko Sebastian Bartels, Yaqi Xu, Tongxin Li
Recent research has highlighted that Large Language Models (LLMs), even when trained to generate extended reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques for eliciting LLMs’ deliberate reasoning before concluding that LLMs have hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by applying advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed “unsolvable” (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.
Article 355
Title@2025-05-28 (3): Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review
Title: Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review | Natürliche Sprachverarbeitung zur Unterstützung der evidenzbasierten Medizin: Eine Bewertung | 支持循证医学的自然语言处理:范围审查 2505.22280v1 |
Authors: Zihan Xu, Haotian Ma, Gongbo Zhang, Yihao Ding, Chunhua Weng, Yifan Peng
Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM – Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.
Article 356
Title@2025-05-28 (3): Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Title: Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration | Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration | 利用语言代理框架中的双重进程理论促进实时同时人类-AI合作 2502.11882v5 |
Authors: Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent’s System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent’s System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. The code for DPT-Agent can be found at https://github.com/sjtu-marl/DPT-Agent.
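The System 1 idea, a finite-state machine whose states map to small pieces of policy code, can be sketched as below. Every state name, event name, and action string is hypothetical and chosen only to show the pattern; none of them comes from the DPT-Agent repository.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "order_received"): "fetching",
    ("fetching", "item_picked"): "delivering",
    ("delivering", "item_delivered"): "idle",
}

# Code-as-policy: each state maps to a small function that picks an action.
POLICIES = {
    "idle": lambda: "wait",
    "fetching": lambda: "move_to_item",
    "delivering": lambda: "move_to_counter",
}


def step(state, event):
    """Advance the FSM; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "idle"
trace = []
for event in ["order_received", "item_picked", "unknown_event", "item_delivered"]:
    state = step(state, event)
    trace.append((state, POLICIES[state]()))
print(trace)
```

Because each step is a dictionary lookup, the fast loop never waits on a model call. In a design like the one the abstract describes, the slower System 2 would run asynchronously and could rewrite entries in a table such as `POLICIES` when it infers a change in the human partner's intent.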
Article 357
Title@2025-05-28 (3): Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead
Title: Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead | Kartierung der Landschaft der afrikanischen NLP: Mapping Progress and Shaping the Road Ahead | 绘制非洲全国土地规划方案景观图:绘制进展图和绘制前面的道路图 2505.21315v2 |
Authors: Jesujoba O. Alabi, Michael A. Hedderich, David Ifeoluwa Adelani, Dietrich Klakow
With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.
Article 358
Title@2025-05-28 (3): PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Title: PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy | PreP-OCR: Eine komplette Pipeline für die Wiederherstellung von Dokumentenbildern und verbesserte OCR-Genauigkeit | PreP-OCR:一个完整的恢复文件图像和增强OCR准确性管道 2505.20429v2 |
Authors: Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
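For readers unfamiliar with the metric, character error rate (CER) is usually computed as the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. The sketch below shows that standard definition together with a relative-reduction helper. The function names and the sample strings and rates are illustrative and are not taken from the paper's evaluation code or results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, e.g. 0.7 for a 70% reduction."""
    return (before - after) / before


cer = character_error_rate("history", "histery")  # one substituted character
print(cer)
print(relative_reduction(0.20, 0.06))  # made-up rates, for illustration only
```

A reported reduction such as 63.9-70.3% is relative in this sense: it compares the CER after the pipeline with the CER of OCR on the raw images, not an absolute drop in percentage points.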
nan
Article 359
Title@2025-05-28 (3): Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages
Title: Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages | Umfassende Bewertung der Lexikalen Normalisierung: Grenzen-Bewusste Ansätze für ungesegmentierte Sprachen | 综合评价词汇正常化:未分语言的边界意识方法 2505.22273v1 |
Authors: Shohei Higashiyama, Masao Utiyama
Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.
nan
Article 360
Title@2025-05-28 (3): Reward Generalization in RLHF: A Topological Perspective
Title: Reward Generalization in RLHF: A Topological Perspective | Lohnverallgemeinerung in RLHF: Eine topologische Perspektive | RLHF的奖励普遍化:地形学观点 2402.10184v7 |
Authors: Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks to model the impact of dataset topologies on reward generalization. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization for free via topology design, while reducing the amount of data requiring annotation.
nan
Article 361
Title@2025-05-28 (3): Odysseus Navigates the Sirens’ Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation
Title: Odysseus Navigates the Sirens’ Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation | Odysseus navigiert das Lied der Sirenen: Dynamische Fokusdekodierung für die faktuelle und vielfältige Open-Ended Text Generation | Odysseus 导航《锡伦斯之歌:事实和多样化的不限名额文本生成的动态焦点解码》 2503.08057v2 |
Authors: Wen Luo, Feifan Song, Wei Li, Guangyue Peng, Shaohang Wei, Houfeng Wang
Large Language Models (LLMs) are increasingly required to generate text that is both factually accurate and diverse across various open-ended applications. However, current stochastic decoding methods struggle to balance such objectives. We introduce Dynamic Focus Decoding (DFD), a novel plug-and-play stochastic approach that resolves this trade-off without requiring additional data, knowledge, or models. DFD adaptively adjusts the decoding focus based on distributional differences across layers, leveraging the modular and hierarchical nature of factual knowledge within LLMs. This dynamic adjustment improves factuality in knowledge-intensive decoding steps and promotes diversity in less knowledge-reliant steps. DFD can be easily integrated with existing decoding methods, enhancing both factuality and diversity with minimal computational overhead. Extensive experiments across seven datasets demonstrate that DFD significantly improves performance, providing a scalable and efficient solution for open-ended text generation.
nan
Article 362
Title@2025-05-28 (3): AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments
Title: AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments | KI für die Klimafinanzierung: Agentische Retrieval- und Multi-Step-Gründung für Frühwarnsystem-Investitionen | AI 气候融资:预警系统投资的 “ 恢复 “ 和 “ 多重理由 “ 2504.05104v2 |
Authors: Saeid Ario Vaghefi, Aymane Hachcham, Veronica Grasso, Jiska Manicus, Nakiete Msemo, Chiara Colesanti Senni, Markus Leippold
Tracking financial investments in climate adaptation is a complex and expertise-intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real-world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI-driven classification methods, including zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought (CoT) prompting, and an agent-based retrieval-augmented generation (RAG) approach. Our results show that the agent-based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert-annotated corpus, providing a valuable resource for future research in AI-driven financial tracking and climate finance transparency.
nan
Article 363
Title@2025-05-28 (3): Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models
Title: Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models | Testzeit-Impfung: Ein universelles Abwehr-Rahmenwerk gegen Jailbreaks für (Multimodale) große Sprachmodelle | 试验时间免疫:针对(穆斯林)大语言模式的防止越狱全面防御框架 2505.22271v1 |
Authors: Yongcan Yu, Yanbo Wang, Ran He, Jian Liang
While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods have been proposed against jailbreak attacks; however, they are often tailored to specific attack types, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.
nan
Article 364
Title@2025-05-28 (3): Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Title: Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL | Denken lernen: Adaptive Reasoning in R1-Style-Modellen über Multi-Stage RL gestalten | 学习思考何时思考:通过多级 RL 在 R1- 标准模型中塑造适应性理性 2505.10832v2 |
Authors: Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis (“…”) into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.
nan
Article 365
Title@2025-05-28 (3): MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps
Title: MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps | MRT bei SemEval-2025 Task 8: Maximierung der Erholung von Tischen mit mehreren Schritten | SemEval-2025 MRT 任务8:最大限度地从有多个步骤的表格中复苏 2505.22264v1 |
Authors: Maximiliano Hormazábal Lagos, Álvaro Bueno Saez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro
In this paper we present our approach to the SemEval 2025 Task 8: Question-Answering over Tabular Data challenge. Our strategy leverages Python code generation with LLMs to interact with the table and obtain the answer to each question. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained, optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.
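The multi-step process described above (plan, translate to code, run, recover from errors) can be illustrated with a small sketch. This is not the authors' implementation: `answer_question`, the prompt wording, the `rows` and `result` variable names, and the retry limit are all hypothetical, and the `llm` argument stands in for any text-in, text-out model call.

```python
from typing import Any, Callable, Dict, List, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def answer_question(rows: List[Dict[str, Any]], question: str,
                    llm: LLM, max_retries: int = 2) -> Optional[Any]:
    """Plan in natural language, generate code, run it, retry on errors."""
    # Step 1: summarize the table so the model knows what it can use.
    columns = ", ".join(sorted(rows[0])) if rows else ""
    # Step 2: ask for natural-language steps.
    plan = llm(f"Columns: {columns}\nQuestion: {question}\nList the steps.")
    # Step 3: ask for code that follows those steps.
    prompt = (f"Columns: {columns}\nQuestion: {question}\nSteps: {plan}\n"
              "Write Python code that stores the answer in `result`, "
              "using the list of dicts `rows`.")
    error: Optional[str] = None
    for _ in range(max_retries + 1):
        code = llm(prompt if error is None
                   else f"{prompt}\nPrevious error: {error}")
        scope: Dict[str, Any] = {"rows": rows}
        try:
            # Step 4: run the generated code. A real system should sandbox
            # this call; exec on untrusted model output is unsafe.
            exec(code, scope)
            return scope["result"]
        except Exception as exc:
            # Step 5: feed the error back to the model and try again.
            error = repr(exc)
    return None
```

Feeding the exception text back into the next prompt is one simple way to handle errors; the abstract does not say which recovery strategy the submitted system uses.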
nan
Article 366
Title@2025-05-28 (3): Something’s Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
Title: Something’s Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks | Irgendetwas ist Fishy In The Data Lake: Eine kritische Neubewertung der Tabelle Union Suche Benchmarks | “数据湖中的鱼:对表格联合搜索基准的重要重新评估” 2505.21329v2 |
Authors: Allaa Boutaleb, Bernd Amann, Hubert Naacke, Rafael Angarita
Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.
nan
Article 367
Title@2025-05-28 (3): Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
Title: Train Sparse Autoencoders Efficiently by Utilizing Features Correlation | Bahnsparse Autoencoder effizient durch die Nutzung von Funktionen Korrelation | 通过使用地物关联, 高效地列列“ 分散的自动编译器” 。 2505.22255v1 |
Authors: Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
nan
Article 368
Title@2025-05-28 (3): Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
Title: Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition | Bewertung von LLMs in Speech wird oft abgeflacht: Testset Kontaminierung in großen Sprachmodellen für die Spracherkennung | 对演讲中LLMs的评价经常是片断的:在大语言语音识别模型中测试设置污染 2505.22251v1 |
Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
Recent work suggests that large language models (LLMs) can improve performance on speech tasks compared to existing systems. To support these claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial portion of the LibriSpeech and Common Voice evaluation sets appears in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with or without contamination are compared, showing that a contaminated LLM is more likely to generate test sentences it has seen during training. Speech recognisers using contaminated LLMs show only subtle differences in error rates, but assign significantly higher probabilities to transcriptions seen during training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
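One simple way to check for the kind of test-set contamination discussed above is to search a text corpus for normalized test transcripts. The sketch below assumes exact matching after lowercasing and punctuation stripping; the abstract does not describe the paper's actual detection procedure, and `normalize` and `contamination_rate` are hypothetical names.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())


def contamination_rate(test_sentences: List[str],
                       corpus_documents: List[str]) -> Tuple[float, List[str]]:
    """Return the fraction of test sentences found verbatim in the corpus,
    together with the matching sentences."""
    # Pad with spaces so that matches respect word boundaries.
    corpus = [f" {normalize(doc)} " for doc in corpus_documents]
    hits = []
    for sentence in test_sentences:
        needle = normalize(sentence)
        if needle and any(f" {needle} " in doc for doc in corpus):
            hits.append(sentence)
    rate = len(hits) / len(test_sentences) if test_sentences else 0.0
    return rate, hits
```

A linear scan like this only suits small corpora; pretraining-scale corpora would need an index (for example, hashed n-grams), which is outside the scope of this sketch.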
nan
Article 369
Title@2025-05-28 (3): Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices
Title: Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices | Bewertung kompakter LLMs für blitzfreie iberische Sprachaufgaben auf Endbenutzer-Geräten | 评价关于最终用户装置的零 - 低 - 低 - 高 - 伊比利亚语语言任务 2504.03312v2 |
Authors: Luís Couto Seller, Íñigo Sanz Torres, Adrián Vogel-Fernández, Carlos González Carballo, Pedro Miguel Sánchez Sánchez, Adrián Carruana Martín, Enrique de Miguel Ambite
Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.
nan
Article 370
Title@2025-05-28 (3): Overcoming Non-monotonicity in Transducer-based Streaming Generation
Title: Overcoming Non-monotonicity in Transducer-based Streaming Generation | Überwindung der Nichtmonotonizität in der Transducer-basierten Streaming-Generation | 克服基于基于跨国公司的溪流一代中的非分子性 2411.17170v2 |
Authors: Zhengrui Ma, Yang Feng, Min Zhang
Streaming generation models are used across many fields, with the Transducer architecture being popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this research, we address this issue by integrating the Transducer's decoding with the history of the input stream via learnable monotonic attention. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the monotonic context representations, thereby avoiding the need to enumerate the exponentially large alignment space during training. Extensive experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios, offering a robust solution for complex generation tasks.
nan
Article 371
Title@2025-05-28 (3): On Provable Length and Compositional Generalization
Title: On Provable Length and Compositional Generalization | Auf evable Länge und kompositorische Verallgemeinerung | 关于可预见长度和组 成 式 通 泛 化 2402.04875v6 |
Authors: Kartik Ahuja, Amin Mansouri
The out-of-distribution generalization capabilities of sequence-to-sequence models can be studied through the lens of two crucial forms of generalization: length generalization, the ability to generalize to sequences longer than those seen during training, and compositional generalization, the ability to generalize to token combinations not seen during training. In this work, we provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence models (deep sets, transformers, state space models, and recurrent neural nets) trained to minimize the prediction error. We show that limited-capacity versions of these architectures achieve both length and compositional generalization provided the training distribution is sufficiently diverse. In the first part, we study structured limited-capacity variants of different architectures and arrive at generalization guarantees with limited diversity requirements on the training distribution. In the second part, we study limited-capacity variants with fewer structural assumptions and arrive at generalization guarantees, but with more diversity requirements on the training distribution. Further, we show that chain-of-thought supervision enables length generalization in higher-capacity counterparts of the architectures we study.
nan
Article 372
Title@2025-05-28 (3): BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain
Title: BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain | BioHopR: Ein Benchmark für Multi-Hop, Multi-Answer Reasoning in der biomedizinischen Domäne | BioHopR:生物医学领域多层次、多层次原因基准 2505.22240v1 |
Authors: Yunsoo Kim, Yusuf Abdulle, Honghan Wu
Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in the biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs.
nan
Article 373
Title@2025-05-28 (3): A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity
Title: A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity | Eine sprachlich motivierte Analyse der intonationalen Phrasierung in Text-to-Speech-Systemen: Lücken in der syntaktischen Sensibilität offenbaren | 以语言动机动动分析从文字到语音系统中的国与国之间的内对文到语音系统中的图片分析:在同步感应方面消除差距 2505.22236v1 |
Authors: Charlotte Pouw, Afra Alishahi, Willem Zuidema
We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.
nan
Article 374
Title@2025-05-28 (3): Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Title: Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models | Qualität Across-Sprachen beurteilen: Ein mehrsprachiger Ansatz zur Vorschulung von Datenfiltern mit Sprachmodellen | 判断各语文的质量:采用多种语文办法,利用语言模式进行培训前数据过滤 2505.22232v1 |
Authors: Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
nan
Article 375
Title@2025-05-28 (3): Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis
Title: Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis | Advancing Hearing Assessment: ASR-basierter frequenzspezifischer Sprachtest zur Diagnose von Presbycusis | 推进听力评估:基于AR的诊断预视能力频率特定语音测试 2505.22231v1 |
Authors: Stefan Bleeck
Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception challenges in conditions like presbycusis. This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. Our approach leverages ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and subsequently analyzing phoneme-level confusion patterns. Key findings indicate that simulated hearing loss introduces specific phoneme confusions, predominantly affecting high-frequency consonants (e.g., alveolar/palatal to labiodental substitutions) and leading to significant phoneme deletions, consistent with the acoustic cues degraded in presbycusis. A test battery curated from these ASR-derived confusions demonstrated diagnostic value, effectively differentiating between simulated normal-hearing and hearing-impaired listeners in a comprehensive simulation. This ASR-driven methodology offers a promising avenue for developing objective, granular, and frequency-specific hearing assessment tools that complement traditional audiometry. Future work will focus on validating these findings with human participants and exploring the integration of advanced AI models for enhanced diagnostic precision.
nan
Article 376
Title@2025-05-28 (3): Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks
Title: Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks | Ausgewogene Berechnungslast und Darstellungsexpressivität in parallelen Hybrid-Neuralen Netzwerken | 在平行混合神经网络中平衡计算负载和代表表达式 2505.19472v2 |
Authors: Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu
Attention and State-Space Models (SSMs), when combined in a hybrid network in sequence or in parallel, provide complementary strengths. A hybrid sequential pipeline alternates between applying a transformer to the input and feeding its output into an SSM. This results in idle periods in the individual components, increasing end-to-end latency and capping throughput. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with the output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation from the inherently divergent outputs of these separate branches, and (ii) load balancing the computation between these parallel branches while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two differentiating factors in FlowHN are, first, a FLOP-aware dynamic token split between the attention and SSM branches, yielding an efficient balance in compute load, and second, a method to fuse the highly divergent outputs from the individual branches to enhance representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy compared to competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4x higher Tokens per Second (TPS) and 2x better Model FLOPs Utilization (MFU).
nan
Article 377
Title@2025-05-28 (3): Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection
Title: Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection | Kontinuierliche Selbstverbesserung von großen Sprachmodellen durch Test-Zeit-Training mit Verifier-getriebener Probenauswahl | 通过测试时间培训不断自我改进大语言模型,并进行验证-驱动抽样选择 2505.19475v2 |
Authors: Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, Walid Ahmed
Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce a new framework called VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training) to address this efficiently. We use a learned verifier to score a pool of generated responses and select only high-ranking pseudo-labeled examples for fine-tuned adaptation. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the response with the highest confidence, if above a fixed threshold, is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
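The selection step described above (generate N candidates, score each with the verifier, keep the best one only if it clears a threshold) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `select_training_pairs`, the `generate` and `verify` callables, and the default values of `n_candidates` and `threshold` are hypothetical, and the subsequent LoRA fine-tuning step is omitted.

```python
from typing import Callable, List, Tuple


def select_training_pairs(
    queries: List[str],
    generate: Callable[[str], str],       # hypothetical: samples one answer
    verify: Callable[[str, str], float],  # hypothetical: reliability score
    n_candidates: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Build pseudo-labeled (query, answer) pairs for test-time training."""
    pairs = []
    for query in queries:
        # Sample N candidate answers for this query.
        candidates = [generate(query) for _ in range(n_candidates)]
        # Score every candidate with the verifier.
        scored = [(verify(query, answer), answer) for answer in candidates]
        # Keep the top-scoring candidate only if it clears the threshold.
        best_score, best_answer = max(scored, key=lambda item: item[0])
        if best_score > threshold:
            pairs.append((query, best_answer))
    return pairs
```

Queries whose best candidate falls at or below the threshold contribute no training pair, which keeps low-confidence pseudo-labels out of the adaptation data.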
nan
Article 378
Title@2025-05-28 (3): You Do Not Fully Utilize Transformer’s Representation Capacity
Title: You Do Not Fully Utilize Transformer’s Representation Capacity | Sie nicht voll nutzen Transformer-Repräsentanz Kapazität | 您没有充分利用变换器的代表能力 2502.09245v2 |
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
In contrast to RNNs, which compress their history into a single hidden state, Transformers can attend to all past tokens directly. However, standard Transformers rely solely on the hidden state from the previous layer to represent the entire context. We show that this design choice induces representation collapse and degrades performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a lightweight extension that leverages existing key-value buffers and learns per-head, per-layer routing weights to integrate representations from all previous layers with negligible overhead. Across extensive experiments (including language modeling, synthetic reasoning benchmarks, and very deep architectures), LIMe consistently achieves faster convergence, lower perplexity per FLOP, and substantial accuracy improvements on synthetic tasks while preserving higher value-vector entropy and improved token separability. Finally, our analysis of the learned routing weights reveals systematic reuse of both local and long-distance features, demonstrating how LIMe mitigates collapse, unlocks richer representations without increasing hidden-state size, and points to promising directions for future research.
Article 379
Title@2025-05-28 (3): Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning
Title: Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning | Anpassung vorgebildeter Sprachmodelle für die Klassifizierung von Zitationen über selbstüberwachtes kontrastives Lernen | 调整通过自我监督反竞争学习的招录分类的训练前语言模式 2505.14471v2 |
Authors: Tong Li, Jiachuan Wang, Yongqi Zhang, Shuangyin Li, Lei Chen
Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge they gained during pretraining. However, directly fine-tuning for citation classification is challenging due to labeled data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework, Citss, that adapts the PLMs to overcome these challenges. Citss introduces self-supervised contrastive learning to alleviate data scarcity, and is equipped with two specialized strategies to obtain the contrastive pairs: sentence-level cropping, which enhances focus on target citations within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Compared with previous works that are only designed for encoder-based PLMs, Citss is carefully developed to be compatible with both encoder-based PLMs and decoder-based LLMs, to embrace the benefits of enlarged pretraining. Experiments with three benchmark datasets with both encoder-based PLMs and decoder-based LLMs demonstrate our superiority compared to the previous state of the art. Our code is available at: github.com/LITONG99/Citss
Article 380
Title@2025-05-28 (3): Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation
Title: Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation | Look & Mark: Leveraging Radiologe Eye Fixations und Bounding Boxen in multimodalen großen Sprachmodellen für die Erzeugung von Röntgenberichten im Brustkorb | Look & Mark: 将辐射学家眼修补和检查框用于胸前X光报告生成的多模式大语言模型中 2505.22222v1 |
Authors: Yunsoo Kim, Jinge Wu, Su-Hwan Kim, Pardeep Vasudev, Jiashu Shen, Honghan Wu
Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M’s potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
Article 381
Title@2025-05-28 (3): Advancing Sequential Numerical Prediction in Autoregressive Models
Title: Advancing Sequential Numerical Prediction in Autoregressive Models | Advancing Sequential Numerical Prediction in Autoregressive Modelle | 自动递减模型中推进序列序号预测 2505.13077v2 |
Authors: Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, Can Huang
Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover’s Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
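To illustrate why an ordinal distance helps at the token level, here is a minimal sketch of the plain one-dimensional Earth Mover's Distance between two distributions over the digits 0 to 9. It is not the NTIL loss itself, which the abstract says extends EMD; the helper names `emd_1d` and `one_hot` are assumptions made for this example. For ordered classes with unit spacing, EMD equals the summed absolute difference of the two cumulative distributions.

```python
from itertools import accumulate
from typing import List, Sequence


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Earth Mover's Distance between two distributions over ordered
    classes with unit spacing: the L1 distance between their CDFs."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def one_hot(index: int, size: int = 10) -> List[float]:
    """Distribution that puts all probability mass on one digit."""
    return [1.0 if i == index else 0.0 for i in range(size)]


target = one_hot(3)
near_miss = emd_1d(one_hot(4), target)  # predicted digit 4, off by one
far_miss = emd_1d(one_hot(9), target)   # predicted digit 9, off by six
```

Cross-entropy penalizes both wrong predictions identically because each assigns zero probability to the target digit, whereas the EMD grows with the numerical gap, so `near_miss` is smaller than `far_miss`.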
Article 382
Title@2025-05-28 (3): The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
Title: The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants | Die Avengers: Ein einfaches Rezept für die Vereinigung kleinerer Sprachmodelle, um proprietäre Riesen herauszufordern | 《复仇者:将小型语言模式联合起来挑战产权巨人挑战小型语言模式的简单食谱》 2505.19797v2 |
Authors: Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu
As proprietary giants increasingly dominate the race for ever-larger language models, a pressing question arises for the open-source community: can smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers, a simple recipe that effectively leverages the collective intelligence of open-source, smaller language models. Our framework is built upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model’s performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response using Self-Consistency or its multi-model variant. Remarkably, with 10 open-source models (~7B parameters each), the Avengers collectively outperforms GPT-4.1 on nine out of 15 datasets (spanning mathematics, code, logic, knowledge, and affective tasks). In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter, the number of clusters. We have open-sourced the code on GitHub: https://github.com/ZhangYiqun018/Avengers
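The inference-time routing described above (nearest cluster, best-scoring model, then voting) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released code: the two-dimensional embeddings, the centroids, the per-cluster score table, and the model names `model_a` and `model_b` are all invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def nearest_cluster(embedding: Sequence[float],
                    centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to the query embedding (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(embedding, centroids[i]))


def route(embedding: Sequence[float],
          centroids: List[Sequence[float]],
          scores: List[Dict[str, float]]) -> str:
    """Pick the model with the best recorded score in the query's cluster."""
    cluster = nearest_cluster(embedding, centroids)
    return max(scores[cluster], key=scores[cluster].get)


def majority_vote(answers: List[str]) -> str:
    """Self-consistency style aggregation: the most frequent answer wins."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical centroids and per-cluster validation scores.
centroids = [(0.0, 0.0), (10.0, 10.0)]
scores = [{"model_a": 0.8, "model_b": 0.6},
          {"model_a": 0.5, "model_b": 0.9}]

chosen = route((9.0, 8.5), centroids, scores)
final = majority_vote(["42", "41", "42"])
```

In a real setup the embeddings would come from a text embedding model and the sampled answers from the selected model; here both are replaced by fixed toy values so that the routing logic stays visible.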
Article 383
Title@2025-05-28 (3): On the Within-class Variation Issue in Alzheimer’s Disease Detection
Title: On the Within-class Variation Issue in Alzheimer’s Disease Detection | Zur klasseninternen Variationsfrage bei der Alzheimer-Erkennung | 阿尔茨海默氏氏病检测的 类内变化变化问题 2409.16322v2 |
Authors: Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Alzheimer’s Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, AD detection faces within-class variation as a critical challenge: individuals with AD exhibit a spectrum of cognitive impairments. Therefore, simplistic binary AD classification may overlook two crucial aspects: within-class heterogeneity and instance-level imbalance. In this work, we find that a sample score estimator can generate sample-specific soft scores that align with cognitive scores. We subsequently propose two simple yet effective methods, Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), which target these two problems respectively. Using the ADReSS and CU-MARVEL corpora, we demonstrate and analyze the advantages of the proposed approaches in detection performance. These findings provide insights for developing robust and reliable AD detection models.
Article 384
Title@2025-05-28 (3): Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
Title: Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity | Pangu Pro MoE: Mischung aus gruppierten Experten für effiziente Sparsamkeit | Pangu Pro MoE:高效公平问题专家组混合 2505.21411v2 |
Authors: Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang
The rise of Mixture of Experts (MoE) in Large Language Models promises a much larger parameter count and learning capacity at a small execution cost, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
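The grouping constraint can be illustrated with a small routing sketch: instead of taking the global top-k experts, each token takes the same number of experts from every group. This is only a schematic reading of the abstract, not Pangu Pro MoE's router; the function name `grouped_top_k`, the contiguous group layout, and the toy scores are assumptions.

```python
from typing import List, Sequence


def grouped_top_k(router_scores: Sequence[float],
                  num_groups: int,
                  k_per_group: int) -> List[int]:
    """Select k_per_group experts from each contiguous group of experts,
    so every group (for example, every device) receives equal load."""
    num_experts = len(router_scores)
    if num_groups <= 0 or num_experts % num_groups != 0:
        raise ValueError("experts must split evenly into groups")
    group_size = num_experts // num_groups
    if not 0 < k_per_group <= group_size:
        raise ValueError("k_per_group must be between 1 and the group size")
    selected: List[int] = []
    for g in range(num_groups):
        members = range(g * group_size, (g + 1) * group_size)
        # Rank the experts of this group only, highest score first.
        ranked = sorted(members, key=lambda i: router_scores[i], reverse=True)
        selected.extend(sorted(ranked[:k_per_group]))
    return selected


# Toy router scores for 8 experts split into 2 groups of 4.
toy_scores = [0.9, 0.8, 0.7, 0.1, 0.05, 0.2, 0.3, 0.0]
picked = grouped_top_k(toy_scores, num_groups=2, k_per_group=1)
```

With these toy scores a global top-2 would activate experts 0 and 1, both in the first group, while the grouped selection activates one expert per group, which is the load-balancing property the abstract attributes to MoGE.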
Article 385
Title@2025-05-28 (3): Pitfalls of Rule- and Model-based Verifiers – A Case Study on Mathematical Reasoning
Title: Pitfalls of Rule- and Model-based Verifiers – A Case Study on Mathematical Reasoning | Pitfalls of Rule- and Model-based Verifiers – Eine Fallstudie zur mathematischen Begründung | 规则和基于示范的验证符咒 – – 关于数学理由的个案研究 2505.22203v1 |
Authors: Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, Junxian He
Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique risks inherent to both rule-based and model-based verifiers, aiming to offer valuable insights to develop more robust reward systems in reinforcement learning.
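The false-negative problem for rule-based verifiers is easy to reproduce in miniature. The sketch below is a hypothetical example, not one of the verifiers studied in the paper: `exact_match` rejects a correct answer written in a different format, while `numeric_match` accepts it by comparing exact rational values.

```python
from fractions import Fraction


def exact_match(prediction: str, reference: str) -> bool:
    """Naive rule: answers must be identical strings after trimming."""
    return prediction.strip() == reference.strip()


def numeric_match(prediction: str, reference: str) -> bool:
    """Compare answers as exact rationals when both parse as numbers;
    otherwise fall back to string equality."""
    try:
        return Fraction(prediction.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return exact_match(prediction, reference)


false_negative = not exact_match("0.5", "1/2")  # correct answer rejected
recovered = numeric_match("0.5", "1/2")         # same answer accepted
```

Normalization of this kind only covers plain numbers; symbolic expressions, units, and intervals need further rules, which is part of why such verifiers miss equivalent answers in practice.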
Article 386
Title@2025-05-28 (3): Let’s Predict Sentence by Sentence
Title: Let’s Predict Sentence by Sentence | Let’s Predict Satz durch Satz | 让我们按刑期预测判决 2505.22202v1 |
Authors: Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions: sentences, propositions, and concepts. This contrast raises a central question: can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains (mathematics, logic, commonsense, and planning), contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.
Article 387
Title@2025-05-28 (3): Machine Translation Models are Zero-Shot Detectors of Translation Direction
Title: Machine Translation Models are Zero-Shot Detectors of Translation Direction | Maschinelle Übersetzungsmodelle sind Null-Schuss-Detektoren der Übersetzungsrichtung | 机器翻译模型是翻译方向零热探测器 2401.06769v4 |
Authors: Michelle Wastl, Jannis Vamvas, Rico Sennrich
Detecting the translation direction of parallel text has applications for machine translation training and evaluation, but also has forensic applications such as resolving plagiarism or forgery allegations. In this work, we explore an unsupervised approach to translation direction detection based on the simple hypothesis that $p(\text{translation} | \text{original})>p(\text{original} | \text{translation})$, motivated by the well-known simplification effect in translationese or machine-translationese. In experiments with massively multilingual machine translation models across 20 translation directions, we confirm the effectiveness of the approach for high-resource language pairs, achieving document-level accuracies of 82–96% for NMT-produced translations, and 60–81% for human translations, depending on the model used. Code and demo are available at https://github.com/ZurichNLP/translation-direction-detection
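The decision rule implied by this hypothesis can be written down directly. The sketch below assumes a callable `log_prob(source, target)` that returns the translation model's log-probability of `target` given `source`; that callable, the function name `detect_direction`, and the toy score table are hypothetical and not taken from the linked repository.

```python
from typing import Callable, Dict, Tuple


def detect_direction(text_a: str,
                     text_b: str,
                     log_prob: Callable[[str, str], float]) -> str:
    """Guess which side is the original by comparing the two conditional
    log-probabilities: the more probable output is taken as the translation."""
    a_to_b = log_prob(text_a, text_b)  # log p(b | a)
    b_to_a = log_prob(text_b, text_a)  # log p(a | b)
    return "a->b" if a_to_b > b_to_a else "b->a"


# Invented scores standing in for a real machine translation model.
toy_table: Dict[Tuple[str, str], float] = {
    ("Das ist ein Haus.", "This is a house."): -4.0,
    ("This is a house.", "Das ist ein Haus."): -6.5,
}


def toy_log_prob(source: str, target: str) -> float:
    return toy_table[(source, target)]


direction = detect_direction("Das ist ein Haus.", "This is a house.", toy_log_prob)
```

A practical version would likely need to account for sentence length and aggregate over the sentences of a document; this sketch compares raw scores for a single pair only.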
Article 388
Title@2025-05-28 (3): ClonEval: An Open Voice Cloning Benchmark
Title: ClonEval: An Open Voice Cloning Benchmark | ClonEval: Eine offene Stimme Klon-Benchmark | ClonEval: 开放语音克隆基准 2504.20581v2 |
Authors: Iwona Christop, Tomasz Kuczyński, Marek Kubis
We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of results on the leaderboard.
Article 389
Title@2025-05-28 (3): PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims
Title: PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims | PEDANTIC: Ein Datensatz für die automatische Prüfung der Wirksamkeit von Patentansprüchen | PEDANTIC: 自动审查专利索赔的缺陷数据集 2505.21342v2 |
Authors: Valentin Knappich, Annemarie Friedrich, Anna Hätty, Simon Razniewski
Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline’s accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.
Article 390
Title@2025-05-28 (3): Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon
Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon | Breaking the Cloak! Enthüllung der chinesischen verhüllten Toxizität mit Homophon Graph und giftigem Lexikon | 破解衣物! 中华便衣毒物与同声图和毒毒词汇结合 2505.22184v1 |
Authors: Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words with homophonic cloaks, which motivates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while unveiling Chinese cloaked toxicity remains unsolved. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on a Chinese homophone graph and a toxic lexicon. Then it filters out non-toxic candidates and restores the remaining cloaks to the toxic words they stand for. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively.
Article 391
Title@2025-05-28 (3): TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
Title: TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation | TabXEval: Warum ist das ein schlechter Tisch? Eine eXhaustive Rubrik für die Tabellenbewertung | TabXEval: 为什么这是一张糟糕的桌子? 用于表格评价的 e Xhaustive Rubric 2505.22176v1 |
Authors: Vihang Pancholi, Jainit Bafna, Tejas Anvekar, Manish Shrivastava, Vivek Gupta
Evaluating tables qualitatively and quantitatively presents a significant challenge, as traditional metrics often fail to capture nuanced structural and content discrepancies. To address this, we introduce a novel, methodical rubric integrating multi-level structural descriptors with fine-grained contextual quantification, thereby establishing a robust foundation for comprehensive table comparison. Building on this foundation, we propose TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference tables structurally via TabAlign and then conducts a systematic semantic and syntactic comparison using TabCompare; this approach clarifies the evaluation process and pinpoints subtle discrepancies overlooked by conventional methods. The efficacy of this framework is assessed using TabXBench, a novel, diverse, multi-domain benchmark we developed, featuring realistic table perturbations and human-annotated assessments. Finally, a systematic analysis of existing evaluation methods through sensitivity-specificity trade-offs demonstrates the qualitative and quantitative effectiveness of TabXEval across diverse table-related tasks and domains, paving the way for future innovations in explainable table evaluation.
Article 392
Title@2025-05-28 (3): Reverse Preference Optimization for Complex Instruction Following
Title: Reverse Preference Optimization for Complex Instruction Following | Reverse-Preference-Optimierung für komplexe Instruktionen | 复杂指令的逆偏偏优化 2505.22172v1 |
Authors: Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
Article 393
Title@2025-05-28 (3): ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Title: ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments | ZuverlässigEval: Ein Rezept für die stochastische LLM-Bewertung über die Methode der Momente | 可靠有效:通过瞬间方法进行沙尘暴 LLM评价的食谱 2505.22169v1 |
Authors: Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
Article 394
Title@2025-05-28 (3): Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Title: Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search | Tempest: Autonomes Multi-Turn-Jailbreaking von großen Sprachmodellen mit Baumsuche | 暴风:利用树木搜索的大型语言模型的多发自动破获多语监狱 2503.10619v5 |
Authors: Andy Zhou, Ron Arel
We introduce Tempest, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Tempest expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Tempest reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Tempest achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
Article 395
Title@2025-05-28 (3): Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes
Title: Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes | Kontinuierliche und diskrete Diffusion mit nicht gleichzeitigen Diffusionsprozessen | 与非平行扩散进程一起进行连续和分解的不连续和分解文本传播 2505.22165v1 |
Authors: Bocheng Li, Zhujin Gao, Linli Xu
Diffusion models have emerged as a promising approach for text generation, with recent works falling into two main categories: discrete and continuous diffusion models. Discrete diffusion models apply token corruption independently using categorical distributions, allowing for different diffusion progress across tokens but lacking fine-grained control. Continuous diffusion models map tokens to continuous spaces and apply fine-grained noise, but the diffusion progress is uniform across tokens, limiting their ability to capture semantic nuances. To address these limitations, we propose Non-simultaneous Continuous Diffusion Models (NeoDiff), a novel diffusion model that integrates the strengths of both discrete and continuous approaches. NeoDiff introduces a Poisson diffusion process for the forward process, enabling a flexible and fine-grained noising paradigm, and employs a time predictor for the reverse process to adaptively modulate the denoising progress based on token semantics. Furthermore, NeoDiff utilizes an optimized schedule for inference to ensure more precise noise control and improved performance. Our approach unifies the theories of discrete and continuous diffusion models, offering a more principled and effective framework for text generation. Experimental results on several text generation tasks demonstrate NeoDiff’s superior performance compared to baselines of non-autoregressive continuous and discrete diffusion models, iterative-based methods and autoregressive diffusion-based methods. These results highlight NeoDiff’s potential as a powerful tool for generating high-quality text and advancing the field of diffusion-based text generation.
Article 396
Title@2025-05-28 (3): Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Title: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy | Stratifizierte selektive Probenahme für Instruction Tuning mit dedizierter Scoring-Strategie | 使用专用 Scoring 战略进行教学指示指示的分批选择性抽样 2505.22157v1 |
Authors: Paramita Mirza, Lucas Weber, Fabian Küch
Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both efficient and universal by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data, which is crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.
Article 397
Title@2025-05-28 (3): Towards Practical Defect-Focused Automated Code Review
Title: Towards Practical Defect-Focused Automated Code Review | Auf dem Weg zu einer praktischen fehlerorientierten automatisierten Code-Überprüfung | 走向实际失效-受污染的自动编码审查 2505.17928v2 |
Authors: Junyi Lu, Lili Jiang, Xiaojia Li, Jianbing Fang, Fengjun Zhang, Li Yang, Chun Zuo
The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.
Article 398
Title@2025-05-28 (3): InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing
Title: InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing | InComeS: Integration von Kompressions- und Auswahlmechanismen in LLMs für effiziente Modellbearbeitung | 因果:将压缩和甄选机制纳入高效模式编辑LLMLM 2505.22156v1 |
Authors: Shuaiyi Li, Zhisong Zhang, Yang Deng, Chenlong Deng, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Wai Lam
Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs’ ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model’s context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.
Article 399
Title@2025-05-28 (3): Incentivizing Strong Reasoning from Weak Supervision
Title: Incentivizing Strong Reasoning from Weak Supervision | Starke Vernunft von schwacher Aufsicht anregen | 以弱监管为强力理由的激励 2505.20072v2 |
Authors: Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/w2sr.
Article 400
Title@2025-05-28 (3): Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Title: Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language | Flexible Werkzeugauswahl durch Low-dimensionale Attributausrichtung von Vision und Sprache | 通过视力和语言的低维属性一致进行灵活工具选择 2505.22146v1 |
Authors: Guangfu Hao, Haojie Wen, Liangxuna Guo, Yang Chen, Yanchao Bi, Shan Yu
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching the performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
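To make the attribute-alignment idea concrete, here is a minimal sketch of the matching step only. It assumes the visual encoder and the language model have already produced attribute vectors; the cosine-similarity rule, the three toy attributes, the numeric values, and the `select_tool` helper are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tool(required, tool_attributes):
    """Return the tool whose attribute vector best matches the required attributes."""
    return max(tool_attributes, key=lambda name: cosine(required, tool_attributes[name]))


# Toy vectors over (graspability, elongation, sharpness); the values are made up.
tools = {
    "hammer": [0.9, 0.6, 0.1],
    "knife": [0.8, 0.7, 0.9],
    "bowl": [0.3, 0.1, 0.0],
}
task_needs = [0.7, 0.6, 1.0]  # assumed attributes derived from a task such as "slice the bread"
best = select_tool(task_needs, tools)
```

Because both modalities meet in the same small attribute space, the final choice reduces to a nearest-neighbor lookup, which is also what makes the decision easy to inspect attribute by attribute.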
Article 401
Title@2025-05-28 (3): LLMs Reproduce Stereotypes of Sexual and Gender Minorities
Title: LLMs Reproduce Stereotypes of Sexual and Gender Minorities | LLMs reproduzieren Stereotypen sexueller und geschlechtsspezifischer Minderheiten | LLMs 重塑对性和性别少数群体的陈规定型观念 2501.05926v2 |
Authors: Ruby Ostrow, Adam Lopez
A large body of research has found substantial gender bias in NLP systems. Most of this research takes a binary, essentialist view of gender: limiting its variation to the categories men and women, conflating gender with sex, and ignoring different sexual identities. But gender and sexuality exist on a spectrum, so in this paper we study the biases of large language models (LLMs) towards sexual and gender minorities beyond binary categories. Grounding our study in a widely used social psychology model – the Stereotype Content Model – we demonstrate that English-language survey questions about social perceptions elicit more negative stereotypes of sexual and gender minorities from both humans and LLMs. We then extend this framework to a more realistic use case: text generation. Our analysis shows that LLMs generate stereotyped representations of sexual and gender minorities in this setting, showing that they amplify representational harms in creative writing, a widely advertised use for LLMs.
Article 402
Title@2025-05-28 (3): EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
Title: EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning | EPO: Explizite politische Optimierung der strategischen Vernunft in LLMs durch Verstärkungslernen | EPO: 通过强化学习,在LLMs中明确政策优化战略理由 2502.12486v6 |
Authors: Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning, an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in an open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL), utilizing process rewards and iterative self-play. Experiments across social and physical domains demonstrate EPO’s capacity for long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications. Code and data are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/EPO.
Article 403
Title@2025-05-28 (3): Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments
Title: Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments | Begrenzte Verallgemeinerbarkeit im Argumentbergbau: State-of-The-Art-Modelle lernen Datensätze, keine Argumente | 《争议采矿业的限制性通用性:国家与艺术中的模式学习数据集,非论据》 2505.22137v1 |
Authors: Marc Feger, Katarina Boland, Stefan Dietze
Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17 English sentence-level datasets as most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.
Article 404
Title@2025-05-28 (3): RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding
Title: RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding | RAD: Redundanz-Bewusst-Destillation für Hybridmodelle über selbstspekulative Decodierung | RAD: 通过自投机代号为混合模型进行再利用-软件蒸馏 2505.22135v1 |
Authors: Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, Hiroto Takegawa
Hybrid models combining Transformers and State Space Models (SSMs) are promising for balancing performance and efficiency. However, optimizing these hybrid models, particularly by addressing the potential redundancy inherent within the Transformer components, remains a significant challenge. In this paper, we propose RAD (Redundancy-Aware Distillation), a novel framework that uses self-speculative decoding as a diagnostic tool to identify redundant attention layers within the model. These identified layers are then selectively replaced with SSM components, followed by targeted (self-)distillation. Specifically, RAD focuses knowledge transfer on the components identified as redundant, considering architectural changes and specific weight initialization strategies. We experimentally demonstrate that self-distillation using RAD significantly surpasses the performance of the original base model on mathematical and coding tasks. Furthermore, RAD is also effective in standard knowledge distillation settings, achieving up to approximately 2x faster convergence compared to baseline methods. Notably, while a baseline model distilled from a Llama-3.1 70B teacher achieves scores of 46.17 on GSM8K and 22.75 on CRUX, RAD achieves significantly higher scores of 71.27 on GSM8K and 28.25 on CRUX, even when using a much smaller Llama-3.1 8B teacher. RAD offers a new pathway for efficient optimization and performance enhancement in the distillation of hybrid models.
Article 405
Title@2025-05-28 (3): EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning
Title: EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning | EULER: Verbesserung der vernünftigen Fähigkeit großer Sprachmodelle durch fehlerinduziertes Lernen | EULER:通过错误引起的学习提高大语言模式的理性能力 2505.22131v1 |
Authors: Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trials, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All code is available at https://github.com/NEUIR/EULER.
Article 406
Title@2025-05-28 (3): Towards Achieving Concept Completeness for Textual Concept Bottleneck Models
Title: Towards Achieving Concept Completeness for Textual Concept Bottleneck Models | Auf dem Weg zur Verwirklichung des Konzepts Vollständigkeit für textuelle Konzepte Engpassmodelle | 实现文本概念瓶颈模式概念完整性 2502.11100v3 |
Authors: Milan Bhan, Yann Choho, Pierre Moreau, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot
Textual Concept Bottleneck Models (TCBMs) are interpretable-by-design models for text classification that predict a set of salient concepts before making the final prediction. This paper proposes Complete Textual Concept Bottleneck Model (CT-CBM), a novel TCBM generator building concept labels in a fully unsupervised manner using a small language model, eliminating both the need for predefined human labeled concepts and LLM annotations. CT-CBM iteratively targets and adds important and identifiable concepts in the bottleneck layer to create a complete concept basis. CT-CBM achieves striking results against competitors in terms of concept basis completeness and concept detection accuracy, offering a promising solution to reliably enhance interpretability of NLP classifiers.
Article 407
Title@2025-05-28 (3): LoKI: Low-damage Knowledge Implanting of Large Language Models
Title: LoKI: Low-damage Knowledge Implanting of Large Language Models | LoKI: Low-Damage Knowledge Implanting von großen Sprachmodellen | LoKI: 低损害知识植入大语言模型 2505.22120v1 |
Authors: Runyu Wang, Peng Ping, Zhengyu Guo, Xiaoye Zhang, Quan Shi, Liting Zhou, Tianbo Ji
Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose Low-damage Knowledge Implanting (LoKI), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code at https://github.com/Nexround/LoKI.
Article 408
Title@2025-05-28 (3): Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches
Title: Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches | Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: Eine Geschichte von zwei Ansätzen | 多语种和跨语种检索实况调查索赔:两种方法的故事 2505.22118v1 |
Authors: Alan Ramponi, Marco Rovera, Robert Moro, Sara Tonelli
Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in the case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
Article 409
Title@2025-05-28 (3): Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model
Title: Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model | Multimodale Vorhersage von Sparse Intraoperativen Hypotonieereignissen durch Sprachmodell | 以语言模式为动力的草散的不合作和不连续活动多式预报 2505.22116v1 |
Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Qi Liu, Yanhu Xie
Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model's sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.
Article 410
Title@2025-05-28 (3): Mitigating Text Toxicity with Counterfactual Generation
Title: Mitigating Text Toxicity with Counterfactual Generation | Eindämmung der Texttoxizität mit kontrafaktischer Generierung | 减少毒剂毒性,同时防止产生事实上的产生 2405.09948v3 |
Authors: Milan Bhan, Jean-Noel Vittaut, Nina Achache, Victor Legrand, Nicolas Chesneau, Annabelle Blangero, Juliette Murris, Marie-Jeanne Lesot
Toxicity mitigation consists in rephrasing text in order to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while preserving the initial non-toxic meaning at the same time. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In particular, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier distinguishing between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text as compared to classical detoxification methods. Finally, we take a step back from using automated detoxification tools, and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical application of XAI methods.
Article 411
Title@2025-05-28 (3): CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature
Title: CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature | CHIMERA: Eine Wissensbasis der Ideenrekombination in der wissenschaftlichen Literatur | CHIMERA:科学文献中思想再融合的知识库 2505.20779v2 |
Authors: Noy Sternlicht, Tom Hope
A hallmark of human innovation is the process of recombination – creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at https://github.com/noy-sternlicht/CHIMERA-KB
Article 412
Title@2025-05-28 (3): THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
Title: THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models | THINK-Bench: Bewertung des Denkens Effizienz und nachdenkliche Qualität von Modellen großer Vernunft | THINK-Bench:评估大型推理模型的思考效率与思维链质量 2505.22113v1 |
Authors: Zhiyuan Li, Yi Chang, Yuan Wu
Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.
Article 413
Title@2025-05-28 (3): Redundancy Principles for MLLMs Benchmarks
Title: Redundancy Principles for MLLMs Benchmarks | Redundanzgrundsätze für MLLM-Benchmarks | MLLLMs基准标准的裁员原则 2501.13953v2 |
Authors: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) redundancy of benchmark capability dimensions, 2) redundancy in the number of test questions, and 3) cross-benchmark redundancy within specific domains. Through a comprehensive analysis of the performance of hundreds of MLLMs across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively. The code is available at https://github.com/zzc-1998/Benchmark-Redundancy.
Article 414
Title@2025-05-28 (3): Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge
Title: Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge | Generatives Rahmenwerk für personalisierte Überzeugung: Aufschluss über Kausal-, Gegen- und Latentenwissen | 个性化观察分析的生成框架:推断因果关系、反事实和隐藏知识 2504.13904v2 |
Authors: Donghuo Zeng, Roberto Legaspi, Yuewen Sun, Xinshuai Dong, Kazushi Ikeda, Peter Spirtes, Kun Zhang
We hypothesize that optimal system responses emerge from adaptive strategies grounded in causal and counterfactual knowledge. Counterfactual inference allows us to create hypothetical scenarios to examine the effects of alternative system responses. We enhance this process through causal discovery, which identifies the strategies informed by the underlying causal structure that govern system behaviors. Moreover, we consider the psychological constructs and unobservable noises that might be influencing user-system interactions as latent factors. We show that these factors can be effectively estimated. We employ causal discovery to identify strategy-level causal relationships among user and system utterances, guiding the generation of personalized counterfactual dialogues. We model the user utterance strategies as causal factors, enabling system strategies to be treated as counterfactual actions. Furthermore, we optimize policies for selecting system responses based on counterfactual data. Our results using a real-world dataset on social good demonstrate significant improvements in persuasive system outcomes, with increased cumulative rewards validating the efficacy of causal discovery in guiding personalized counterfactual inference and optimizing dialogue policies for a persuasive dialogue system.
Article 415
Title@2025-05-28 (3): Curse of High Dimensionality Issue in Transformer for Long-context Modeling
Title: Curse of High Dimensionality Issue in Transformer for Long-context Modeling | Fluch der Hochdimensionalitätsfrage im Transformer für die Langkontextmodellierung | 变异器中高多维度问题的诅咒,用于长期建模 2505.22107v1 |
Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
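The idea of aggregating less important tokens during attention can be illustrated with a small single-query sketch. This is not the DGA implementation from the linked repository: the externally supplied `importance` scores, the mean-pooling of grouped keys and values, and the `grouped_attention` helper are all assumptions chosen to keep the example short.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def grouped_attention(query, keys, values, importance, keep, group_size):
    """Attend to the `keep` most important tokens individually and to
    mean-pooled groups of the remaining tokens (hypothetical grouping rule)."""
    order = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    top, rest = sorted(order[:keep]), sorted(order[keep:])
    red_keys = [keys[i] for i in top]
    red_values = [values[i] for i in top]
    for start in range(0, len(rest), group_size):
        group = rest[start:start + group_size]
        red_keys.append(mean_vector([keys[i] for i in group]))
        red_values.append(mean_vector([values[i] for i in group]))
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in red_keys]
    weights = softmax(scores)
    dim = len(red_values[0])
    output = [sum(w * v[d] for w, v in zip(weights, red_values)) for d in range(dim)]
    return output, len(red_keys)


keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.1], [0.1, 0.2], [0.3, 0.3]]
values = [[float(i), 1.0] for i in range(6)]
importance = [0.9, 0.8, 0.1, 0.05, 0.2, 0.1]
out, slots = grouped_attention([1.0, 0.0], keys, values, importance, keep=2, group_size=2)
```

With six tokens, two kept individually and the other four pooled in pairs, the softmax runs over four slots instead of six; the saving grows with sequence length as more low-importance tokens share a slot.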
Article 416
Title@2025-05-28 (3): Visuospatial Cognitive Assistant
Title: Visuospatial Cognitive Assistant | Visuospatial Cognitive Assistant | 活性呼吸空间感知助理 2505.12312v3 |
Authors: Qi Feng
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
Article 417
Title@2025-05-28 (3): Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | Deep Video Discovery: Agentische Suche mit Tool-Nutzung für Langzeit-Video-Verständnis | 深视频发现: 用于远程视频理解的工具的 Agric 搜索 2505.18079v2 |
Authors: Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent uses the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. Comprehensive evaluation on multiple long-video understanding benchmarks demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior work by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.
Article 418
Title@2025-05-28 (3): Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
Title: Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts | Auf dem Weg zur Visuospatialen Kognition durch hierarchische Fusion von visuellen Experten | 争取通过视觉专家的等级化融合实现纵向空间聚合 2505.12363v3 |
Authors: Qi Feng
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
Article 419
Title@2025-05-28 (3): MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
Title: MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models | MemOS: Ein Betriebssystem für die speichergesteigerte Generation (MAG) in großen Sprachmodellen | MemOS:大语言模型中记忆增强生成操作系统(MAG) 2505.22101v1 |
Authors: Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, Feiyu Xiong
Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.
Article 420
Title@2025-05-28 (3): K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor
Title: K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor | K-COMP: Retrieval-Augmented Medical Domain Frage beantworten mit wissensinjizierten Kompressor | K- COMP: 以知识输入压缩器回答问题 2501.13567v3 |
Authors: Jeonghun Cho, Gary Geunbae Lee
Retrieval-augmented question answering (QA) integrates external information and thereby increases the QA accuracy of reader models that lack domain knowledge. However, documents retrieved for closed domains require high expertise, so the reader model may have difficulty fully comprehending the text. Moreover, the retrieved documents contain thousands of tokens, some unrelated to the question. As a result, the documents include some inaccurate information, which could lead the reader model to mistrust the passages and could result in hallucinations. To solve these problems, we propose K-comp (Knowledge-injected compressor) which provides the knowledge required to answer correctly. The compressor automatically generates the prior knowledge necessary to facilitate the answer process prior to compression of the retrieved passages. Subsequently, the passages are compressed autoregressively, with the generated knowledge being integrated into the compression process. This process ensures alignment between the question intent and the compressed context. By augmenting this prior knowledge and concise context, the reader models are guided toward relevant answers and trust the context.
Article 421
Title@2025-05-28 (3): Enhancing Target-unspecific Tasks through a Features Matrix
Title: Enhancing Target-unspecific Tasks through a Features Matrix | Verbesserung von Ziel-unspezifischen Aufgaben durch eine Features Matrix | 通过特征矩阵,加强针对特定目标的任务 2505.03414v4 |
Authors: Fangming Cui, Yonggang Zhang, Xuan Wang, Xinmei Tian, Jun Yu
Recent developments in prompt learning of large Vision-Language Models (VLMs) have significantly improved performance in target-specific tasks. However, these prompting methods often struggle to handle target-unspecific or generalizable tasks effectively. This may be because overfitting during training causes the model to forget its general knowledge, which strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine perspective, preserving essential general knowledge, which mitigates the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM significantly improves performance on target-unspecific tasks (base-to-novel generalization, domain generalization, and cross-dataset generalization), achieving state-of-the-art performance.
Article 422
Title@2025-05-28 (3): Knowledge Base Construction for Knowledge-Augmented Text-to-SQL
Title: Knowledge Base Construction for Knowledge-Augmented Text-to-SQL | Knowledge Base Construction für wissensbasierte Text-zu-SQL | 知识强化文字到SQL知识基础建设 2505.22096v1 |
Authors: Jinheon Baek, Horst Samulowitz, Oktie Hassanzadeh, Dharmashankar Subramanian, Sola Shirai, Alfio Gliozzo, Debarun Bhattacharjya
Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Many recent approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet the parametric knowledge in LLMs may be insufficient to cover all the diverse and domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive: it is constructed from a combination of all the available questions and their associated database schemas along with their relevant knowledge, and it can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.
Article 423
Title@2025-05-28 (3): Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning
Title: Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning | Lernen, Abfragen über Wissensdatenbanken zu routen, um schrittweise retrieval-augmented reasoning | 学习如何通过不同知识库的路径查询,以逐步检索推荐理由 2505.22095v1 |
Authors: Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
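The idea of step-specific rewards combined with group-relative optimization can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the standard group-relative normalization (reward minus group mean, divided by group standard deviation) applied separately at each reasoning step, and it assumes every sampled trajectory has the same number of steps. The function names and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and (roughly) unit variance."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_advantages(step_rewards):
    """step_rewards[i][t] is the reward of sampled trajectory i at step t.

    Assumption: all trajectories in the group have equal length, so each
    step can be normalized against the same step of the other samples.
    """
    num_steps = len(step_rewards[0])
    per_step = [
        group_relative_advantages([traj[t] for traj in step_rewards])
        for t in range(num_steps)
    ]
    # Transpose back to one advantage list per trajectory.
    return [
        [per_step[t][i] for t in range(num_steps)]
        for i in range(len(step_rewards))
    ]
```

Under this scheme a trajectory can receive a positive advantage for a well-routed intermediate query and a negative one for a weak final answer, rather than a single score for the whole rollout.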
Article 424
Title@2025-05-28 (3): Visual Cues Support Robust Turn-taking Prediction in Noise
Title: Visual Cues Support Robust Turn-taking Prediction in Noise | Visuelle Queues unterstützen robuste Turn-Take Vorhersage in Lärm | 视觉剖面支持强力转动噪音预测 2505.22088v1 |
Authors: Sam O’Connor Russell, Naomi Harte
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features to better exploit visual cues, with 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
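For readers unfamiliar with how a condition such as "10 dB music noise" is usually produced, the sketch below scales a noise signal so that the speech-to-noise power ratio matches a target SNR in decibels before mixing. It is a generic textbook construction, not code from the paper; the function names are hypothetical and the signals are plain Python lists of samples of equal length with non-zero noise power.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add it.

    Returns the noisy mixture and the scaled noise.
    """
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise
```

At 10 dB the speech carries ten times the power of the noise; lower SNR values make the noise progressively more dominant.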
Article 425
Title@2025-05-28 (3): Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations
Title: Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations | Domain-spezifisches Pruning von großen Mixture-of-Experts-Modellen mit nur wenigen Demonstrationen | 大型混合型专家模型的域特定情景,少发示范 2504.06792v2 |
Authors: Zican Dong, Han Peng, Peiyu Liu, Wayne Xin Zhao, Dong Wu, Feng Xiao, Zhifeng Wang
Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few in-domain demonstrations, the model consistently activates a sparse and stable subset of experts on tasks within the same domain. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and L2 norm of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities before and after routed experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that, with only half the experts retained, our method achieves performance comparable to the full model and $2.99\times$ throughput under the same memory budget.
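The two components described above can be combined into a single per-expert score in many ways; the sketch below shows one simple possibility. It is an assumption-laden illustration, not the EASY-EP formula: the per-token expert importance is taken as gating score times the L2 norm of the expert output, the token contribution is taken as one minus the cosine similarity between the hidden state before and after the routed experts, and the two are multiplied and summed over tokens. The dictionary layout and all function names are hypothetical.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def expert_scores(tokens, num_experts):
    """Accumulate an importance score per expert over demonstration tokens.

    Each token is a dict with (hypothetical layout):
      "before", "after": hidden state before / after the routed experts
      "activated": list of (expert_id, gating_score, expert_output)
    """
    scores = [0.0] * num_experts
    for tok in tokens:
        # Tokens whose representation changes more are weighted more (assumption).
        contribution = 1.0 - cosine(tok["before"], tok["after"])
        for expert_id, gate, output in tok["activated"]:
            scores[expert_id] += contribution * gate * l2_norm(output)
    return scores


def experts_to_keep(scores, budget):
    """Return the ids of the `budget` highest-scoring experts, sorted by id."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:budget])
```

Experts outside the returned set would be dropped from memory for the target domain; the budget (for example, half the experts) is chosen by the user.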
Article 426
Title@2025-05-28 (3): LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Title: LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation | LongReD: Degradierung von Langtext-Großen Sprachmodellen durch Restaurationsdestillation | LongReD:通过恢复蒸馏减少长长长大语言模型的短期退化 2502.07365v3 |
Authors: Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model’s short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
Article 427
Title@2025-05-28 (3): ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation
Title: ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation | ArgInstruct: Spezialisierte Instruktion Feintuning für Computerargumentierung | ArgInstruct: 计算参数专业指示精度调整 2505.22076v1 |
Authors: Maja Stahl, Timon Ziegenbein, Joonsuk Park, Henning Wachsmuth
Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA tasks while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs’ capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.
Article 428
Title@2025-05-28 (3): GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking
Title: GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking | GraphCheck: Langfristige Textbarrieren mit extrahiertem Wissen durchbrechen Graph-Powered Fact-Checking | 图表检查:利用提取知识图示根据事实进行实况调查打破长期文本障碍 2502.16514v4 |
Authors: Yingjian Chen, Haoran Liu, Yinhong Liu, Jinxiang Xie, Rui Yang, Han Yuan, Yanran Fu, Peng Yuan Zhou, Qingyu Chen, James Caverlee, Irene Li
Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. These errors are fatal in some specialized domains such as medicine. Existing methods for fact-checking against grounding documents face two main challenges: (1) they struggle to understand complex multihop relations in long documents, often overlooking subtle factual errors; (2) most specialized methods rely on pairwise comparisons, requiring multiple model calls, leading to high resource and computational costs. To address these challenges, we propose GraphCheck, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs as a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains that are often overlooked by existing methods, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate up to a 7.1% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves performance comparable to state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.
Article 429
Title@2025-05-28 (3): PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Title: PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models | PRMBench: Ein feinkörniger und anspruchsvoller Benchmark für Prozess-Level-Reward-Modelle | PRMBench:进程一级奖励模式的精细和质疑基准 2501.03124v4 |
Authors: Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs’ performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.
Article 430
Title@2025-05-28 (3): Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO
Title: Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO | Jenseits der Pfadauswahl: Bessere LLMs für wissenschaftliche Information Extraktion mit MimicSFT und Relevanz und Regel-induziert (R$^2$)GRPO | 超出选择路径范围:与 MimicSFT和相关性及规则引起的科学信息提取更好的LLMs(2雷亚尔) 2505.22068v1 |
Authors: Ran Li, Shimin Di, Yuchen Liu, Chen Jing, Yu Qiu, Lei Chen
Previous studies suggest that, for powerful Large Language Models (LLMs) on math tasks, Reinforcement Learning with Verifiable Rewards (RLVR) only refines the reasoning path without improving reasoning capacity, whereas supervised fine-tuning (SFT) with distillation can improve it. We study this from the perspective of scientific information extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose two-stage training with (1) MimicSFT, which uses structured reasoning templates without needing high-quality chain-of-thought data, and (2) R$^2$GRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve the reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.
Article 431
Title@2025-05-28 (3): LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Title: LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation | LINGOLY-TOO: Entwirren von Vernunft aus Wissen mit templatisierter Orthografie-Verschleißung | LINGOLY-TOO: 脱离与电磁矫形模糊学知识脱钩的原因 2503.02972v5 |
Authors: Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacsu, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
The expanding knowledge and memorisation capacity of frontier language models allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood problems are directly solvable with models’ knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that Large Language Models’ (LLMs) reasoning faculty remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models’ internalised knowledge when developing reasoning benchmarks.
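To make the idea of orthographic obfuscation concrete, the sketch below applies a consistent, seeded character substitution to a text, mapping vowels only to vowels and consonants only to consonants so that word shapes stay plausible. This is a toy stand-in: the benchmark's actual rulesets are linguistically informed and language-specific, and nothing about them is taken from the abstract beyond the idea of permuting problems while preserving the reasoning steps. The names `make_mapping` and `obfuscate`, and the Latin-alphabet restriction, are assumptions.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def make_mapping(seed):
    """Build a seeded letter permutation that keeps vowels and consonants apart."""
    rng = random.Random(seed)
    mapping = {}
    for group in (VOWELS, CONSONANTS):
        shuffled = list(group)
        rng.shuffle(shuffled)
        mapping.update(zip(group, shuffled))
    return mapping


def obfuscate(text, mapping):
    """Apply the mapping to every letter, preserving case and other characters."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in mapping:
            new = mapping[lower]
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)  # digits, spaces, punctuation stay unchanged
    return "".join(out)
```

Because the substitution is a bijection applied uniformly, every pattern a solver needs (repeated morphemes, shared affixes) survives, while the surface forms no longer match anything a model may have memorised. Different seeds give different permutations of the same problem.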
Article 432
Title@2025-05-28 (3): Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks
Title: Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks | Walk&Retrieve: Einfache und dennoch effektive Null-Schuss-Erzeugung durch Knowledge Graph Walks | 漫步检索: 简单但有效的零光检索通过知识图表漫步生成 2505.16849v2 |
Authors: Martin Böckling, Heiko Paulheim, Andreea Iana
Large Language Models (LLMs) have showcased impressive reasoning abilities, but often suffer from hallucinations or outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) remedies these shortcomings by grounding LLM responses in structured external information from a knowledge base. However, many KG-based RAG approaches struggle with (i) aligning KG and textual representations, (ii) balancing retrieval accuracy and efficiency, and (iii) adapting to dynamically updated KGs. In this work, we introduce Walk&Retrieve, a simple yet effective KG-based framework that leverages walk-based graph traversal and knowledge verbalization for corpus generation for zero-shot RAG. Built around efficient KG walks, our method does not require fine-tuning on domain-specific data, enabling seamless adaptation to KG updates, reducing computational overhead, and allowing integration with any off-the-shelf backbone LLM. Despite its simplicity, Walk&Retrieve performs competitively, often outperforming existing RAG systems in response accuracy and hallucination reduction. Moreover, it demonstrates lower query latency and robust scalability to large KGs, highlighting the potential of lightweight retrieval strategies as strong baselines for future RAG research.
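A minimal sketch of walk-based traversal followed by verbalization is shown below. It enumerates all cycle-free walks of bounded length from a start entity over a list of (head, relation, tail) triples and turns each walk into a sentence that could be indexed as a retrieval corpus. The traversal strategy, the `max_hops` bound, the sentence template, and all function names are assumptions made for illustration; the abstract does not describe the paper's exact walk or verbalization procedure.

```python
def build_adjacency(triples):
    """Index outgoing edges by head entity."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    return adjacency


def walks_from(adjacency, start, max_hops):
    """Return every cycle-free walk of 1..max_hops edges starting at `start`."""
    walks = []
    stack = [(start, [], {start})]
    while stack:
        node, path, seen = stack.pop()
        if path:
            walks.append(path)
        if len(path) == max_hops:
            continue
        for relation, tail in adjacency.get(node, []):
            if tail not in seen:
                stack.append((tail, path + [(node, relation, tail)], seen | {tail}))
    return walks


def verbalize(walk):
    """Turn a walk into plain text using a fixed template (an assumption)."""
    parts = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in walk]
    return " and ".join(parts) + "."
```

Since the corpus is derived directly from the graph, updating the KG only requires regenerating the walks for the affected entities; no model fine-tuning is involved.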
Article 433
Title@2025-05-28 (3): Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?
Title: Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home? | Schutz der Privatsphäre von Retrieval-Daten gegen Mitgliedschaft Inferenz Angriffe: Ist diese Frage zu nah zu Hause? | 保护检索数据隐私,防止成员推断攻击:这个查询是否离家太近? 2505.22061v1 |
Authors: Yujin Choi, Youngjoo Park, Junyoung Byun, Jaewook Lee, Jinseong Park
Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for specific, personalized applications. However, passing private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target datum exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce Mirabel, a similarity-based MIA detection framework designed for the RAG system. With the proposed Mirabel, we show that simple detect-and-hide strategies can successfully obfuscate attackers, maintain data utility, and remain system-agnostic. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing private RAG systems.
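The stated insight, that a membership inference query tends to be highly similar to exactly one stored document, suggests a simple detect-and-hide rule, sketched below. The gap test between the top two similarity scores, the `gap_threshold` value, and the function names are all assumptions chosen for illustration; the abstract does not specify the statistic or threshold that Mirabel uses.

```python
def is_suspicious(similarities, gap_threshold=0.3):
    """Flag a query whose best match stands far above the second best.

    Hypothetical rule: a large top-1 / top-2 gap means the query targets
    a single document, which is typical of membership inference probes.
    """
    if len(similarities) < 2:
        return False
    ranked = sorted(similarities, reverse=True)
    return ranked[0] - ranked[1] > gap_threshold


def detect_and_hide(documents, similarities, gap_threshold=0.3):
    """Return the documents to pass to the LLM, hiding the targeted one if flagged."""
    if not is_suspicious(similarities, gap_threshold):
        return list(documents)
    top = max(range(len(documents)), key=lambda i: similarities[i])
    return [doc for i, doc in enumerate(documents) if i != top]
```

Ordinary queries, whose similarity scores decay gradually across several related documents, pass through unchanged, which is how such a rule can preserve utility for benign use.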
Article 434
Title@2025-05-28 (3): A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
Title: A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment | Eine umfassende Umfrage in LLM(-Agent) Full Stack Sicherheit: Daten, Schulung und Bereitstellung | 用LLLM(-代理)全堆安全:数据、培训和部署进行的全面调查 2504.15585v3 |
Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu
The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire “lifechain” of LLMs. To address this gap, this paper introduces, for the first time, the concept of “full-stack” safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to existing LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of more than 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.
Article 435
Title@2025-05-28 (3): Benchmarking LLMs’ Swarm intelligence
Title: Benchmarking LLMs’ Swarm intelligence | Benchmarking der Swarm-Intelligenz der LLM | 基准确定LLLMs的Swarm情报 2505.04364v3 |
Authors: Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict swarm-like constraints (limited local perception and communication) remains largely unexplored. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely solely on local sensory input ($k\times k$ view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Zero-shot evaluations of leading LLMs (e.g., deepseek-v3, o4-mini) reveal significant task-dependent performance variations. While some rudimentary coordination is observed, our results indicate that current LLMs significantly struggle with robust long-range planning and adaptive strategy formation under the uncertainty inherent in these decentralized scenarios. Assessing LLMs under such swarm-like constraints is crucial for understanding their utility in future decentralized intelligent systems. We release SwarmBench as an open, extensible toolkit, built on a customizable physical system, providing environments, prompts, evaluation scripts, and comprehensive datasets. This aims to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of emergent collective behavior under severe informational decentralization. Our code repository is available at https://github.com/x66ccff/swarmbench.
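As a rough illustration of the local-perception constraint mentioned in the abstract, the Python sketch below extracts a $k\times k$ window around one agent on a 2D grid. The function name `local_view`, the cell symbols, and the padding character are hypothetical choices made for this example; they are not taken from the SwarmBench repository.

```python
from typing import List


def local_view(grid: List[List[str]], row: int, col: int, k: int,
               pad: str = "#") -> List[List[str]]:
    """Return the k x k window of `grid` centred on (row, col).

    Cells that fall outside the grid are filled with `pad` (an assumed
    convention; a real environment may encode walls differently).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the agent sits at the centre")
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else pad)
        view.append(line)
    return view


# Toy 3 x 4 grid with two agents, "A" and "B".
grid = [list(".A.."), list("...."), list("..B.")]
view = local_view(grid, row=0, col=1, k=3)
```

With this window, agent "A" sees the padded border above it and its immediate neighbours, but not agent "B"; anything beyond the window would have to be learned through local communication.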
Article 436
Title@2025-05-28 (3): WiseMind: Recontextualizing AI with a Knowledge-Guided, Theory-Informed Multi-Agent Framework for Instrumental and Humanistic Benefits
Title: WiseMind: Recontextualizing AI with a Knowledge-Guided, Theory-Informed Multi-Agent Framework for Instrumental and Humanistic Benefits | WiseMind: Rekontextualisieren von KI mit einem wissensorientierten, theorieinformierten Multi-Agenten-Rahmenwerk für instrumentelle und humanistische Vorteile | Wisemind: 重新将AI与知识指导、理论化的多机构工具与人文效益多机构框架重新翻版 2502.20689v2 |
Authors: Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang, Jie Chen
Translating state-of-the-art NLP into practice often stalls at the “last mile” owing to insufficient contextualization of the target domain’s knowledge, processes, and evaluation. Psychiatric differential diagnosis exemplifies this challenge: accurate assessments depend on nuanced clinical knowledge, a delicate cognitive-affective interview process, and downstream outcomes that extend far beyond benchmark accuracy. We present WiseMind, a systematic interdisciplinary contextualization framework that delivers both instrumental (diagnostic precision) and humanistic (empathy) gains. WiseMind comprises three components: (i) structured knowledge-guided proactive reasoning, which embeds DSM-5 criteria in a knowledge graph to steer questioning; (ii) a theory-informed dual-agent architecture that coordinates a “reasonable-mind” reasoning agent and an “emotional-mind” empathy agent, inspired by Dialectical Behavior Therapy; and (iii) a multi-faceted evaluation strategy covering simulated patients, user studies, clinician review, and ethical assessment. Tested on depression, anxiety, and bipolar disorder, WiseMind attains up to 84.2% diagnostic accuracy, which is comparable to human experts, while outperforming single-agent baselines in perceived empathy and trustworthiness. These results show that deep contextualization across the knowledge, process, and evaluation layers can transform benchmark-driven NLP into clinically meaningful impact.
Article 437
Title@2025-05-28 (3): Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Title: Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective | Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive | 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v3 |
Authors: Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs’ implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs’ inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development. Our code, data and benchmarks are available at https://github.com/yuchenwen1/ImplicitBiasPsychometricEvaluation and https://github.com/yuchenwen1/BUMBLE.
Article 438
Title@2025-05-28 (3): Voice Adaptation for Swiss German
Title: Voice Adaptation for Swiss German | Sprachanpassung für Schweizer Deutsch | 瑞士德语语音改造 2505.22054v1 |
Authors: Samuel Stucki, Jan Deriu, Mark Cieliebak
This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
Article 439
Title@2025-05-28 (3): CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles | CoSER: Koordinierung der LLM-basierten Persona-Simulation etablierter Rollen | CSER: 协调LLM-以人为基础模拟既定角色 2502.09082v2 |
Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.
Article 440
Title@2025-05-28 (3): In-context Language Learning for Endangered Languages in Speech Recognition
Title: In-context Language Learning for Endangered Languages in Speech Recognition | Im Zusammenhang mit dem Sprachenlernen für gefährdete Sprachen in der Spracherkennung | 在语音识别中为濒危语言进行内通语言学习 2505.20445v2 |
Authors: Zhaolin Li, Jan Niehues
With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.
Article 441
Title@2025-05-28 (3): KaFT: Knowledge-aware Fine-tuning for Boosting LLMs’ Domain-specific Question-Answering Performance
Title: KaFT: Knowledge-aware Fine-tuning for Boosting LLMs’ Domain-specific Question-Answering Performance | KaFT: Knowledge-aware Feinabstimmung zur Steigerung der Domain-spezifischen Frage-Antwort-Leistung von LLMs | KAFT: 提高LLM女士具体领域问题解答性能的有知识意识微调 2505.15480v2 |
Authors: Qihuang Zhong, Liang Ding, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that, due to conflicts between LLMs’ internal knowledge and the context knowledge of the training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs’ performance. The core of KaFT is to adapt the training weight by assigning different rewards for different training samples according to conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. Further analyses show that KaFT effectively improves model generalization and alleviates hallucination.
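The idea of adapting the training weight by conflict level can be sketched as a weighted average of per-sample losses. This is only a minimal illustration under stated assumptions: the three conflict levels, the weight values, and the function name `conflict_weighted_loss` are hypothetical, and the abstract does not say how KaFT actually computes its rewards.

```python
from typing import Dict, Sequence


def conflict_weighted_loss(losses: Sequence[float],
                           conflict_levels: Sequence[str],
                           level_weights: Dict[str, float]) -> float:
    """Weighted mean of per-sample losses, one weight per conflict level."""
    if len(losses) != len(conflict_levels):
        raise ValueError("need exactly one conflict level per loss")
    weights = [level_weights[level] for level in conflict_levels]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one sample must have a non-zero weight")
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Hypothetical weights: high-conflict samples are down-weighted, not dropped.
LEVEL_WEIGHTS = {"low": 1.0, "medium": 0.5, "high": 0.25}

loss = conflict_weighted_loss(
    losses=[1.0, 3.0],
    conflict_levels=["low", "high"],
    level_weights=LEVEL_WEIGHTS,
)
```

Keeping a small non-zero weight for high-conflict samples mirrors the abstract's finding that applying conflict data appropriately can be more beneficial than filtering it out.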
Article 442
Title@2025-05-28 (3): Revisiting In-Context Learning with Long Context Language Models
Title: Revisiting In-Context Learning with Long Context Language Models | Das In-Context-Lernen mit langen Kontext-Sprachmodellen | 以长方语言模式重新研究内文学习 2412.16926v3 |
Authors: Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar
In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we discover that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.
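The simple random selection baseline described in the abstract can be sketched as follows: shuffle the example pool and keep adding demonstrations until a token budget is reached. This is an illustrative sketch, not the authors' code; the function name `fill_context` is hypothetical, and counting tokens by whitespace splitting is an assumption standing in for a real tokenizer.

```python
import random
from typing import List, Sequence, Tuple


def fill_context(examples: Sequence[str], budget_tokens: int,
                 seed: int = 0) -> Tuple[List[str], int]:
    """Randomly pick examples until the (approximate) token budget is used up.

    Returns the chosen examples and the number of tokens they occupy.
    """
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    chosen: List[str] = []
    used = 0
    for example in pool:
        cost = len(example.split())  # crude whitespace token count (assumption)
        if used + cost > budget_tokens:
            continue  # skip examples that no longer fit
        chosen.append(example)
        used += cost
    return chosen, used


examples = ["great movie => positive", "dull plot => negative", "loved it => positive"]
chosen_all, used_all = fill_context(examples, budget_tokens=100)
chosen_few, used_few = fill_context(examples, budget_tokens=5)
```

When `used_all` is far below the budget, as in this toy call, the pool is too small to fill the window, which is the situation where the abstract reports gains from augmenting the examples.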
Article 443
Title@2025-05-28 (3): FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis
Title: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis | FCKT: Feinkörniger Cross-Task-Wissenstransfer mit semantischem Kontrast-Lernen für gezielte Stimmungsanalyse | FCKT: 精细的跨任务知识转让,通过语义对抗学习进行有针对性的感应分析 2505.21040v2 |
Authors: Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on https://github.com/cwei01/FCKT.
Article 444
Title@2025-05-28 (3): Wolf Hidden in Sheep’s Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models
Title: Wolf Hidden in Sheep’s Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models | Wolf versteckte sich in Schafsgesprächen: Auf dem Weg zu harmlosen datenbasierten Hintertürangriffen für Jailbreaking Large Language Models | 隐藏在羊羊的谈话中的狼:为破碎大语言模范破碎的监狱进行无恶意的以数据为基础的后门攻击 2505.17601v2 |
Authors: Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model’s safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak various backdoored LLMs even under the detection of guardrail models, e.g., achieving an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B, respectively, as judged by GPT-4o.
Article 445
Title@2025-05-28 (3): Jailbreak Distillation: Renewable Safety Benchmarking
Title: Jailbreak Distillation: Renewable Safety Benchmarking | Jailbreak Destillation: Benchmarking für erneuerbare Sicherheit | 蒸馏:可再生能源安全基准 2505.22037v1 |
Authors: Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, Kyle Jackson
Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that “distills” jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
Article 446
Title@2025-05-28 (3): Inference-time Alignment in Continuous Space
Title: Inference-time Alignment in Continuous Space | Inferenz-Zeit-Ausrichtung im Dauerraum | 连续空间的推推-时间对齐 2505.20081v2 |
Authors: Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/sea
Article 447
Title@2025-05-28 (3): Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game
Title: Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game | Feinkörnige und thematische Bewertung von LLMs im Social Deduction Game | 社会下社会游戏LLMs的精细和专题评价 2408.09946v2 |
Authors: Byungjun Kim, Dayeon Seo, Bugeun Kim
Recent studies have investigated whether large language models (LLMs) can support obscure communication that requires specialized skills, such as inferring subtext or doublespeak. To conduct the investigation, researchers have used social deduction games (SDGs) as their experimental environment, in which players conceal and infer specific information. However, prior work has often overlooked how LLMs should be evaluated in such settings. Specifically, we point out two issues with the evaluation methods they employed. First, metrics used in prior studies are coarse-grained, as they are based on overall game outcomes that often fail to capture event-level behaviors. Second, error analyses have lacked structured methodologies capable of producing insights that meaningfully support evaluation outcomes. To address these issues, we propose a macroscopic and systematic approach to the investigation. Specifically, we introduce seven fine-grained metrics that resolve the first issue. To tackle the second issue, we conducted a thematic analysis and identified four major reasoning failures that undermine LLMs’ performance in obscured communication.
Article 448
Title@2025-05-28 (3): Shaping Shared Languages: Human and Large Language Models’ Inductive Biases in Emergent Communication
Title: Shaping Shared Languages: Human and Large Language Models’ Inductive Biases in Emergent Communication | Shaping Shared Languages: Induktive Biase von menschlichen und großen Sprachmodellen in Emergent Communication | 塑造共同语言:新兴交流中的人类和大语言模型的感性偏见 2503.04395v2 |
Authors: Tom Kouwenhoven, Max Peeperkorn, Roy de Kleijn, Tessa Verhoef
Languages are shaped by the inductive biases of their users. Using a classical referential game, we investigate how artificial languages evolve when optimised for inductive biases in humans and large language models (LLMs) via Human-Human, LLM-LLM and Human-LLM experiments. We show that referentially grounded vocabularies emerge that enable reliable communication in all conditions, even when humans and LLMs collaborate. Comparisons between conditions reveal that languages optimised for LLMs subtly differ from those optimised for humans. Interestingly, interactions between humans and LLMs alleviate these differences and result in vocabularies more human-like than LLM-like. These findings advance our understanding of the role inductive biases in LLMs play in the dynamic nature of human language and contribute to maintaining alignment in human and machine communication. In particular, our work underscores the need to think of new LLM training methods that include human interaction and shows that using communicative success as a reward signal can be a fruitful, novel direction.
Article 449
Title@2025-05-28 (3): VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Title: VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning | VRAG-RL: Empower Vision-Perception-Based RAG für visuell reiches Informationsverständnis über iteratives Reasoning mit Verstärkungslernen | VRAG-RL: 通过强化学习的迭代理由,增强基于愿景-观点的RAG, 以便通过强化学习获得视觉上丰富的信息了解 2505.22019v1 |
Authors: Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) when models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users’ original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
Article 450
Title@2025-05-28 (3): CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models
Title: CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models | CoThink: Token-Efficient Reasoning über Instruct Models Guiding Reasoning Models | COTHING: 通过指示型号指导理由依据模型 2505.22017v1 |
Authors: Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun
Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. By comparing these models with equally sized instruct models, we identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps. Since LLMs cannot assess the difficulty of a given problem, they tend to apply the same cautious reasoning strategy across all tasks, resulting in inefficient overthinking. To address this, we propose CoThink, an embarrassingly simple pipeline: an instruct model first drafts a high-level solution outline; a reasoning model then works out the solution. We observe that CoThink enables dynamic adjustment of reasoning depth based on input difficulty. Evaluated with three reasoning models (DAPO, DeepSeek-R1, and QwQ) on three datasets (GSM8K, MATH500, and AIME24), CoThink reduces total token generation by 22.3% while maintaining pass@1 accuracy within a 0.42% margin on average. With reference to the instruct model, we formally define reasoning efficiency and observe a potential reasoning efficiency scaling law in LLMs.
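The two-stage pipeline in the abstract (an instruct model drafts an outline, then a reasoning model works out the solution) can be sketched with two plain callables. Everything below is a hypothetical illustration: the function name `cothink`, the prompt wording, and the stub models are assumptions and do not come from the CoThink paper or any model API.

```python
from typing import Callable

# A "model" here is any function mapping a prompt string to a completion string.
Model = Callable[[str], str]


def cothink(problem: str, instruct_model: Model, reasoning_model: Model) -> str:
    """Stage 1: the instruct model drafts an outline.
    Stage 2: the reasoning model solves the problem with that outline as a guide.
    """
    outline = instruct_model(
        "Draft a high-level solution outline (do not solve):\n" + problem
    )
    prompt = (
        f"Problem:\n{problem}\n\nOutline:\n{outline}\n\n"
        "Work out the full solution."
    )
    return reasoning_model(prompt)


def stub_instruct(prompt: str) -> str:
    # Stand-in for a real instruct model.
    return "1. Add the two numbers."


def stub_reasoner(prompt: str) -> str:
    # Stand-in for a real reasoning model; reports whether it saw the outline.
    return "outline seen" if "1. Add the two numbers." in prompt else "no outline"


result = cothink("What is 2 + 3?", stub_instruct, stub_reasoner)
```

In practice the two stubs would be replaced by calls to an instruct model and a reasoning model of comparable size; the sketch only shows how the outline is threaded from the first stage into the second.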
Article 451
Title@2025-05-28 (3): Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains
Title: Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains | Domaino1s: Leitende LLM-Gründung für erklärbare Antworten in High-Stakes-Domains | 域1:在高占用域中解释可解答案的 指导性LLM 2501.14431v2 |
Authors: Xu Chu, Zhijie Tan, Hanlin Xue, Guanyu Wang, Tong Mo, Weiping Li
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users’ confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domaino1s, which enhances LLMs’ reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models’ explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s’s leading performance and explainability. Our code is available at https://github.com/Hyalinesky/Domaino1s.
Article 452
Title@2025-05-28 (3): CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Title: CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models | CogniBench: Ein gesetzlich inspirierter Rahmen und Datensatz zur Bewertung der kognitiven Treue großer Sprachmodelle | CogniBench:评估大语言模型认知性信仰的受法律启发的框架和数据集 2505.20767v2 |
Authors: Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) that are not supported by the contexts provided to the LLM. Lacking an assessment standard, existing benchmarks only contain “factual statements” that rephrase source materials, without marking “cognitive statements” that make inferences from the given context, making the consistency evaluation and optimization of cognitive statements difficult. Inspired by how evidence is assessed in the legislative domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and create a benchmark dataset where we reveal insightful statistics. We design an annotation pipeline to create larger benchmarks for different LLMs automatically, and the resulting larger-scale CogniBench-L dataset can be used to train an accurate cognitive hallucination detection model. We release our model and dataset at: https://github.com/FUTUREEEEEE/CogniBench
Article 453
Title@2025-05-28 (3): Faster and Better LLMs via Latency-Aware Test-Time Scaling
Title: Faster and Better LLMs via Latency-Aware Test-Time Scaling | Schnellere und bessere LLMs über Latency-Aware Test-Time Scaling | 通过远程智能测试时间缩放,更快和更好LLMs 2505.19634v3 |
Authors: Zili Wang, Tianyu Zhang, Lei Zhu, Haoli Bai, Lu Hou, Shiming Xiang, Xianzhi Yu, Wulong Liu
Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
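Branch-wise parallelism, as described in the abstract, runs several inference branches concurrently instead of one after another. The sketch below illustrates the idea with Python threads and a stub generator. The names `run_branches`, `majority_answer`, and `fake_generate` are hypothetical, and the majority vote used to combine branches is an assumption made for this example, since the abstract does not state how branch outputs are aggregated.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_branches(generate: Callable[[int], str], n_branches: int) -> List[str]:
    """Run `n_branches` independent generations concurrently.

    Results are returned in branch order (Executor.map preserves input order).
    """
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        return list(pool.map(generate, range(n_branches)))


def majority_answer(answers: List[str]) -> str:
    """Pick the most frequent answer (assumed aggregation rule)."""
    return Counter(answers).most_common(1)[0][0]


def fake_generate(branch_id: int) -> str:
    # Stand-in for one sampled model response; every third branch disagrees.
    return "41" if branch_id % 3 == 0 else "42"


answers = run_branches(fake_generate, n_branches=6)
final = majority_answer(answers)
```

With a real model behind `generate`, wall-clock latency is governed by the slowest branch rather than the sum of all branches, which is why the concurrency configuration matters when latency, not total compute, is the budget.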
nan
Article 454
Title@2025-05-28 (3): Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance
Title: Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance | Legal Assist KI: Nutzung von Transformer-basiertem Modell für effektive Rechtshilfe | AI:利用基于变换器的有效法律援助模式 2505.22003v1 |
Authors: Jatin Gupta, Akhil Sharma, Saransh Singhania, Ali Imam Abidi
The pursuit of accessible legal assistance in India faces a critical gap, as many citizens struggle to leverage their legal rights due to limited awareness and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including the Indian Constitution, Bharatiya Nyaya Sanhita (BNS), Bharatiya Nagarik Suraksha Sanhita (BNSS) and so forth, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal Question-Answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. This showcases the model’s applicability in real-world legal scenarios, with future iterations aiming to enhance performance and expand its dataset to cover a broader range of multilingual and case-specific queries as well.
nan
Article 455
Title@2025-05-28 (3): Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations
Title: Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations | Vergleich von Moralwerten in westlichen englischsprachigen Gesellschaften und LLMs mit Word Associations | 比较西英语社会道德价值和LLMs与文字协会 2505.19674v2 |
Authors: Chaoyi Xiang, Chunhua Liu, Simon De Deyne, Lea Frermann
As the impact of large language models increases, understanding the moral values they reflect becomes ever more important. Assessing the nature of moral values as understood by these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs’ moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral conceptualizations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.
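The abstract does not specify how moral values are propagated from seed words through the association graphs. As a rough illustration only, the sketch below assumes a simple iterative neighbour-averaging scheme with clamped seeds; the function name `propagate_scores`, the `damping` parameter, and the toy edges are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

def propagate_scores(edges, seeds, iterations=10, damping=0.5):
    """Spread seed scores over an undirected, weighted word-association graph.

    edges: iterable of (cue, response, weight) triples with positive weights.
    seeds: dict mapping seed words to a fixed score (e.g. +1.0 for "care").
    Non-seed words repeatedly move toward the weighted mean of their neighbours.
    """
    neighbours = defaultdict(dict)
    for cue, response, weight in edges:
        neighbours[cue][response] = weight
        neighbours[response][cue] = weight

    scores = {word: seeds.get(word, 0.0) for word in neighbours}
    for _ in range(iterations):
        updated = {}
        for word, adjacent in neighbours.items():
            if word in seeds:
                updated[word] = seeds[word]  # seed scores stay clamped
                continue
            total = sum(adjacent.values())
            mean = sum(scores[n] * w for n, w in adjacent.items()) / total
            updated[word] = (1 - damping) * scores[word] + damping * mean
        scores = updated
    return scores

# Toy association graph (hypothetical data, not from the paper's dataset).
edges = [("care", "nurse", 1.0), ("nurse", "hospital", 1.0), ("harm", "hurt", 1.0)]
scores = propagate_scores(edges, seeds={"care": 1.0, "harm": -1.0})
```

In this toy graph, words closer to a positive seed end up with higher scores than words further away, and words attached to a negative seed end up negative. Running the same routine on a human graph and an LLM-generated graph would give two score tables that can then be compared word by word.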
nan
Article 456
Title@2025-05-28 (3): Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate
Title: Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | Gefunden in Übersetzung: Mehrsprachige LLM-Konsistenz so einfach wie übersetzen dann bewerten | 在翻译中找到: 测量多语种LLM一致性, 简单如翻译,然后评价 2505.21999v1 |
Authors: Ashim Gupta, Maitrey Mehta, Zhichao Xu, Vivek Srikumar
Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent at responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate LLMs’ cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
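A minimal sketch of the Translate then Evaluate idea, under stated assumptions: every response is first translated into English and the English versions are then compared pairwise. The `translate` callable, the toy lookup table, and the Jaccard token overlap used as the similarity measure are hypothetical stand-ins; the abstract does not say which translation system or which information and empathy metrics the authors use.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for a real consistency metric."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def cross_lingual_consistency(responses, translate, similarity=jaccard):
    """Translate every response into English, then average pairwise similarity.

    responses: dict mapping a language code to the model's response in that language.
    translate: callable (text, source_language) -> English text (hypothetical).
    """
    english = {lang: translate(text, lang) for lang, text in responses.items()}
    pairs = list(combinations(sorted(english), 2))
    if not pairs:
        return 1.0
    return sum(similarity(english[a], english[b]) for a, b in pairs) / len(pairs)

# Toy translator backed by a lookup table; a real setup would call an MT model.
TOY_TABLE = {"trinke wasser": "drink water", "bois de l'eau": "drink water"}

def toy_translate(text, lang):
    return text if lang == "en" else TOY_TABLE.get(text.lower(), text)

score = cross_lingual_consistency(
    {"en": "drink water", "de": "Trinke Wasser", "fr": "Bois de l'eau"},
    toy_translate,
)
```

Because the comparison happens after translation, the same similarity function can be reused for every language pair, which is what removes the need for per-language annotated references in this sketch.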
nan
Article 457
Title@2025-05-28 (3): Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data
Title: Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data | Leveraging Interview-informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data | 利用访谈形成的LLMs参与示范调查应对措施:从AI光学和人类数据中比较洞察力 2505.21997v1 |
Authors: Jihong Zhang, Xinya Liang, Anqi Deng, Nicole Bonge, Lin Tan, Ling Zhang, Nicole Zarrett
Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.
nan
Article 458
Title@2025-05-28 (3): A Checks-and-Balances Framework for Context-Aware Ethical AI Alignment
Title: A Checks-and-Balances Framework for Context-Aware Ethical AI Alignment | Ein Checks-and-Balances-Framework für kontext-aware Ethische AI Alignment | 上下文软件道德操守统一校验和平衡框架 2502.00136v3 |
Authors: Edward Y. Chang
This paper introduces a checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by three-branch governmental systems. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE as the legislative branch establishing ethical guardrails, and ERIS as the judicial branch for contextual interpretation. Beyond structural separation, we address a fundamental challenge: regulating emotion to shape behaviors. Drawing from psychological theories where managing emotional responses prevents harmful behaviors, we develop a self-supervised learning pipeline that maps emotions to linguistic behaviors, enabling precise behavioral modulation through emotional conditioning. By integrating this approach with adversarial testing, our framework demonstrates how DIKE and ERIS direct linguistic behaviors toward ethical outcomes while preserving independence throughout knowledge generation, ethical oversight, and contextual interpretation.
nan
Article 459
Title@2025-05-28 (3): How to Synthesize Text Data without Model Collapse?
Title: How to Synthesize Text Data without Model Collapse? | Wie können Sie Textdaten ohne Modellkollaps synthesieren? | 如何在没有模式折叠的情况下合成文本数据 ? 2412.14689v3 |
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-${n}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves model performance.
nan
Article 460
Title@2025-05-28 (3): Learning Compositional Behaviors from Demonstration and Language
Title: Learning Compositional Behaviors from Demonstration and Language | Kompositionsverhalten aus Demonstration und Sprache lernen | 学习示范和语言的构成行为 2505.21981v1 |
Authors: Weiyu Liu, Neil Nie, Ruohan Zhang, Jiayuan Mao, Jiajun Wu
We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.
nan
Article 461
Title@2025-05-28 (3): Sun-Shine: A Foundation Large Language Model for Tibetan Culture and Heritage
Title: Sun-Shine: A Foundation Large Language Model for Tibetan Culture and Heritage | Sun-Shine: Ein großes Sprachmodell der Stiftung für tibetische Kultur und Kulturerbe | 阳光:西藏文化和遗产大语言模式基金会 2503.18288v3 |
Authors: Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan’s linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.
nan
Article 462
Title@2025-05-28 (3): Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset | Perle: Ein multimodaler kulturbewusster arabischer Unterrichtsdatensatz | 珍珠:多式文化-知识阿拉伯文教学数据集 2505.21979v1 |
Authors: Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks Pearl and Pearl-Lite along with a specialized subset Pearl-X explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
nan
Article 463
Title@2025-05-28 (3): Advancing Reasoning in Large Language Models: Promising Methods and Approaches
Title: Advancing Reasoning in Large Language Models: Promising Methods and Approaches | Reasoning in großen Sprachmodellen fördern: Promising Methods and Approaches | 大语言模式的推进理由:有希望的方法和办法 2502.03671v2 |
Authors: Avinash Patil, Aryan Jadon
Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning (spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning) often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs.
nan
Article 464
Title@2025-05-28 (3): Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models
Title: Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models | Graph-beschränkte Vernunft: Treue Vernunft auf Wissensgraphen mit großen Sprachmodellen | 受图表限制的理由:关于大语言模型知识图的忠实理由 2410.13080v2 |
Authors: Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Yuan-Fang Li, Chen Gong, Shirui Pan
Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
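The idea of a trie that constrains decoding to valid KG reasoning paths can be illustrated with a small sketch. It assumes word-level tokens and a plain greedy decoder, and `score_fn` is a hypothetical stand-in for an LLM's next-token score; the class and function names are illustrative and not the authors' implementation of KG-Trie.

```python
class PathTrie:
    """Prefix tree over tokenized KG reasoning paths."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def allowed_next(self, prefix):
        """Return tokens that may follow `prefix`; empty set if the prefix is off-graph."""
        node = self.root
        for token in prefix:
            if token not in node:
                return set()
            node = node[token]
        return set(node)


def constrained_greedy_decode(trie, score_fn, max_steps=16):
    """Greedy decoding restricted to continuations that exist in the trie.

    score_fn(prefix, token) -> float stands in for an LLM's next-token score.
    """
    prefix = []
    for _ in range(max_steps):
        candidates = trie.allowed_next(prefix)
        if not candidates:
            break  # reached the end of a stored path
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=lambda t: score_fn(prefix, t))
        prefix.append(best)
    return prefix


trie = PathTrie()
trie.insert(["Paris", "capital_of", "France"])
trie.insert(["Paris", "located_in", "Europe"])

# Hypothetical scorer that prefers the relation "located_in".
path = constrained_greedy_decode(
    trie, lambda prefix, token: 1.0 if token == "located_in" else 0.0
)
```

Since only trie continuations are ever offered to the scorer, the decoded sequence is always a prefix-complete path stored in the index, regardless of how the scorer ranks tokens outside the graph.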
nan
Article 465
Title@2025-05-28 (3): Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA
Title: Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA | Erfahrung Retrieval-Augmentation mit elektronischen Gesundheitsakten ermöglicht genaue Entladung QA | 使用电子健康记录使准确释放QA能够准确释放的经验回收-升级 2503.17933v2 |
Authors: Justice Ou, Tinglin Huang, Yilun Zhao, Ziyang Yu, Peiqing Lu, Rex Ying
To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval-Augmentation (ExpRAG) framework based on Electronic Health Records (EHR), aiming to offer the relevant context from other patients’ discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each problem is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.
nan
Article 466
Title@2025-05-28 (3): Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack
Title: Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | Die Bedrohung sehen: Schwachstellen in Visions-Sprachenmodellen für feindliche Angriffe | 目睹威胁:视觉-语言模型对对抗性攻击的脆弱性 2505.21967v1 |
Authors: Juan Ren, Mark Dras, Usman Naseem
Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model’s output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.
nan
Article 467
Title@2025-05-28 (3): Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing
Title: Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing | Heterogene Token-Übertragung in LLM-Wissensbearbeitung abmildern | 减轻LLLM知识编辑中变异式 Tok 超称 2502.00602v2 |
Authors: Tianci Liu, Ruirui Li, Zihan Dong, Hui Liu, Xianfeng Tang, Qingyu Yin, Linjun Zhang, Haoyu Wang, Jing Gao
Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated knowledge or compromising their pre-trained capabilities. Previous efforts sought to update a small number of parameters of an LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits degraded ability to reason about the new knowledge. In this work, we identify a key issue: heterogeneous token overfitting (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
nan
Article 468
Title@2025-05-28 (3): MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing
Title: MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing | MapStory: LLM-Powered Text-Driven Map Animation Prototyping mit Human-in-the-Loop-Editing | 地图片断: 由LLM 授权的文本驱动地图动画动画与在 Loop 用户编译 2505.21966v1 |
Authors: Aditya Gunturu, Ben Pearman, Keiichi Ihara, Morteza Faraji, Bryan Wang, Rubaiat Habib Kazi, Ryo Suzuki
We introduce MapStory, an LLM-powered animation authoring tool that generates editable map animation sequences directly from natural language text. Given a user-written script, MapStory leverages an agentic architecture to automatically produce a scene breakdown, which decomposes the script into key animation building blocks such as camera movements, visual highlights, and animated elements. Our system includes a researcher component that accurately queries geospatial information by leveraging an LLM with web search, enabling the automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these blocks through an interactive timeline editor. We detail the system’s design and architecture, informed by formative interviews with professional animators and an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.
nan
Article 469
Title@2025-05-28 (3): UI-Evol: Automatic Knowledge Evolving for Computer Use Agents
Title: UI-Evol: Automatic Knowledge Evolving for Computer Use Agents | UI-Evol: Automatisches Knowledge Evolving für Computer Use Agents | UI-Evol:计算机使用代理自动知识演化 2505.21964v1 |
Authors: Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu
External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows that even 90% correct knowledge yields only a 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.
nan
Article 470
Title@2025-05-28 (3): LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents
Title: LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents | LaMDAgent: Autonomer Rahmen für die Post-Training-Pipeline-Optimierung über LLM-Agenten | LaMMDAGenter:通过LLM代理机构优化培训后管道的自治框架 2505.21963v1 |
Authors: Taro Yano, Yoichi Ishibashi, Masafumi Oyamada
Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze the impact of data and model size scaling to reduce computational costs on the exploration, finding that model size scaling introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
nan
Article 471
Title@2025-05-28 (3): EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
Title: EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles | EnsemW2S: Verbesserung der Schwach-zu-Strong-Verallgemeinerung mit großsprachigen Modellensembles | EnsemW2S:用大语言模型组合加强弱至强的通用化 2505.21959v1 |
Authors: Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang
With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at improving weak experts, by training on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements of 4% and 3.2% on ID datasets and up to 6% and 2.28% on OOD datasets for experts and student models, respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.
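To make "token-level ensemble" concrete, the sketch below shows only one possible combination step: a weighted average of the next-token distributions produced by several weak experts. The abstract does not describe how the expert weights are obtained or how experts are refined across iterations, so the weights here are simply given as inputs; the function name and the toy distributions are hypothetical.

```python
def combine_token_distributions(distributions, weights):
    """Weighted average of per-expert next-token probability distributions.

    distributions: list of dicts mapping token -> probability (one dict per expert).
    weights: one non-negative weight per expert; they are normalised here.
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = {}
    for dist, weight in zip(distributions, weights):
        for token, prob in dist.items():
            combined[token] = combined.get(token, 0.0) + (weight / total) * prob
    return combined


# Three toy experts voting on the next token (hypothetical numbers).
experts = [
    {"4": 0.6, "5": 0.4},
    {"4": 0.3, "5": 0.7},
    {"4": 0.8, "5": 0.2},
]
mixture = combine_token_distributions(experts, weights=[1.0, 1.0, 2.0])
prediction = max(mixture, key=mixture.get)
```

Combining at the token level, rather than voting on whole answers, lets a stronger expert override weaker ones at individual decoding steps while the mixture remains a valid probability distribution.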
nan
Article 472
Title@2025-05-28 (3): Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning
Title: Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning | Lösung von Wissenskonflikten in der bereichsspezifischen Datenauswahl: Eine Fallstudie zur medizinischen Instruktions-Tuning | 解决特定领域数据选择方面的知识冲突:关于医疗指示调整的个案研究 2505.21958v1 |
Authors: Qihuang Zhong, Liang Ding, Fei Liao, Juhua Liu, Bo Du, Dacheng Tao
Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize the data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs’ pretrained knowledge and context knowledge of instruction data, which could damage LLMs’ prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (namely KDS) framework to select the domain-specific instruction-tuning data that meets LLMs’ actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs’ abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically prove that KDS surpasses the other baselines and brings significant and consistent performance gains among all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.
Article 473
Title@2025-05-28 (3): VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Title: VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning für die Sprachverarbeitung | VQ-CTAP: 处理发言的跨模式精细序列代表性学习 2408.05758v2 |
Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called “Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)”, which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
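The 960-fold figure quoted above follows directly from the two rates in the abstract. The short check below is only worked arithmetic; the function names are illustrative and not taken from the paper.

```python
def rate_reduction(input_hz: float, frame_hz: float) -> float:
    """Ratio between the waveform sampling rate and the token frame rate."""
    return input_hz / frame_hz


def frame_duration_ms(frame_hz: float) -> float:
    """Duration of audio covered by one frame, in milliseconds."""
    return 1000.0 / frame_hz


# 24 kHz waveform compressed to 25 frames per second.
print(rate_reduction(24_000, 25))   # 960.0
print(frame_duration_ms(25))        # 40.0
```

In other words, each 25 Hz frame stands in for 960 waveform samples, or 40 ms of audio.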
Article 474
Title@2025-05-28 (3): Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation
Title: Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation | Testzeitskalierung mit wiederholter Probenahme verbessert die Mehrsprachigkeitsgenerierung | 具有重复抽样的测试时间缩放改进多语种文本的生成 2505.21941v1 |
Authors: Ashim Gupta, Vivek Srikumar
Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifiers for the task.
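Repeated sampling with a verifier is commonly implemented as best-of-n selection: draw several candidates and keep the one the verifier scores highest. The sketch below is a minimal illustration under that assumption. `sample` and `score` are hypothetical callables standing in for a generator and a perplexity- or reward-based verifier, and the toy stand-ins are not from the paper.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidates and return the one the verifier scores highest."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): a "model" that emits answers of random
# length and a "verifier" that simply prefers longer answers.
rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return "word " * rng.randint(1, 10)


def toy_score(prompt: str, candidate: str) -> float:
    return float(len(candidate))


best = best_of_n("Translate: hello", toy_sample, toy_score, n=8)
print(len(best.split()))
```

Swapping `toy_score` for a different scoring function is the only change needed to compare verifiers, which is the design choice the abstract highlights.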
Article 475
Title@2025-05-28 (3): RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering
Title: RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering | RISE: Grundlegende Verbesserung durch iterative Selbst-Exploration in der Multi-Hop-Fragebeantwortung | RISE: 多呼问答问答中通过迭代自我探索提高合理性 2505.21940v1 |
Authors: Bolei He, Xinran He, Mengke Chen, Xianwei Xue, Ying Zhu, Zhenhua Ling
Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks, such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, faces challenges in effectively filtering noisy data and retrieving all necessary evidence, thereby limiting its effectiveness in addressing MHQA challenges. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models’ reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model’s capability to integrate evidence, maintain logical consistency, and enhance performance in MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.
Article 476
Title@2025-05-28 (3): EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Title: EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | EduBench: Ein umfassender Benchmarking-Datensatz zur Bewertung großer Sprachmodelle in unterschiedlichen Bildungsszenarien | EduBonnch:评估不同教育情景中大语言模式的综合基准数据集 2505.16160v3 |
Authors: Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.
Article 477
Title@2025-05-28 (3): Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages
Title: Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages | Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages | 印度语文化上可调适的可调适文化语言专题翻译 2505.21937v1 |
Authors: Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik
Translating multi-word expressions (MWEs) and idioms requires a deep understanding of the cultural nuances of both the source and target languages. This challenge is further amplified by the one-to-many nature of idiomatic translations, where a single source idiom can have multiple target-language equivalents depending on cultural references and contextual variations. Traditional static knowledge graphs (KGs) and prompt-based approaches struggle to capture these complex relationships, often leading to suboptimal translations. To address this, we propose IdiomCE, an adaptive graph neural network (GNN) based methodology that learns intricate mappings between idiomatic expressions, effectively generalizing to both seen and unseen nodes during training. Our proposed method enhances translation quality even in resource-constrained settings, facilitating improved idiomatic translation in smaller models. We evaluate our approach on multiple idiomatic translation datasets using reference-less metrics, demonstrating significant improvements in translating idioms from English to various Indian languages.
Article 478
Title@2025-05-28 (3): RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Title: RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | RedTeamCUA: Realistisches Adversarial Testen von Computer-Use-Agenten in hybriden Web-OS-Umgebungen | Red TeamCUA:对混合网络-OS环境的计算机使用代理器进行现实的反反向测试 2505.21936v1 |
Authors: Zeyi Liao, Jaylen Jones, Linxi Jiang, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although they fail to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
Article 479
Title@2025-05-28 (3): Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets
Title: Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets | Effizientes Ensemble für die Feinabstimmung von Sprachmodellen auf mehreren Datensätzen | 多个数据集微调语言模型高效组合组合 2505.21930v1 |
Authors: Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang
This paper develops an ensemble method for fine-tuning a language model on multiple datasets. Existing methods, such as quantized LoRA (QLoRA), are efficient when adapting to a single dataset. When training on multiple datasets of different tasks, a common setup in practice, it remains unclear how to design an efficient adaptation for fine-tuning language models. We propose to use an ensemble of multiple smaller adapters instead of a single adapter per task. We design an efficient algorithm that partitions $n$ datasets into $m$ groups, where $m$ is typically much smaller than $n$ in practice, and trains one adapter for each group before taking a weighted combination to form the ensemble. The algorithm leverages a first-order approximation property of low-rank adaptation to quickly obtain the fine-tuning performances of dataset combinations since methods like LoRA stay close to the base model. Hence, we use the gradients of the base model to estimate its behavior during fine-tuning. Empirically, this approximation holds with less than $1\%$ error on models with up to $34$ billion parameters, leading to an estimation of true fine-tuning performances under $5\%$ error while speeding up computation compared to base fine-tuning by $105$ times. When applied to fine-tune Llama and GPT models on ten text classification tasks, our approach provides up to $10\%$ higher average test accuracy over QLoRA, with only $9\%$ more FLOPs. On a Llama model with $34$ billion parameters, an ensemble of QLoRA increases test accuracy by $3\%$ compared to QLoRA, with only $8\%$ more FLOPs.
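The first-order approximation referred to above can be read as a standard Taylor expansion around the base model. The notation here is assumed for illustration and is not taken from the paper: $f_W(x)$ is the model output with weights $W$, $W_0$ are the base weights, and $W$ are the fine-tuned weights.

$$
f_{W}(x) \approx f_{W_0}(x) + \big\langle \nabla_W f_{W_0}(x),\; W - W_0 \big\rangle
$$

The approximation is accurate when $\lVert W - W_0 \rVert$ is small, which is the regime the abstract describes for LoRA-style methods. Under this reading, the gradient $\nabla_W f_{W_0}(x)$ is computed once on the base model and reused to estimate behavior across many dataset combinations.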
Article 480
Title@2025-05-28 (3): Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems
Title: Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems | Personalitätsbewusste Studentensimulation für gesprächsorientierte intelligente Tutoring-Systeme | 具有个性意识的学生模拟交流智能教学系统的学生模拟 2404.06762v2 |
Authors: Zhengyuan Liu, Stella Xin Yin, Geyu Lin, Nancy F. Chen
Intelligent Tutoring Systems (ITSs) can provide personalized and self-paced learning experiences. The emergence of large language models (LLMs) further enables better human-machine interaction, and facilitates the development of conversational ITSs in various disciplines such as math and language learning. In dialogic teaching, recognizing and adapting to individual characteristics can significantly enhance student engagement and learning efficiency. However, characterizing and simulating student personas remains challenging in training and evaluating conversational ITSs. In this work, we propose a framework to construct profiles of different student groups by refining and integrating both cognitive and noncognitive aspects, and leverage LLMs for personality-aware student simulation in a language learning scenario. We further enhance the framework with multi-aspect validation, and conduct extensive analysis from both teacher and student perspectives. Our experimental results show that state-of-the-art LLMs can produce diverse student responses according to the given language ability and personality traits, and trigger teachers’ adaptive scaffolding strategies.
Article 481
Title@2025-05-28 (3): SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Title: SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior | SafetyAnalyst: Interpretierbare, transparente und Steerable Safety Moderation für KI-Verhalten | 安全分析器:AI行为行为解释性、透明性和可坚固性 2410.16665v3 |
Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community’s values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured “harm-benefit tree,” which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
Article 482
Title@2025-05-28 (3): Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning
Title: Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning | Beyond Completion: Ein Grundlagenmodell für allgemeine Wissensgraphen-Reasoning | 完成后完成:一般知识图理据基础模型 2505.21926v1 |
Authors: Yin Hua, Zhiqiang Liu, Mingyang Chen, Zheng Fang, Chi Man Wong, Lingxiao Li, Chi Man Vong, Huajun Chen, Wen Zhang
In natural language processing (NLP) and computer vision (CV), the successful application of foundation models across diverse tasks has demonstrated their remarkable potential. However, despite the rich structural and textual information embedded in knowledge graphs (KGs), existing research on foundation models for KGs has primarily focused on their structural aspects, with most efforts restricted to in-KG tasks (e.g., knowledge graph completion, KGC). This limitation has hindered progress in addressing more challenging out-of-KG tasks. In this paper, we introduce MERRY, a foundation model for general knowledge graph reasoning, and investigate its performance across two task categories: in-KG reasoning tasks (e.g., KGC) and out-of-KG tasks (e.g., KG question answering, KGQA). We utilize not only the structural information but also the textual information in KGs. Specifically, we propose a multi-perspective Conditional Message Passing (CMP) encoding architecture to bridge the gap between textual and structural modalities, enabling their seamless integration. Additionally, we introduce a dynamic residual fusion module to selectively retain relevant textual information and a flexible edge scoring mechanism to adapt to diverse downstream tasks. Comprehensive evaluations on 28 datasets demonstrate that MERRY outperforms existing baselines in most scenarios, showcasing strong reasoning capabilities within KGs and excellent generalization to out-of-KG tasks such as KGQA.
Article 483
Title@2025-05-28 (3): Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy
Title: Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy | Modellierung und Optimierung von Benutzereinstellungen in AI-Copiloten: Eine umfassende Umfrage und Taxonomie | AI中模拟和优化用户首选模式:全面调查和分类 2505.21907v1 |
Authors: Saleh Afzoon, Zahra Jahanandish, Phuong Thao Huynh, Amin Beheshti, Usman Naseem
AI copilots, context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.
Article 484
Title@2025-05-28 (3): ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models | ALPS: Aufmerksamkeit Lokalisierung und Pruning-Strategie zur effizienten Ausrichtung großer Sprachmodelle | ALPS: 高效统一大语言模式的注意地方化和审慎战略 2505.18799v2 |
Authors: Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment. The code is available at https://github.com/VoiceBeer/ALPS.
Article 485
Title@2025-05-28 (3): Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development
Title: Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development | Co-Saving: Ressourcenschonende Multi-Agenten-Kollaboration für Software-Entwicklung | 共同节省:为开发软件进行有意识的资源、多机构协作 2505.21898v1 |
Authors: Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system – Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of “shortcuts” – instructional transitions learned from historically successful trajectories – which allow the system to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments on software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.
Article 486
Title@2025-05-28 (3): Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
Title: Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs | Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs | 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v2 |
Authors: Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new “search-and-refine-during-think” paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
Article 487
Title@2025-05-28 (3): Language-Specific Latent Process Hinders Cross-Lingual Performance
Title: Language-Specific Latent Process Hinders Cross-Lingual Performance | Sprachspezifische latente Prozessverhinderer Cross-Lingual Performance | 语言特定边端进程 2505.13141v2 |
Authors: Zheng Wei Lim, Alham Fikri Aji, Trevor Cohn
Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to the others, we apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning questions. We find LLMs predict inconsistently and are less accurate because they rely on subspaces of individual languages, rather than working in a shared semantic space. While larger models are more multilingual, we show their hidden states are more likely to dissociate from the shared representation compared to smaller models, but are nevertheless more capable of retrieving knowledge embedded across different languages. Finally, we demonstrate that knowledge sharing can be modulated by steering the models’ latent processing towards the shared semantic space. We find reinforcing utilization of the shared space improves the models’ multilingual reasoning performance, as a result of more knowledge transfer from, and better output consistency with English.
Article 488
Title@2025-05-28 (3): Self-Taught Agentic Long Context Understanding
Title: Self-Taught Agentic Long Context Understanding | Selbstlernendes Agentisches Langes Kontext-Verständnis | 自我教学 自我研究 长期背景了解 2502.15920v2 |
Authors: Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum
Answering complex, long-context questions remains a major challenge for large language models (LLMs) as it requires effective question clarifications and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to enhance an LLM’s understanding of such queries by integrating targeted self-clarification with contextual grounding within an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), where models refine their understanding through self-generated clarification questions and corresponding contextual groundings. By scaling inference as a tree search where each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search process to training, we leverage the preference pairs for each step obtained by the CoC workflow and perform two-stage model finetuning: (1) supervised finetuning to learn effective decomposition strategies, and (2) direct preference optimization to enhance reasoning quality. This enables AgenticLU models to generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments across seven long-context tasks demonstrate that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while sustaining consistent performance as context length grows.
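To give a sense of the search cost that the finetuning stage amortizes, the snippet below counts the CoC steps in a fully expanded tree. It assumes every node is expanded with the full branching factor and that the root (the original question) is not counted. The abstract does not state whether the search is exhaustive, so this is an upper bound under those assumptions.

```python
def max_tree_nodes(branching: int, depth: int) -> int:
    """Number of non-root nodes in a full tree with the given branching factor and depth."""
    return sum(branching ** level for level in range(1, depth + 1))


# Branching factor 8, depth up to 3: 8 + 64 + 512 candidate CoC steps.
print(max_tree_nodes(8, 3))  # 584
```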
Article 489
Title@2025-05-28 (3): Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline
Title: Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline | Pfade, die nicht genommen werden: Verstehen und Mending the Multilingual Factual Recall Pipeline | 未选择的路径:理解和终止多语种事实回回回回管道 2505.20546v2 |
Authors: Meng Lu, Ruochen Zhang, Carsten Eickhoff, Ellie Pavlick
Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, with significantly better performance in factual recall tasks in English than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ, which involves using the English-centric factual recall mechanism to process multilingual queries and then translating English answers back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of languages and datasets, to redirect the model toward better internal paths for higher factual consistency. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language. Our findings demonstrate how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.
Article 490
Title@2025-05-28 (3): Large Vocabulary Size Improves Large Language Models
Title: Large Vocabulary Size Improves Large Language Models | Große Vokabelgröße verbessert große Sprachmodelle | 大型词汇量改进大语言模式 2406.16508v2 |
Authors: Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato
This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.
Article 491
Title@2025-05-28 (3): Text Generation Beyond Discrete Token Sampling
Title: Text Generation Beyond Discrete Token Sampling | Textgenerierung jenseits diskreter Token-Probenahme | 文本生成超出分解调制当量抽样 2505.14827v2 |
Authors: Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution’s rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
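The idea of feeding a blend of the sampled token and the discarded distribution back into the model can be illustrated with a simple convex mixture. This is a sketch only: the fixed mixing weight `beta` and both function names are assumptions made for illustration, and the paper's Bayesian estimator is not reproduced here.

```python
from typing import List


def mixture_weights(probs: List[float], sampled: int, beta: float = 0.5) -> List[float]:
    """Blend the next-token distribution with the one-hot vector of the sampled token.

    beta = 1.0 recovers the conventional one-hot input;
    beta = 0.0 uses the distribution alone. (beta is a hypothetical knob.)
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [(1.0 - beta) * p + (beta if i == sampled else 0.0)
            for i, p in enumerate(probs)]


def mixed_input(embeddings: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of token embeddings, used in place of a single embedding lookup."""
    dim = len(embeddings[0])
    return [sum(w * row[d] for w, row in zip(weights, embeddings))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
probs = [0.7, 0.2, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = mixture_weights(probs, sampled=1, beta=0.5)
print(weights)
print(mixed_input(embeddings, weights))
```

Because the weights still sum to one, the blended input stays inside the convex hull of the token embeddings, so no retraining of the embedding table is implied by the sketch.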
Article 492
Title@2025-05-28 (3): Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Title: Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation | Einschließlich LLMs für großräumige Urban Complex Mobility Simulation | 大型城市综合流动模拟项目LLMs 2505.21880v1 |
Authors: Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin
This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.
Article 493
Title@2025-05-28 (3): Evaluating the Retrieval Robustness of Large Language Models
Title: Evaluating the Retrieval Robustness of Large Language Models | Bewertung der Retrieval Robustheit großer Sprachmodelle | 评估大语言模型的检索能力 2505.21870v1 |
Authors: Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang
Retrieval-augmented generation (RAG) generally enhances large language models’ (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model’s limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
Article 494
Title@2025-05-28 (3): Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering
Title: Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering | Behebung von Problemen mit der verlorenen Retrieval-Frage bei der Retrieval Augmented Multi-Hop-Fragebeantwortung | 减轻在检索增加的多层次问题解答中丢失的在追索中的问题 2502.14245v2 |
Authors: Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu
In this paper, we identify a critical problem, “lost-in-retrieval”, in retrieval-augmented multi-hop question answering (QA): the key entities are missed in LLMs’ sub-question decomposition. “Lost-in-retrieval” significantly degrades the retrieval performance, which disrupts the reasoning chain and leads to incorrect answers. To resolve this problem, we propose a progressive retrieval and rewriting method, namely ChainRAG, which sequentially handles each sub-question by completing missing key entities and retrieving relevant sentences from a sentence graph for answer generation. Each step in our retrieval and rewriting process builds upon the previous one, creating a seamless chain that leads to accurate retrieval and answers. Finally, all retrieved sentences and sub-question answers are integrated to generate a comprehensive answer to the original question. We evaluate ChainRAG on three multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG consistently outperforms baselines in both effectiveness and efficiency.
Article 495
Title@2025-05-28 (3): RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph
Title: RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph | RSCF: Relation-Semantik Konsequenter Filter für Entity-Einbettung von Wissensgrafik | RSCF: 用于实体嵌入知识图的 关系-语义一致性过滤器 2505.20813v2 |
Authors: Junsik Kim, Jinwook Park, Kangil Kim
In knowledge graph embedding, leveraging relation-specific entity transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity embeddings for similar relations. Second, a generalized plug-in approach such as SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF). Its entity transformation has three features for enhancing semantic consistency: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.
Article 496
Title@2025-05-28 (3): Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs
Title: Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs | Abstand zwischen relevanten Informationsstücken verursacht Bias im Langtext LLMs | 有关信息片件在长文本LLM中造成偏见的距离 2410.14641v3 |
Authors: Runchu Tian, Yanghao Li, Yuepeng Fu, Siyang Deng, Qinyu Luo, Cheng Qian, Shuo Wang, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Huadong Wang, Xiaojiang Liu
Positional bias in large language models (LLMs) hinders their ability to effectively process long inputs. A prominent example is the “lost in the middle” phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. Thorough experiments are conducted with five commercial and six open-source models. These experiments reveal that while most current models are robust against the “lost in the middle” issue, there exist significant biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases to advance LLMs’ capabilities.
Article 497
Title@2025-05-28 (3): Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries
Title: Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries | Prinzipierte Inhaltsauswahl zur Generierung unterschiedlicher und personalisierter Multi-Document-Zusammenfassungen | ” 创造多样化和个性化多文件摘要 “ 原则性内容选择 2505.21859v1 |
Authors: Vishakh Padmakumar, Zichao Wang, David Arbour, Jennifer Healey
While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the “lost in the middle” phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps: (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPP) to select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps, for extraction and rewriting, with principled techniques, for content selection, we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.
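The selection step (2) can be illustrated with a greedy approximation to DPP MAP inference. This is a minimal sketch under stated assumptions, not the paper's implementation: the kernel is built as relevance × similarity × relevance, selection greedily maximizes the determinant of the chosen submatrix, and the function names and toy similarity values are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def build_kernel(similarity, relevance):
    """L[i][j] = relevance[i] * similarity[i][j] * relevance[j]."""
    n = len(similarity)
    return [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
            for i in range(n)]


def greedy_dpp_select(kernel, k):
    """Greedily add the item that maximizes det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            subset = selected + [i]
            sub = [[kernel[r][c] for c in subset] for r in subset]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:  # no remaining item adds volume
            break
        selected.append(best_item)
    return selected


# Key points 0 and 1 are near-duplicates; key point 2 is distinct.
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
kernel = build_kernel(similarity, relevance=[1.0, 1.0, 1.0])
print(greedy_dpp_select(kernel, k=2))  # prints [0, 2]
```

Raising an item's relevance score scales its row and column of the kernel, which is one simple way to bias selection toward a user intent while the similarity term still penalizes redundancy.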
Article 498
Title@2025-05-28 (3): Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures
Title: Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures | Mini-Batch Coresets für speichereffiziente Sprachmodellschulungen auf Datenmischungen | 记忆效率语言数据混合模型培训微型批量核心数据集 2407.19580v4 |
Authors: Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, Baharan Mirzasoleiman
Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. However, this approach becomes infeasible and ineffective for LLMs, due to the highly imbalanced mixture of sources in language data, use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing Coresets for Training LLMs (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources with high probability, and thus including all examples of the small sources in the mini-batch coresets is crucial for optimal performance. Second, we normalize the gradients by their historical exponential to find mini-batch coresets for training with Adam. Finally, we leverage zeroth-order methods to find a smooth gradient of the last V-projection matrix and sparsify it to keep the dimensions with the largest normalized gradient magnitude. We apply CoLM to fine-tuning Phi-2, Phi-3, Zephyr, and Llama-3 models with LoRA on the MathInstruct and SuperGLUE benchmarks. Remarkably, CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches. Moreover, CoLM seamlessly integrates with existing memory-efficient training methods like LoRA, further reducing the memory requirements of training LLMs. Our code is available at https://github.com/BigML-CS-UCLA/CoLM.
Article 499
Title@2025-05-28 (3): CULEMO: Cultural Lenses on Emotion – Benchmarking LLMs for Cross-Cultural Emotion Understanding
Title: CULEMO: Cultural Lenses on Emotion – Benchmarking LLMs for Cross-Cultural Emotion Understanding | CULEMO: Kulturelle Objektive zur Emotion – Benchmarking LLMs für Cross-Cultural Emotion Understanding | CULEMO:情感文化引文 – – 衡量跨文化情感理解LMLL 2503.10688v3 |
Authors: Tadesse Destaw Belay, Ahmed Haj Ahmed, Alvin Grissom II, Iqra Ameer, Grigori Sidorov, Olga Kolesnikova, Seid Muhie Yimam
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLM performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are publicly available.
Article 500
Title@2025-05-28 (3): Natural Language Reinforcement Learning
Title: Natural Language Reinforcement Learning | Natürliche Sprache Stärkung Lernen | 自然语言强化学习 2411.14251v3 |
Authors: Xidong Feng, Bo Liu, Yan Song, Haotian Fu, Ziyu Wan, Girish A. Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, Jun Wang
Artificial intelligence is progressing towards the “Era of Experience,” where agents are expected to learn from continuous, grounded interaction. We argue that traditional Reinforcement Learning (RL), which typically represents value as a scalar, can restrict an agent’s deep understanding of environments and hinder the active, deliberative learning crucial for navigating this new paradigm. To address the issue, we introduce Natural Language Reinforcement Learning (NLRL), a framework that extends RL principles into natural language counterparts. Central to NLRL is the Language Value Function (LVF), which redefines value as an interpretable linguistic narrative articulating the rationale behind an evaluation. NLRL further extends this concept to core RL components, including policy, the Bellman equation, and policy iteration. Leveraging recent advancements in Large Language Models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value training through unsupervised environment interactions. Experiments over 4 multi-step agentic tasks demonstrate NLRL’s effectiveness, efficiency, and its potential to foster deeper understanding and more active learning strategies.
Article 501
Title@2025-05-27 (2): Constrained Discrete Diffusion
Title: Constrained Discrete Diffusion | Beschränkte diskrete Diffusion | 限制的分解扩散 2503.09790v2 |
Authors: Michael Cardei, Jacob K Christopher, Thomas Hartvigsen, Brian R. Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Discrete diffusion models are a class of generative models that construct sequences by progressively denoising samples from a categorical noise distribution. Beyond their rapidly growing ability to generate coherent natural language, these models present a new and important opportunity to enforce sequence-level constraints, a capability that current autoregressive models cannot natively provide. This paper capitalizes on this opportunity by introducing Constrained Discrete Diffusion (CDD), a novel integration of differentiable constraint optimization within the diffusion process to ensure adherence to constraints, logic rules, or safety requirements for generated sequences. Unlike conventional text generators that often rely on post-hoc filtering or model retraining for controllable generation, CDD directly imposes constraints on the discrete diffusion sampling process, resulting in a training-free and effective approach. Experiments in toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion demonstrate that CDD achieves zero constraint violations in a diverse array of tasks while preserving fluency, novelty, and coherence, and that it outperforms autoregressive and existing discrete diffusion approaches.
Article 502
Title@2025-05-27 (2): From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Title: From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization | Von EduVisBench zu EduVisAgent: Ein Benchmark- und Multi-Agent-Framework für eine sinnvolle pädagogische Visualisierung | 从Edu Visb bench到Edu Visbench-Edu VisbearAgender:有理性的可视化教育基准和多机构框架 2505.16832v2 |
Authors: Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.
Article 503
Title@2025-05-27 (2): Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
Title: Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones | Lassen Sie mich nachdenken! Eine lange Kette des Denkens kann es wert sein, auf jeden Fall viele kurze Menschen | 让我想想吧!一个长期的思考链 可能值得一试 有很多短一个 2505.21825v1 |
Authors: Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera
Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.
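To make the two scaling axes concrete, the sketch below pairs a breadth-first search for graph connectivity (the kind of task the abstract mentions) with majority voting over several short answers (the parallel-scaling aggregator). It is only an illustration of the quantities involved; the graph, the sampled answers, and the function names are hypothetical and do not reproduce the paper's graph distributions or its theoretical construction.

```python
from collections import Counter, deque


def is_connected(edges, source, target):
    """Ground-truth connectivity in an undirected graph via breadth-first search."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def majority_vote(answers):
    """Parallel scaling: return the most frequent answer among sampled chains."""
    return Counter(answers).most_common(1)[0][0]


edges = [(0, 1), (1, 2), (3, 4)]
truth = is_connected(edges, 0, 2)

# Hypothetical answers from five independent short chains of thought.
sampled_answers = [True, False, True, True, False]
voted = majority_vote(sampled_answers)
```

Majority voting only helps when each short chain is right more often than wrong; a single long chain can instead carry out the full search step by step, which is the intuition behind the sequential advantage the abstract describes.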
Article 504
Title@2025-05-27 (2): Understanding Synthetic Context Extension via Retrieval Heads
Title: Understanding Synthetic Context Extension via Retrieval Heads | Synthetische Kontexterweiterung über Rücklaufköpfe verstehen | 通过回收头目获取理解合成背景扩展 2410.22316v4 |
Authors: Xinyu Zhao, Fangcong Yin, Greg Durrett
Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of “needle” concepts to be retrieved and diversity of the surrounding “haystack” context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data have high overlap with retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
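The overlap between retrieval heads can be quantified with a simple set recall. The sketch below assumes heads are identified by `(layer, head)` pairs and that recall is the fraction of real-data retrieval heads also found after synthetic-data training; the paper's exact definition (for example, any weighting by retrieval score) may differ, and the head indices are made up.

```python
def head_recall(synthetic_heads, real_heads):
    """Fraction of real-data retrieval heads recovered by synthetic-data training."""
    real = set(real_heads)
    if not real:
        return 0.0
    return len(real & set(synthetic_heads)) / len(real)


# Hypothetical (layer, head) indices.
real_heads = [(1, 2), (3, 4), (5, 6), (7, 0)]
synthetic_heads = [(1, 2), (3, 4), (9, 9)]
print(head_recall(synthetic_heads, real_heads))  # prints 0.5
```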
Article 505
Title@2025-05-27 (2): Representative Language Generation
Title: Representative Language Generation | Repräsentative Sprachgenerierung | 代 代 代 语 语 代 语 代 语 代 2505.21819v1 |
Authors: Charlotte Peale, Vinod Raman, Omer Reingold
We introduce “representative generation,” extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the “group closure dimension” as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.’s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
Article 506
Title@2025-05-27 (2): Revisiting Common Assumptions about Arabic Dialects in NLP
Title: Revisiting Common Assumptions about Arabic Dialects in NLP | Häufige Annahmen über arabische Dialekte in NLP erneut besuchen | 重新审视全国语言规划中阿拉伯语方言的通用假设 2505.21816v1 |
Authors: Amr Keleg, Sharon Goldwater, Walid Magdy
Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., “Arabic dialects can be grouped into distinguishable regional dialects”) and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.
Article 507
Title@2025-05-27 (2): Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking
Title: Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking | Scientific Paper Retrieval mit LLM-geführtem semantisch-basierendem Ranking | 具有LLM-Guided语义学排名的科学论文检索 2505.21815v1 |
Authors: Yunyi Zhang, Ruozhen Yang, Siqi Jiao, SeongKu Kang, Jiawei Han
Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query’s information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.
Article 508
Title@2025-05-27 (2): ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Title: ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails | ThinkGuard: Besonnenes langsames Denken führt zu voreiligen Wärtern | 思考指南:慎重考虑的慢思考引领谨慎警卫车 2502.13458v2 |
Authors: Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail’s cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
Article 509
Title@2025-05-27 (2): From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Title: From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs | Von der Anfahrt zu den Cones: Erforschung multidimensionaler Darstellungen von Propositional Facts in LLMs | ” 从方向到锥体:探索液晶中各种潜在事实的多层面代表 “ 2505.21800v1 |
Authors: Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O’Brien, Kevin Zhu, Vasu Sharma
Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model’s internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
Article 510
Title@2025-05-27 (2): Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?
Title: Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task? | Desecting the Ullman Variations with a SCALPEL: Warum scheitern LLMs bei Trivial Alterations to the False Belief Task? | 将乌尔曼变异与SCALPEL解剖:为什么LLMs在假信仰任务三维改造中失败? 2406.14737v2 |
Authors: Zhiqiang Pi, Annapurna Vadaparty, Benjamin K. Bergen, Cameron R. Jones
Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task, others have shown that their performance is not robust against trivial alterations to stimuli. In this paper, we introduce SCALPEL – a technique to incrementally modify stimuli to test different specific hypotheses about why LLMs fail – and apply this method to the “transparent-access” modification of the unexpected contents task. Our results suggest that LLMs often do poorly because they fail to make essential common-sense inferences, such as that seeing a transparent container implies recognizing its contents. We conclude that while modern LLMs go beyond mere pattern matching, they still fall short of robust human-like ToM. We argue that SCALPEL can help cognitive scientists examine LLMs’ capabilities in finer detail and provide insight into alternative mechanisms by which tasks that are used to assess human cognition might be completed.
Article 511
Title@2025-05-27 (2): Controllable Context Sensitivity and the Knob Behind It
Title: Controllable Context Sensitivity and the Knob Behind It | Kontrollierbarer Kontext Empfindlichkeit und der Knob dahinter | 控制环境的感应度及其背后的Knob 2411.07404v3 |
Authors: Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell
When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables the model to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (Paris is in England) and a question (Where is Paris?); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either France or England). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model’s performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.
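A 1-D subspace "knob" can be pictured as a single direction in a layer's hidden space: reading the knob is a projection, and turning it means overwriting the component along that direction. The sketch below is a generic illustration of that idea, not the paper's procedure; the direction vector, the target values, and the function names are all hypothetical.

```python
import math


def unit_vector(direction):
    """Scale a direction to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    return [x / norm for x in direction]


def read_knob(hidden, direction):
    """Coordinate of the hidden state along the knob direction."""
    unit = unit_vector(direction)
    return sum(h * u for h, u in zip(hidden, unit))


def set_knob(hidden, direction, value):
    """Overwrite the component along the knob direction, leaving the rest intact."""
    unit = unit_vector(direction)
    current = sum(h * u for h, u in zip(hidden, unit))
    return [h + (value - current) * u for h, u in zip(hidden, unit)]


hidden = [1.0, 2.0, 3.0]
direction = [0.0, 0.0, 2.0]          # hypothetical knob direction
steered = set_knob(hidden, direction, -5.0)
print(steered)                        # prints [1.0, 2.0, -5.0]
```

In a real model the steered vector would be written back into the residual stream at the chosen layer; which sign corresponds to "follow context" versus "follow prior knowledge" would have to be determined empirically.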
Article 512
Title@2025-05-27 (2): Wanda++: Pruning Large Language Models via Regional Gradients
Title: Wanda++: Pruning Large Language Models via Regional Gradients | Wanda++: Beschneiden großer Sprachmodelle über regionale Gradienten | Wanda+++:通过区域渐变来保护大语言模式 2503.04992v3 |
Authors: Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar
Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Moreover, despite updating weights with regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, further reducing perplexity to a great extent when combined with LoRA. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
Article 513
Title@2025-05-27 (2): VeriTrail: Closed-Domain Hallucination Detection with Traceability
Title: VeriTrail: Closed-Domain Hallucination Detection with Traceability | VeriTrail: Closed-Domain Halluzination Erkennung mit Rückverfolgbarkeit | VeriTrail: 带可追踪性闭路致幻觉探测 2505.21786v1 |
Authors: Dasha Metropolitansky, Jonathan Larson
Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as “closed-domain hallucination.” This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs’ faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.
Article 514
Title@2025-05-27 (2): Born a Transformer – Always a Transformer?
Title: Born a Transformer – Always a Transformer? | Geboren ein Transformer - immer ein Transformer? | 天生的变形人 - - 总是变形人? 2505.21785v1 |
Authors: Yana Veitsman, Mayank Jobanputra, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024]. We use the recently proposed C-RASP framework for studying length generalization [Huang et al., 2025b] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained Transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain Transformer capabilities, but does not overcome fundamental length-generalization limits.
Article 515
Title@2025-05-27 (2): Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Title: Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation | Auf dem Weg zur Sicherheitsveranlagung in LLMs: KI-agentische Beratung für politisch eingebettete CoT-Datenerstellung | 走向LLM女士中的安全理由:为制定政策的COT数据编制进行AI-Agentic 考虑 2505.21784v1 |
Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
Article 516
Title@2025-05-27 (2): Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
Title: Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models | Wasserzeichen im Sand: Unmöglichkeit der starken Wasserzeichen für generative Modelle | 沙沙中的水印:在生成模型中使用强水标志的可能性 2311.04378v5 |
Authors: Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
Watermarking generative models consists of planting a statistical signal (watermark) in a model’s output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a “quality oracle” that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a “perturbation oracle” which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
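The two-oracle attack described above can be pictured as a random walk that never leaves the set of high-quality outputs. The sketch below is a toy rendering of that idea only: `random_walk_attack`, the synonym table, and both stand-in oracles are hypothetical, and a real attack would need far stronger perturbation and quality models than these.

```python
import random

def random_walk_attack(response, perturb, is_high_quality, steps, rng):
    """Random walk over high-quality outputs.

    Each step proposes a perturbed candidate and accepts it only if the
    quality oracle approves, so the walk stays inside the high-quality set.
    """
    current = response
    for _ in range(steps):
        candidate = perturb(current, rng)
        if is_high_quality(candidate):
            current = candidate
    return current

# Toy stand-ins for the two oracles (purely illustrative).
SYNONYMS = {
    "big": ["large", "huge"], "large": ["big", "huge"], "huge": ["big", "large"],
    "fast": ["quick", "rapid"], "quick": ["fast", "rapid"], "rapid": ["fast", "quick"],
}

def toy_perturb(tokens, rng):
    """Swap one randomly chosen token for a synonym, if it has one."""
    i = rng.randrange(len(tokens))
    options = SYNONYMS.get(tokens[i])
    if not options:
        return tokens
    return tokens[:i] + [rng.choice(options)] + tokens[i + 1:]

def toy_quality(tokens):
    """Accept any candidate that avoids the word 'huge' (an arbitrary toy rule)."""
    return "huge" not in tokens

original = ["a", "big", "fast", "car"]
attacked = random_walk_attack(
    original, toy_perturb, toy_quality, steps=50, rng=random.Random(0)
)
```

The attacker in this sketch never inspects a watermark key or detector; it only needs the two oracles, which mirrors the black-box setting the abstract describes.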
Article 517
Title@2025-05-27 (2): Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Title: Layers at Similar Depths Generate Similar Activations Across LLM Architectures | Ebenen in ähnlichen Tiefen erzeugen ähnliche Aktivierungen über LLM-Architekturen | 类似深度的图层在LLM 结构中生成类似活动 2504.08775v2 |
Authors: Christopher Wolfram, Aaron Schein
How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not “obvious” either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.
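One way to make "shared nearest neighbor relationships" concrete is to compare, for the same inputs, the neighbour sets induced by two different activation spaces. The sketch below uses the mean Jaccard overlap of k-nearest-neighbour sets as an assumed stand-in; the abstract does not say which similarity measure the authors use, and the tiny point sets are invented for illustration.

```python
def knn_sets(points, k):
    """For each point, the index set of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, p in enumerate(points):
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbours.append({j for _, j in distances[:k]})
    return neighbours

def mean_knn_overlap(points_a, points_b, k):
    """Average Jaccard overlap between the neighbour sets of two spaces."""
    sets_a, sets_b = knn_sets(points_a, k), knn_sets(points_b, k)
    overlaps = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(overlaps) / len(overlaps)

# Row i of each list is a hypothetical activation for the same input i.
layer_a = [[0, 0], [1, 0], [5, 0], [6, 0]]
layer_b = [[0, 0, 0], [0, 2, 0], [0, 10, 0], [0, 12, 0]]  # other width, same geometry
layer_c = [[0, 0], [5, 0], [1, 0], [6, 0]]                # different geometry

same_geometry = mean_knn_overlap(layer_a, layer_b, k=1)
different_geometry = mean_knn_overlap(layer_a, layer_c, k=1)
```

Because only neighbour identities are compared, the two spaces may have different dimensionality and scale, which is what makes this kind of comparison usable across architectures.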
Article 518
Title@2025-05-27 (2): GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task
Title: GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task | GMU-Systeme für die IWSLT 2025 Sprachübersetzung mit geringer Ressource geteilte Aufgabe | GMU 2025年IWSLT 低资源语音翻译共享任务 2505.21781v1 |
Authors: Chutong Meng, Antonios Anastasopoulos
This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.
Article 519
Title@2025-05-27 (2): When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction | Wann geben LLMs ihre Fehler zu? Sie verstehen die Rolle des Modellglaubens bei der Retraktion | LLM女士何时承认其错误? 2505.16170v2 |
Authors: Yuqing Yang, Robin Jia
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as “retraction” and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models’ internal belief: models fail to retract wrong answers that they “believe” to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on https://github.com/ayyyq/llm-retraction.
Article 520
Title@2025-05-27 (2): Calibrating LLM Confidence by Probing Perturbed Representation Stability
Title: Calibrating LLM Confidence by Probing Perturbed Representation Stability | Kalibrierung des LLM-Vertrauens durch Probing Perturbed Repräsentationsstabilität | 通过在有干扰的代表权方面确保稳定,以验证LLM信任度 2505.21772v1 |
Authors: Reza Khanmohammadi, Erfan Miahi, Mehrsa Mardikoraem, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi
Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model’s response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
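For readers unfamiliar with the two calibration metrics quoted above, the sketch below computes the Brier score and an equal-width-bin Expected Calibration Error from per-answer confidences and 0/1 correctness labels. The bin count and the example numbers are assumptions for illustration; the abstract does not state the binning scheme used in the paper.

```python
def brier_score(confidences, correct):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

# Made-up confidences and correctness labels for four answers.
confidences = [0.95, 0.95, 0.65, 0.65]
correct = [1, 0, 1, 1]
brier = brier_score(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

Lower is better for both metrics; a perfectly calibrated predictor has an ECE of zero.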
Article 521
Title@2025-05-27 (2): BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum
Title: BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum | VerhaltenSFT: Behavioral Token Conditioning für klinische Wirkstoffe über das Proaktivitätsspektrum hinweg | 行为SFT:横跨主动性频谱的临床药剂行为定性 2505.21757v1 |
Authors: Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal, Samir Tulebaev, Hae Won Park
Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs’ inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a better balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) than standard fine-tuning or explicitly instructed agents.
Article 522
Title@2025-05-27 (2): FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Title: FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | FRAMES-VQA: Benchmarking Fine-Tuning Robustheit über Multi-Modal Shifts in der visuellen Fragestellung | FRAMES-VQA:确定视觉问题解答中多模式变化的精确调整强度基准 2505.21755v1 |
Authors: Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
Article 523
Title@2025-05-27 (2): From prosthetic memory to prosthetic denial: Auditing whether large language models are prone to mass atrocity denialism
Title: From prosthetic memory to prosthetic denial: Auditing whether large language models are prone to mass atrocity denialism | Vom prothetischen Gedächtnis zur prothetischen Leugnung: Prüfung, ob große Sprachmodelle anfällig für Massenverleugnung sind | 从假肢记忆到否认假肢:审计大型语言模式是否容易发生大规模暴行否认行为 2505.21753v1 |
Authors: Roberto Ulloa, Eve M. Zucker, Daniel Bultmann, David J. Simon, Mykola Makhortykh
The proliferation of large language models (LLMs) can influence how historical narratives are disseminated and perceived. This study explores the implications of LLMs’ responses on the representation of mass atrocity memory, examining whether generative AI systems contribute to prosthetic memory, i.e., mediated experiences of historical events, or to what we term “prosthetic denial,” the AI-mediated erasure or distortion of atrocity memories. We argue that LLMs function as interfaces that can elicit prosthetic memories and, therefore, act as experiential sites for memory transmission, but also introduce risks of denialism, particularly when their outputs align with contested or revisionist narratives. To empirically assess these risks, we conducted a comparative audit of five LLMs (Claude, GPT, Llama, Mixtral, and Gemini) across four historical case studies: the Holodomor, the Holocaust, the Cambodian Genocide, and the genocide against the Tutsis in Rwanda. Each model was prompted with questions addressing common denialist claims in English and an alternative language relevant to each case (Ukrainian, German, Khmer, and French). Our findings reveal that while LLMs generally produce accurate responses for widely documented events like the Holocaust, significant inconsistencies and susceptibility to denialist framings are observed for more underrepresented cases like the Cambodian Genocide. The disparities highlight the influence of training data availability and the probabilistic nature of LLM responses on memory integrity. We conclude that while LLMs extend the concept of prosthetic memory, their unmoderated use risks reinforcing historical denialism, raising ethical concerns for (digital) memory preservation, and potentially challenging the advantageous role of technology associated with the original values of prosthetic memory.
Article 524
Title@2025-05-27 (2): Revisiting Bi-Linear State Transitions in Recurrent Neural Networks
Title: Revisiting Bi-Linear State Transitions in Recurrent Neural Networks | Bi-Lineare State Transitions in recurrenten neuralen Netzwerken erneut besuchen | 在经常性神经网络中重新审查双利那尔州过渡 2505.21749v1 |
Authors: M. Reza Ebrahimi, Roland Memisevic
The role of hidden units in recurrent neural networks is typically seen as modeling memory, with research focusing on enhancing information retention through gating mechanisms. A less explored perspective views hidden units as active participants in the computation performed by the network, rather than passive memory stores. In this work, we revisit bi-linear operations, which involve multiplicative interactions between hidden units and input embeddings. We demonstrate theoretically and empirically that they constitute a natural inductive bias for representing the evolution of hidden states in state tracking tasks. These are the simplest type of task that requires hidden units to actively contribute to the behavior of the network. We also show that bi-linear state updates form a natural hierarchy corresponding to state tracking tasks of increasing complexity, with popular linear recurrent networks such as Mamba residing at the lowest-complexity center of that hierarchy.
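To see why a multiplicative interaction between hidden state and input suits state tracking, consider the generic bi-linear form below, where each new hidden unit is a weighted sum of products of one hidden unit and one input unit. This is a minimal sketch under the assumption of one-hot states and inputs; the weight tensor `PARITY_W` is hand-built for a parity task and is not the authors' parameterization.

```python
def bilinear_step(weights, hidden, x):
    """One bi-linear update: new_h[i] = sum over j, k of W[i][j][k] * h[j] * x[k]."""
    return [
        sum(weights[i][j][k] * hidden[j] * x[k]
            for j in range(len(hidden)) for k in range(len(x)))
        for i in range(len(weights))
    ]

# Parity tracking with one-hot states (even, odd) and one-hot inputs (bit 0, bit 1).
# W[i][j][k] = 1 when reading input k in state j leads to state i.
PARITY_W = [
    [[1, 0], [0, 1]],  # next state "even": from even on bit 0, from odd on bit 1
    [[0, 1], [1, 0]],  # next state "odd":  from even on bit 1, from odd on bit 0
]

def parity(bits):
    """Track the parity of a bit sequence using only bi-linear updates."""
    hidden = [1, 0]  # start in the "even" state
    for bit in bits:
        x = [1, 0] if bit == 0 else [0, 1]
        hidden = bilinear_step(PARITY_W, hidden, x)
    return hidden.index(1)  # 0 = even number of ones, 1 = odd

result = parity([1, 0, 1, 1])
```

With one-hot inputs, the update amounts to selecting an input-dependent transition matrix and applying it to the hidden state, which is exactly the structure of a finite-state machine.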
Article 525
Title@2025-05-27 (2): General-Reasoner: Advancing LLM Reasoning Across All Domains
Title: General-Reasoner: Advancing LLM Reasoning Across All Domains | General-Reasoner: Bessere LLM-Reasonierung über alle Domains hinweg | 通用Reasoner:在所有领域推推推LLM 2505.14652v4 |
Authors: Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the “Zero” reinforcement learning introduced by DeepSeek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning mainly focuses on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation across these 12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.
Article 526
Title@2025-05-27 (2): Counterfactual Simulatability of LLM Explanations for Generation Tasks
Title: Counterfactual Simulatability of LLM Explanations for Generation Tasks | Counterfactual Simulatability von LLM-Erläuterungen für Generierungsaufgaben | 世代任务LLM解释的反事实模拟性 2505.21740v1 |
Authors: Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown
LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability: how well an explanation allows users to infer the model’s output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.
Article 527
Title@2025-05-27 (2): Non-Markovian Discrete Diffusion with Causal Language Models
Title: Non-Markovian Discrete Diffusion with Causal Language Models | Nicht-Markovianische Diskrepanz mit kausalen Sprachmodellen | 非马尔科维语非马尔科维语分辨语言模式的传播 2502.09767v2 |
Authors: Yangtian Zhang, Sizhuang He, Daniel Levine, Lawrence Zhao, David Zhang, Syed A Rizvi, Emanuele Zappala, Rex Ying, David van Dijk
Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potentially uncorrectable error accumulation. In this paper, we introduce CaDDi, a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.
Article 528
Title@2025-05-27 (2): Assessing and Refining ChatGPT’s Performance in Identifying Targeting and Inappropriate Language: A Comparative Study
Title: Assessing and Refining ChatGPT’s Performance in Identifying Targeting and Inappropriate Language: A Comparative Study | Bewertung und Verfeinerung der Leistung von ChatGPT bei der Identifizierung von Targeting und unangemessener Sprache: Eine vergleichende Studie | 评估和完善聊天部在确定针对性和不适当语言方面的绩效:比较研究 2505.21710v1 |
Authors: Barbarestani Baran, Maks Isa, Vossen Piek
This study evaluates the effectiveness of ChatGPT, an advanced AI model for natural language processing, in identifying targeting and inappropriate language in online comments. With the increasing challenge of moderating vast volumes of user-generated content on social network sites, the role of AI in content moderation has gained prominence. We compared ChatGPT’s performance against crowd-sourced annotations and expert evaluations to assess its accuracy, scope of detection, and consistency. Our findings highlight that ChatGPT performs well in detecting inappropriate content, showing notable improvements in accuracy through iterative refinements, particularly in Version 6. However, its performance in targeting language detection showed variability, with higher false positive rates compared to expert judgments. This study contributes to the field by demonstrating the potential of AI models like ChatGPT to enhance automated content moderation systems while also identifying areas for further improvement. The results underscore the importance of continuous model refinement and contextual understanding to better support automated moderation and mitigate harmful online behavior.
Article 529
Title@2025-05-27 (2): Do We Know What LLMs Don’t Know? A Study of Consistency in Knowledge Probing
Title: Do We Know What LLMs Don’t Know? A Study of Consistency in Knowledge Probing | Wissen wir, was LLMs nicht wissen? Eine Studie der Konsistenz in der Wissensprobe | 我们知道什么是不知道的LLLM不知道的吗?关于知识检验的一致性的研究。 2505.21701v1 |
Authors: Raoyuan Zhao, Abdullatif Köksal, Ali Modarressi, Michael A. Hedderich, Hinrich Schütze
The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent – with decision consistency across methods being as low as 7% – even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.
Article 530
Title@2025-05-27 (2): MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs
Title: MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs | MAKIEval: Ein multilingualer, automatischer WiKidata-basierter Rahmen für die Bewertung des kulturellen Bewusstseins für LLMs | MAKIEval:以多种语言自动维基数据为基础的LLMs文化认识评价框架 2505.21693v1 |
Authors: Raoyuan Zhao, Beiduo Chen, Barbara Plank, Michael A. Hedderich
Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.
Article 531
Title@2025-05-27 (2): LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model
Title: LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model | LLMPR: Ein neuartiges LLM-getriebenes Transfer-Learning-basiertes Petitions-Ranking-Modell | LLMPR:基于请愿排级的新式LLM-驱动转移学习模式 2505.21689v1 |
Authors: Avijit Gayen, Somyajit Chakraborty, Mainak Sen, Soham Paul, Angshuman Jana
The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R² = 0.988, ρ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.
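The Spearman rank correlation reported above measures how closely a predicted priority ordering agrees with a reference ordering. A minimal Python sketch follows, assuming no tied scores (the closed-form formula used here holds only in that case). The petition scores are made-up illustrative values, not ILDC data, and the helper names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical urgency scores for five petitions (illustrative only).
reference = [0.9, 0.1, 0.5, 0.7, 0.3]
predicted = [0.8, 0.2, 0.6, 0.4, 0.3]
rho = spearman_rho(reference, predicted)
```

A value near 1 means the two orderings almost coincide; with tied scores, a tie-aware implementation would be needed instead.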
Article 532
Title@2025-05-27 (2): Empirical analysis of binding precedent efficiency in Brazilian Supreme Court via case classification
Title: Empirical analysis of binding precedent efficiency in Brazilian Supreme Court via case classification | Empirische Analyse der verbindlichen Präzedenzeffizienz im brasilianischen Obersten Gerichtshof über die Fallklassifizierung | 通过案件分类对巴西最高法院具有约束力的先例效率进行经验分析 2407.07004v3 |
Authors: Raphaël Tinarrage, Henrique Ennes, Lucas Resck, Lucas T. Gomes, Jean R. Ponciano, Jorge Poco
Binding precedents (súmulas vinculantes) constitute a juridical instrument unique to the Brazilian legal system and whose objectives include the protection of the Federal Supreme Court against repetitive demands. Studies of the effectiveness of these instruments in decreasing the Court’s exposure to similar cases, however, indicate that they tend to fail in such a direction, with some of the binding precedents seemingly creating new demands. We empirically assess the legal impact of five binding precedents, 11, 14, 17, 26, and 37, at the highest Court level through their effects on the legal subjects they address. This analysis is only possible through the comparison of the Court’s ruling about the precedents’ themes before they are created, which means that these decisions should be detected through techniques of Similar Case Retrieval, which we tackle from the angle of Case Classification. The contributions of this article are therefore twofold: on the mathematical side, we compare the use of different methods of Natural Language Processing – TF-IDF, LSTM, Longformer, and regex – for Case Classification, whereas on the legal side, we contrast the inefficiency of these binding precedents with a set of hypotheses that may justify their repeated usage. We observe that the TF-IDF models performed slightly better than LSTM and Longformer when compared through common metrics; however, the deep learning models were able to detect certain important legal events that TF-IDF missed. On the legal side, we argue that the reasons for binding precedents to fail in responding to repetitive demand are heterogeneous and case-dependent, making it impossible to single out a specific cause. We identify five main hypotheses, which are found in different combinations in each of the precedents studied.
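For readers unfamiliar with TF-IDF, the sketch below shows only the text representation, not the classifier trained on top of it (the abstract does not specify one). It assumes whitespace tokenization and the plain tf * log(N / df) weighting; the three snippets are invented examples, not court decisions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented snippets for illustration only.
docs = [
    "tax dispute over import duties",
    "import duties tax appeal",
    "prisoner access to police investigation files",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

Documents that share rare terms score higher than documents with no overlap, which is the signal a downstream classifier can exploit.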
Article 533
Title@2025-05-27 (2): Probabilistic Reasoning with LLMs for k-anonymity Estimation
Title: Probabilistic Reasoning with LLMs for k-anonymity Estimation | Probabilistische Begründung mit LLMs für k-Anonymitätsschätzung | K-匿名性估计法LLMs的概率推理 2503.09674v3 |
Authors: Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text, that is, the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.
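The factorization idea can be illustrated with the chain rule of probability: the expected number of matching people is the population size times the product of the conditional probabilities of each disclosed attribute. The sketch below is not BRANCH itself (the LLM-driven estimation of each factor is omitted); the function name, population, and probabilities are all hypothetical.

```python
from math import prod


def estimate_k(population, conditional_probs):
    """Expected number of people matching all attributes.

    conditional_probs holds P(attribute_i | earlier attributes), so their
    product is the joint probability by the chain rule.
    """
    probs = list(conditional_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return population * prod(probs)


# Hypothetical example: a city of 1,000,000 people and three disclosed facts.
factors = {
    "is a nurse": 0.01,
    "owns a dog | is a nurse": 0.3,
    "is aged 30-39 | is a nurse, owns a dog": 0.2,
}
k = estimate_k(1_000_000, factors.values())
```

A smaller k means fewer people fit the description, so the text carries a higher re-identification risk.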
Article 534
Title@2025-05-27 (2): Language Model Alignment in Multilingual Trolley Problems
Title: Language Model Alignment in Multilingual Trolley Problems | Sprachmodellausrichtung in Mehrsprachigen Trolley-Problemen | 多语言小龙卷风问题语言模型对齐 2407.02273v6 |
Authors: Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
We evaluate the moral alignment of LLMs with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs’ decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine
Article 535
Title@2025-05-27 (2): Rethinking the Outlier Distribution in Large Language Models: An In-depth Study
Title: Rethinking the Outlier Distribution in Large Language Models: An In-depth Study | Die Outlier-Distribution in großen Sprachmodellen neu denken: Eine vertiefte Studie | 重新思考大语言模型的外部分布:深入研究 2505.21670v1 |
Authors: Rahul Raman, Khushi Sharma, Sai Qian Zhang
Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers in depth. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.
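To see why a single outlier can cause considerable quantization error, consider symmetric per-tensor absmax quantization, where the largest magnitude sets the step size for every value. This is a minimal sketch of that generic scheme, not of the mitigation strategies the paper proposes; the activation values are made up.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor absmax quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    scale = absmax / qmax               # one step of the integer grid
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits=8):
    """Average absolute round-trip error introduced by quantization."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: small values, with and without one large outlier.
normal = [0.1 * i for i in range(-10, 11)]       # roughly the range [-1, 1]
with_outlier = normal + [100.0]

err_normal = mean_abs_error(normal)
err_outlier = mean_abs_error(with_outlier)
```

With the outlier present, the step size grows by a factor of 100, so most of the small values collapse onto the same few grid points and the average error rises sharply.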
Article 536
Title@2025-05-27 (2): R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Title: R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning | R1-Code-Interpreter: LLMs mit Code über überwachtes und verstärktes Lernen zur Vernunft trainieren | R1-Code-Code-解释:通过监督和强化学习,将培训的 “ 理性通识规范 “ 课程 2505.21668v1 |
Authors: Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
Article 537
Title@2025-05-27 (2): Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations
Title: Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations | Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen | 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v1 |
Authors: Zeinab Dehghani, Koorosh Aslansefat, Adil Khan, Mohammed Naveed Akram
Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It creates simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
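The perturb-and-measure idea can be sketched with a leave-one-word-out loop. This is a deliberately simplified stand-in rather than SMILE itself: it assumes a scalar model output and uses the absolute change as the importance score, and `toy_model` is a hypothetical keyword scorer standing in for a real LLM.

```python
def toy_model(prompt):
    """Hypothetical stand-in for an LLM: scores a prompt by keyword hits."""
    weights = {"urgent": 2.0, "refund": 1.5, "please": 0.2}
    return sum(weights.get(word.lower(), 0.0) for word in prompt.split())


def word_importance(prompt, model):
    """Leave-one-word-out importance: output change when each word is dropped."""
    words = prompt.split()
    baseline = model(prompt)
    scores = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores.append((word, abs(baseline - model(perturbed))))
    return scores


scores = word_importance("Please process my urgent refund", toy_model)
top_word = max(scores, key=lambda pair: pair[1])[0]
```

The per-word scores are what a heat map would color; words whose removal leaves the output unchanged receive a score of zero.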
Article 538
Title@2025-05-27 (2): Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts
Title: Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts | Iterative Corpus-Verfeinerung für Material-Eigenschaftsvorhersage auf der Grundlage wissenschaftlicher Texte | 以科学文本为基础的材料财产预测材料性迭代公司精炼 2505.21646v1 |
Authors: Lei Zhang, Markus Stricker
The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the ‘combinatorial explosion’. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.
Article 539
Title@2025-05-27 (2): WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | WISE: Eine weltweite wissensbasierte semantische Evaluation für die Text-zu-Bild-Generierung | WISE:为产生文字到图像制作而进行的世界知识化的语义评价 2503.07265v2 |
Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, Li Yuan
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 sub-domains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
Article 540
Title@2025-05-27 (2): How does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective
Title: How does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective | Wie verbessert Alignment die Mehrsprachigkeitsfähigkeiten von LLMs? | 协调如何增强LLMM的多种语言能力? 2505.21505v1 |
Authors: Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, Jiajun Chen
Multilingual Alignment is an effective and representative paradigm to enhance LLMs’ multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, research on language-specific neurons reveals that some neurons are selectively activated in LLMs when processing different languages. This provides a new perspective to analyze and understand LLMs’ mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs’ internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of “Spontaneous Multilingual Alignment”. Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding multilingual alignment and multilingual capabilities of LLMs.
Article 541
Title@2025-05-27 (2): Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making
Title: Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making | Schweigen ist kein Konsens: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making | 沉默不是共识:通过用于临床决策的Catfish代理商在多方代理LLMs中破坏协议的偏见 2505.21503v1 |
Authors: Yihan Wang, Qiao Yan, Zhenghao Xing, Lihao Liu, Junjun He, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the “catfish effect” in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.
Article 542
Title@2025-05-27 (2): ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Title: ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | ViewSpatial-Bench: Bewertung multi-perspektivischer räumlicher Lokalisierung in Vision-Sprachen-Modellen | 视野空间-空间区:在视觉-语言模型中评价多视角空间空间定位 2505.21500v1 |
Authors: Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera’s perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity’s spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs’ corresponding spatial comprehension capabilities.
Article 543
Title@2025-05-27 (2): Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers | Paper2Poster: Auf dem Weg zur multimodalen Plakatautomatisierung aus wissenschaftlichen Papieren | Paper2Poster:从科学论文中走向多式海报自动化 2505.21497v1 |
Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
Article 544
Title@2025-05-27 (2): UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents | UI-Genie: Ein selbstverbesserender Ansatz zur iterativen Steigerung von MLLM-basierten mobilen GUI-Agenten | UI-Genie: 一种自我改进的方法,用于在刺激下促进基于MLLLM的移动图形界面工具 2505.21496v1 |
Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research in https://github.com/Euphoria16/UI-Genie.
Article 545
Title@2025-05-27 (2): How does Misinformation Affect Large Language Model Behaviors and Preferences?
Title: How does Misinformation Affect Large Language Model Behaviors and Preferences? | Wie wirkt sich Misinformation auf das Verhalten und die Präferenzen von großen Sprachmodellen aus? | 错误信息如何影响大语言模式行为和偏好? 2505.21608v1 |
Authors: Miao Peng, Nuo Chen, Jianheng Tang, Jia Li
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs’ behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs’ ability to detect misinformation. Our study provides valuable insights into LLMs’ interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.
Article 546
Title@2025-05-27 (2): Reinforcing General Reasoning without Verifiers
Title: Reinforcing General Reasoning without Verifiers | Verstärkung der allgemeinen Vernunft ohne Prüfer | 加强一般理由说明,无验证人 2505.21493v1 |
Authors: Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
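One way to read "uses RL to directly maximize the probability of generating the reference answer" is sketched below. The notation is an assumption introduced here, not taken from the abstract: x is the question, z a sampled reasoning trace, y* the reference answer, and π_θ the policy. The gradient follows from the product rule and the log-derivative identity; it is a derivation under this reading, not a statement of the estimator the authors actually use.

```latex
% Assumed objective: expected probability of the reference answer
% under reasoning traces sampled from the policy itself.
J(\theta) = \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
            \bigl[\, \pi_\theta(y^{*} \mid x, z) \,\bigr]

% Gradient: a policy-gradient term on the trace z, weighted by the
% reference-answer probability, plus a likelihood term on y^{*}.
\nabla_\theta J(\theta) = \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
  \Bigl[\, \pi_\theta(y^{*} \mid x, z)
    \bigl( \nabla_\theta \log \pi_\theta(z \mid x)
         + \nabla_\theta \log \pi_\theta(y^{*} \mid x, z) \bigr) \Bigr]
```

Under this reading the reference-answer probability plays the role of the reward for the reasoning trace, which is consistent with the abstract's description of the policy and an implicit verifier being trained in one model.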
Article 547
Title@2025-05-27 (2): Hardware-Efficient Attention for Fast Decoding
Title: Hardware-Efficient Attention for Fast Decoding | Hardware-Effiziente Aufmerksamkeit für schnelle Dekodierung | 快速下标记的硬件高效关注 2505.21487v1 |
Authors: Ted Zadouri, Hubert Strauss, Tri Dao
LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2× faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2×.
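The "roughly half the KV cache" claim can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes an idealized tied design that stores one shared tensor per KV head instead of separate keys and values; it ignores any extra per-key components the actual GTA design may keep, and the model shape is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   tensors_per_head=2, bytes_per_elem=2):
    """Bytes of KV cache for one sequence.

    tensors_per_head=2 stores separate keys and values; 1 models an idealized
    tied key/value state. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return (seq_len * n_layers * n_kv_heads * head_dim
            * tensors_per_head * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
shape = dict(seq_len=8192, n_layers=32, head_dim=128)

# Multi-head attention: one KV head per query head.
mha = kv_cache_bytes(n_kv_heads=32, **shape)
# Grouped-query attention: four query heads share each KV head.
gqa = kv_cache_bytes(n_kv_heads=8, **shape)
# Idealized tied state: keys and values share one tensor per KV head.
tied = kv_cache_bytes(n_kv_heads=8, tensors_per_head=1, **shape)
```

Because every decoding step must read the whole cache from memory, halving its size roughly halves the bytes moved per generated token, which is the quantity the abstract identifies as the bottleneck.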
Article 548
Title@2025-05-27 (2): Are Language Models Consequentialist or Deontological Moral Reasoners?
Title: Are Language Models Consequentialist or Deontological Moral Reasoners? | Sind Sprachmodelle konsequentistische oder deontologische Moralverursacher? | 语言模式是代名词还是代名词道德理由? 2505.21479v1 |
Authors: Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at https://github.com/keenansamway/moral-lens.
nan
Article 549
Title@2025-05-27 (2): Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
Title: Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | Halluzination in großen Vision-Sprachen durch adaptive Aufmerksamkeitskalibrierung abmildern | 通过适应性关注校准减轻大型视觉语言模型中的幻觉 2505.21472v1 |
Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
nan
Article 550
Title@2025-05-27 (2): Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration
Title: Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration | Skalierung externer Wissenseingaben über Kontext hinaus Windows von LLMs über Multi-Agent Collaboration | 通过多机构协作,在LLMM LMLM的 “ 背景视窗 “ 之外扩大外部知识投入 2505.21471v1 |
Authors: Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring a significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributed manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.
nan
Article 551
Title@2025-05-27 (2): Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
Title: Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models | Jenseits von ‘Aha!’: Auf dem Weg zu systematischen Meta-Fähigkeiten Ausrichtung in großen vernünftigen Modellen | 超越“Aha! ” : 在大理由模型中实现系统化的元能力协调 2505.10554v2 |
Authors: Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model’s “aha moment”. However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs’ reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental “aha moments”. Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional gain in performance ceiling for both 7B and 32B models across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
nan
Article 552
Title@2025-05-27 (2): Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
Title: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion | Beschleunigung der Diffusions-Sprachmodell-Inferenz durch effizientes KV-Caching und geführte Diffusion | 通过高效的 KV 抓取和引导传播加速传播语言模式模型推导 2505.21467v1 |
Authors: Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta
Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of diffusion language model (DLM) inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of widely adopted autoregressive models. Our work paves the way for scaling up diffusion language models to a broader scope of applications across different domains.
nan
Article 553
Title@2025-05-27 (2): Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Title: Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions | Erinnerung an KI neu denken: Taxonomie, Operationen, Themen und Zukunftsrichtungen | AI:分类、操作、专题和未来方向 2505.00675v2 |
Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents. While prior surveys have focused on memory applications with LLMs (e.g., enabling personalized memory in conversational agents), they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric and contextual forms, and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLM-based agents while outlining promising directions for future research. The paper list, datasets, methods, and tools are available at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI.
nan
Article 554
Title@2025-05-27 (2): GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization
Title: GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization | GeLLMO: Verallgemeinern von großen Sprachmodellen für Multi-Property-Molekül-Optimierung | GELLMO:通用多财产分子优化大语言模型 2502.13398v2 |
Authors: Vishal Dey, Xiao Hu, Xia Ning
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs’ potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
nan
Article 555
Title@2025-05-27 (2): ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models | ID-Align: RoPE-Conscious Position Remapping für dynamische High-Resolution-Anpassung in Vision-Language-Modellen | 愿景语言模型中动态高分辨率适应的重新绘图 2505.21465v1 |
Authors: Bozhou Li, Wentao Zhang
Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench’s relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
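The ID-inheritance idea described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `inherit_position_ids`, the row-major thumbnail layout, and the integer `scale` factor between the two grids are all assumptions made for illustration. It only shows how each high-resolution token could reuse the position ID of the thumbnail token covering the same image region, so that the largest ID does not grow with resolution.

```python
def inherit_position_ids(thumb_rows, thumb_cols, scale, first_id=0):
    """Assign each high-resolution token the ID of its covering thumbnail token.

    Hypothetical layout: the thumbnail is a thumb_rows x thumb_cols grid whose
    tokens receive consecutive IDs in row-major order starting at first_id, and
    the high-resolution grid is `scale` times larger along each axis.
    """
    hi_rows, hi_cols = thumb_rows * scale, thumb_cols * scale
    return [
        [first_id + (r // scale) * thumb_cols + (c // scale) for c in range(hi_cols)]
        for r in range(hi_rows)
    ]


# A 2x2 thumbnail and a 4x4 high-resolution grid share the IDs 0..3.
ids = inherit_position_ids(thumb_rows=2, thumb_cols=2, scale=2)
```

In this toy layout the 16 high-resolution tokens occupy only four distinct position IDs, which is one way to read the abstract's point about constraining the overexpansion of positional indices.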
nan
Article 556
Title@2025-05-27 (2): Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance
Title: Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance | Müssen LLMs in einer Sprache denken? Korrelation zwischen latenter Sprache und Aufgabenleistung | LLM女士需要用一种语言思考吗? 2505.21458v1 |
Authors: Shintaro Ozaki, Tatsuya Hiraoka, Hiroto Otake, Hiroki Ouchi, Masaru Isonuma, Benjamin Heinzerling, Kentaro Inui, Taro Watanabe, Yusuke Miyao, Yohei Oseki, Yu Takagi
Large Language Models (LLMs) are known to consistently process information in an internal language in which they are proficient, referred to as the latent language, which may differ from the input or output languages. However, how the discrepancy between the latent language and the input and output language affects downstream task performance remains largely unexplored. While many studies research the latent language of LLMs, few address its importance in influencing task performance. In our study, we hypothesize that thinking consistently in the latent language enhances downstream task performance. To validate this, our work varies the input prompt languages across multiple downstream tasks and analyzes the correlation between consistency in latent language and task performance. We create datasets consisting of questions from diverse domains such as translation and geo-culture, which are influenced by the choice of latent language. Experimental results across multiple LLMs on translation and geo-culture tasks, which are sensitive to the choice of language, indicate that maintaining consistency in latent language is not always necessary for optimal downstream task performance. This is because these models adapt their internal representations near the final layers to match the target language, reducing the impact of consistency on overall performance.
nan
Article 557
Title@2025-05-27 (2): Thinking beyond the anthropomorphic paradigm benefits LLM research
Title: Thinking beyond the anthropomorphic paradigm benefits LLM research | Über das anthropomorphe Paradigma hinaus denken Vorteile der LLM-Forschung | 超越人类形态范式的思考 2502.09192v2 |
Authors: Lujain Ibrahim, Myra Cheng
Anthropomorphism, or the attribution of human traits to technology, is an automatic and unconscious response that occurs even in those with advanced technical expertise. In this position paper, we analyze hundreds of thousands of research articles to present empirical evidence of the prevalence and growth of anthropomorphic terminology in research on large language models (LLMs). We argue for challenging the deeper assumptions reflected in this terminology – which, though often useful, may inadvertently constrain LLM development – and broadening beyond them to open new pathways for understanding and improving LLMs. Specifically, we identify and examine five anthropomorphic assumptions that shape research across the LLM development lifecycle. For each assumption (e.g., that LLMs must use natural language for reasoning, or that they should be evaluated on benchmarks originally meant for humans), we demonstrate empirical, non-anthropomorphic alternatives that remain under-explored yet offer promising directions for LLM research and development.
nan
Article 558
Title@2025-05-27 (2): Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication
Title: Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication | Worte wie Messer: Rückseitig-Personalisierte Modellierung und Erkennung von gewalttätiger Kommunikation | 象Knives这样的词:后台化个人化和暴力通信建模和侦查 2505.21451v1 |
Authors: Jocelyn Shen, Akhila Yerukola, Xuhui Zhou, Cynthia Breazeal, Maarten Sap, Hae Won Park
Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.
nan
Article 559
Title@2025-05-27 (2): One-shot Entropy Minimization
Title: One-shot Entropy Minimization | Ein Schuss Entropie Minimierung | 单向最小化 Entropy 最小化 2505.20282v2 |
Authors: Zitian Gao, Lynx Chen, Joey Zhou, Bryan Dai
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
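As a rough illustration of the quantity being minimized, the sketch below computes the average Shannon entropy of a model's next-token distributions. The abstract does not spell out the exact loss, so this is the generic entropy objective under stated assumptions, not the authors' training code; the names `token_entropy` and `entropy_loss` are hypothetical, and distributions are passed as plain probability lists rather than model logits.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_loss(distributions):
    """Mean entropy over the generated positions (generic objective, assumed form)."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# A peaked distribution has lower entropy than a uniform one, so gradient
# steps on this loss push the model toward more confident predictions.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
loss = entropy_loss([uniform, peaked])
```

No labels appear anywhere in this objective, which is why a single unlabeled prompt is enough to define it.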
nan
Article 560
Title@2025-05-27 (2): When Two LLMs Debate, Both Think They’ll Win
Title: When Two LLMs Debate, Both Think They’ll Win | Wenn zwei LLMs diskutieren, denken beide, dass sie gewinnen werden | 当两个LLM 辩论, 双方都认为他们会赢 2505.19184v2 |
Authors: Pradyumna Shyama Prasad, Minh Nhat Nguyen
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models’ private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.
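The zero-sum argument in this abstract comes down to simple arithmetic: if exactly one side wins, the two true win probabilities sum to 100%, so any excess over 100 in the stated confidences is overconfidence. The sketch below makes that check explicit; the function names and the default 75% threshold (taken from pattern 3) are illustrative choices, not part of the authors' code.

```python
def overconfidence_gap(conf_a, conf_b):
    """Amount (in percentage points) by which two stated win confidences exceed 100."""
    return conf_a + conf_b - 100.0


def mutually_overconfident(conf_a, conf_b, threshold=75.0):
    """True when both debaters claim at least `threshold` percent chance of winning."""
    return conf_a >= threshold and conf_b >= threshold


# Two debaters who each report the average opening confidence of 72.9%
# jointly claim about 45.8 points more probability than is available.
gap = overconfidence_gap(72.9, 72.9)
```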
nan
Article 561
Title@2025-05-27 (2): Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs
Title: Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | Die Hitze aufdrehen: Min-p-Sampling für kreative und kohärente LLM-Ausgaben | 翻开热热:创意和一致的LLM产出的最小抽样 2407.01082v6 |
Authors: Minh Nhat Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, Ravid Shwartz-Ziv
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, which lead to incoherent or repetitive outputs. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model’s confidence by using the top token’s probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures. Human evaluations further show a clear preference for min-p sampling, in both text quality and creativity. Min-p sampling has been adopted by popular open-source LLM frameworks, including Hugging Face Transformers, vLLM, and many others, highlighting its considerable impact on improving text generation quality.
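The truncation rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any framework's implementation: the parameter name `p_base` and both function names are assumptions, the input is an already-normalized probability list, and temperature scaling is left out.

```python
import random


def min_p_filter(probs, p_base=0.1):
    """Zero out tokens below p_base * max(probs), then renormalize the rest."""
    threshold = p_base * max(probs)  # threshold scales with the top token's probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]


def sample_min_p(probs, p_base=0.1, rng=random):
    """Draw one token index from the min-p truncated distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


# Confident model: only the dominant token survives.
confident = min_p_filter([0.9, 0.05, 0.05], p_base=0.1)
# Uncertain model: the same p_base keeps several candidates.
uncertain = min_p_filter([0.5, 0.3, 0.15, 0.05], p_base=0.2)
```

Because the cutoff is relative to the top token, the candidate set shrinks when the model is confident and widens when it is not, which is the behavior the abstract credits for coherence at higher temperatures.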
nan
Article 562
Title@2025-05-27 (2): ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition
Title: ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition | ANCHOLIK-NER: Ein Benchmark-Datensatz für Bangla Regional Named Entity Recognition | ANCHOLIK-NER:孟加拉地区命名实体识别基准数据集 2502.11198v3 |
Authors: Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor
Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address the challenge of regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences distributed across five regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models (Bangla BERT, Bangla BERT Base, and BERT Base Multilingual Cased) on this dataset. Our findings demonstrate that BERT Base Multilingual Cased performs best in recognizing named entities across regions, with notably strong performance in Mymensingh, where it reaches an F1-score of 82.611%. Despite strong overall performance, challenges remain in regions like Chittagong, where the models show lower precision and recall. Since no previous NER systems for Bangla regional dialects exist, our work represents a foundational step in addressing this gap. Future work will focus on improving model performance in underperforming regions and expanding the dataset to include more dialects, enhancing the development of dialect-aware NER systems.
nan
Article 563
Title@2025-05-27 (2): Towards Better Instruction Following Retrieval Models
Title: Towards Better Instruction Following Retrieval Models | Auf dem Weg zu einer besseren Instruktion nach den Modellen des Wiedereintritts | 在检索模型后改进教学 2505.21439v1 |
Authors: Yuchen Zhuang, Aaron Trinh, Rushi Qiang, Haotian Sun, Chao Zhang, Hanjun Dai, Bo Dai
Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, which are then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, a metric of instruction-following capability.
nan
Article 564
Title@2025-05-27 (2): Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge
Title: Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge | Agentisches medizinisches Wissen Grafiken verbessern medizinische Frageantworten: Die Lücke zwischen LLMs und sich entwickelndem medizinischem Wissen überbrücken | 药用知识图加强医疗问题的回答:缩小LLMM与不断发展的医学知识之间的差距 2502.13010v2 |
Authors: Mohammad Reza Rezaei, Reza Saadati Fard, Rahul G. Krishnan, Milad Lankarany
Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Agentic Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.
nan
Article 565
Title@2025-05-27 (2): Transparent and Coherent Procedural Mistake Detection
Title: Transparent and Coherent Procedural Mistake Detection | Transparente und kohärente Verfahrensfehlererkennung | 透明和一致的程序错误侦测 2412.11927v2 |
Authors: Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that while VLMs struggle off-the-shelf, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods, though not without tradeoffs. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
nan
Article 566
Title@2025-05-27 (2): R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Title: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing | R2R: Effizientes Navigieren unterschiedlicher Vernunftpfade mit klein-großen Model Token Routing | R2R: 以小型模型调速器有效导航差异性理性路径 2505.21600v1 |
Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs’ reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
nan
Article 567
Title@2025-05-27 (2): Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives
Title: Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives | Datenmixtur für große Sprachmodelle neu denken: Eine umfassende Umfrage und neue Perspektiven | 重新思考大语言模型的数据组合:全面调查和新视角 2505.21598v1 |
Authors: Yajiao Liu, Congliang Chen, Junchi Yang, Ruoyu Sun
Training large language models with data collected from various domains can improve their performance on downstream tasks. However, given a fixed training budget, the sampling proportions of these different domains significantly impact the model’s performance. How can we determine the domain weights across different data domains to train the best-performing model within constrained computational resources? In this paper, we provide a comprehensive overview of existing data mixture methods. First, we propose a fine-grained categorization of existing methods, extending beyond the previous offline and online classification. Offline methods are further grouped into heuristic-based, algorithm-based, and function fitting-based methods. For online methods, we categorize them into three groups: online min-max optimization, online mixing law, and other approaches by drawing connections with the optimization frameworks underlying offline methods. Second, we summarize the problem formulations, representative algorithms for each subtype of offline and online methods, and clarify the relationships and distinctions among them. Finally, we discuss the advantages and disadvantages of each method and highlight key challenges in the field of data mixture.
Article 568
Title@2025-05-27 (2): A Lightweight Method to Disrupt Memorized Sequences in LLM
Title: A Lightweight Method to Disrupt Memorized Sequences in LLM | Eine leichte Methode zum Disruptieren von gemerkten Sequenzen in LLM | LLM 中破坏记忆序列的轻量方法 2502.05159v2 |
Authors: Parjanya Prajakta Prashant, Kaustubh Ponkshe, Babak Salimi
As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights, making them impractical for most users. In this work, we introduce TokenSwap, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words, and crucially, they memorize far less. By selectively swapping token probabilities between models, TokenSwap preserves the capabilities of large models while reducing their propensity for verbatim reproduction. Evaluations on Pythia-6.9B and Llama-3-8B show up to a 10x drop in exact memorization with negligible task degradation. Our method offers a practical, accessible solution for mitigating memorized generation in deployed LLMs.
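A minimal sketch of what "swapping token probabilities" could look like for a single decoding step is given below. The function name `swap_probs`, the toy distributions, and the simple renormalization are assumptions made for illustration; the abstract does not specify the exact swapping or normalization rule used by TokenSwap.

```python
from typing import Dict, Set

def swap_probs(
    p_large: Dict[str, float],
    p_small: Dict[str, float],
    swap_tokens: Set[str],
) -> Dict[str, float]:
    """Take the small model's probability for tokens in `swap_tokens`
    (e.g. common function words), keep the large model's probability
    elsewhere, then renormalize so the result sums to 1."""
    mixed = {
        tok: (p_small.get(tok, 0.0) if tok in swap_tokens else p)
        for tok, p in p_large.items()
    }
    total = sum(mixed.values())
    if total <= 0.0:
        raise ValueError("all probability mass was removed by the swap")
    return {tok: p / total for tok, p in mixed.items()}

# Toy next-token distributions over a three-word vocabulary (made up).
p_large = {"the": 0.6, "cat": 0.3, "of": 0.1}
p_small = {"the": 0.2, "cat": 0.1, "of": 0.7}
mixed = swap_probs(p_large, p_small, swap_tokens={"the", "of"})
print(mixed)
```

Only token-level probabilities are needed here, which matches the abstract's setting where the user has no access to model weights.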
Article 569
Title@2025-05-27 (2): Can Large Language Models Understand Symbolic Graphics Programs?
Title: Can Large Language Models Understand Symbolic Graphics Programs? | Können große Sprachmodelle symbolische Grafikprogramme verstehen? | 大语言模型能理解符号图形程序吗? 2408.08313v4 |
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks that the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM’s ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to “imagine” and reason about how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability, Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLMs’ understanding of symbolic programs, but also improves general reasoning ability on various other benchmarks.
Article 570
Title@2025-05-27 (2): Efficiently Scaling LLM Reasoning with Certaindex
Title: Efficiently Scaling LLM Reasoning with Certaindex | Effiziente Skalierung der LLM-Vernunft mit bestimmtem Dex | 高效扩增 LLM 使用 emitedex 说明 2412.20993v2 |
Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, Hao Zhang
Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final answer. To quantify this phenomenon, we introduce Certaindex, an algorithm-agnostic metric measuring this evolving stability, signaling when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and further enables dynamic token allocation, gang scheduling, and other optimizations when integrated with real-world LLM serving systems. To quantify real-world benefits, we built Certaindex as a scheduler into Dynasor, our reasoning-aware LLM serving system, and demonstrate up to 50% compute savings and 3.3x higher throughput in real workloads with no accuracy drop. Our code is available at https://github.com/hao-ai-lab/Dynasor.git
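The answer-stabilization observation suggests a simple early-exit rule, sketched below. This is only a crude proxy and not the Certaindex metric itself: `stable_exit_step`, the `patience` parameter, and the sample answers are hypothetical.

```python
from typing import List, Optional

def stable_exit_step(answers: List[str], patience: int = 3) -> Optional[int]:
    """Return the 1-based step at which the intermediate answer has stayed
    unchanged for `patience` consecutive steps, or None if it never does."""
    run = 0
    prev = object()  # sentinel that equals no real answer
    for step, answer in enumerate(answers, start=1):
        run = run + 1 if answer == prev else 1
        prev = answer
        if run >= patience:
            return step
    return None

# Intermediate answers probed after each reasoning chunk (made-up values).
probe_answers = ["12", "15", "42", "42", "42", "42"]
exit_step = stable_exit_step(probe_answers, patience=3)
print(exit_step)  # reasoning could stop here instead of running to the end
```

A serving system could use such a signal to stop generation early and reassign the saved compute to other requests, which is the kind of integration the abstract describes for Dynasor.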
Article 571
Title@2025-05-27 (2): RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation
Title: RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation | RefTool: Modellverbesserung mit referenzgeführter Werkzeugerstellung | RefTool:在创建参考指导工具时加强示范理由 2505.21413v1 |
Authors: Xiao Liu, Da Yin, Zirui Wu, Yansong Feng
Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models’ internal knowledge and would fail in domains beyond the LLMs’ knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.
Article 572
Title@2025-05-27 (2): How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Title: How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation | Wie man sich vor 5G Strahlung schützt? LLM-Antworten auf Implizite Fehlinformationen untersuchen | 如何保护自己免受5G辐射? 调查隐蔽的错误信息的LLM反应 2503.09598v2 |
Authors: Ruohao Guo, Wei Xu, Alan Ritter
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs’ capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
Article 573
Title@2025-05-27 (2): RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models
Title: RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models | RelationalFactQA: Ein Benchmark für die Bewertung tabellarischer Fakten aus großen Sprachmodellen | 关系事实QA:从大语言模型中评估列表事实检索的基准 2505.21409v1 |
Authors: Dario Satriani, Enzo Veltri, Donatello Santoro, Paolo Papotti
Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs’ ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.
Article 574
Title@2025-05-27 (2): Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling
Title: Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling | Factual Self-Awareness in Sprachmodellen: Repräsentation, Robustheit und Skalierung | 语言模式中的事实自觉意识:代表性、强力和比例 2505.21399v1 |
Authors: Hovhannes Tamoyan, Subhabrata Dutta, Iryna Gurevych
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer’s residual stream that dictate whether the model will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.
Article 575
Title@2025-05-27 (2): DecisionFlow: Advancing Large Language Model as Principled Decision Maker
Title: DecisionFlow: Advancing Large Language Model as Principled Decision Maker | DecisionFlow: Großes Sprachmodell als prinzipieller Entscheidungsträger voranbringen | 决定Flow:作为有原则的决策人推进大语言模式 2505.21397v1 |
Authors: Xiusi Chen, Shanyong Wang, Cheng Qian, Hongru Wang, Peixuan Han, Heng Ji
In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model’s reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. We release the data and code at https://github.com/xiusic/DecisionFlow.
Article 576
Title@2025-05-27 (2): Leveraging Large Language Models for Active Merchant Non-player Characters
Title: Leveraging Large Language Models for Active Merchant Non-player Characters | Nutzung großer Sprachmodelle für aktive Händler Nicht-Spieler-Charaktere | 利用大型语言模型为活跃的商机非玩家字符发挥杠杆作用 2412.11189v3 |
Authors: Byungjun Kim, Minju Kim, Dayeon Seo, Bugeun Kim
We highlight two significant issues leading to the passivity of current merchant non-player characters (NPCs): pricing and communication. While immersive interactions with active NPCs have been a focus, price negotiations between merchant NPCs and players remain underexplored. First, passive pricing refers to the limited ability of merchants to modify predefined item prices. Second, passive communication means that merchants can only interact with players in a scripted manner. To tackle these issues and create an active merchant NPC, we propose a merchant framework based on large language models (LLMs), called MART, which consists of an appraiser module and a negotiator module. We conducted two experiments to explore various implementation options under different training methods and LLM sizes, considering a range of possible game environments. Our findings indicate that finetuning methods, such as supervised finetuning (SFT) and knowledge distillation (KD), are effective in using smaller LLMs to implement active merchant NPCs. Additionally, we found three irregular cases arising from the responses of LLMs.
Article 577
Title@2025-05-27 (2): Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science
Title: Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science | Verbesserung der Forschungsideenerzeugung durch Daten: Eine empirische Untersuchung in der Sozialwissenschaft | 《通过数据改进研究概念的产生:社会科学经验调查》 2505.21396v1 |
Authors: Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang
Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.
Article 578
Title@2025-05-27 (2): Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Title: Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback | Align-SLM: Textlose gesprochene Sprachmodelle mit Verstärkung Lernen von KI Feedback | Aleign-SLM-Align-SLM:利用AI反馈学习强化的无文字口语模式 2411.01834v2 |
Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, Ivan Bulyko
While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
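The pipeline in this abstract (sample several continuations, score them with a semantic metric, build preference pairs, train with DPO) can be sketched as follows. The helper names and the toy scores are hypothetical, and picking the best and worst continuation is only one plausible pairing rule; the loss is the standard DPO objective for a single pair.

```python
import math
from typing import List, Tuple

def build_preference_pair(continuations: List[str], scores: List[float]) -> Tuple[str, str]:
    """Pick the highest-scoring continuation as 'chosen' and the
    lowest-scoring one as 'rejected' (an assumed selection rule)."""
    ranked = sorted(zip(scores, continuations), reverse=True)
    return ranked[0][1], ranked[-1][1]

def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin), where the
    margin compares policy-vs-reference log-ratios of chosen and rejected."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))

# Made-up continuations with made-up semantic scores.
chosen, rejected = build_preference_pair(
    ["cont_a", "cont_b", "cont_c"], [0.2, 0.9, 0.5]
)
print(chosen, rejected, dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and the reference model agree on both continuations, the margin is zero and the loss equals log 2; the loss falls as the policy favors the chosen continuation more strongly than the reference does.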
Article 579
Title@2025-05-27 (2): AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Title: AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs | AutoJudger: Ein agentengestütztes Framework für effizientes Benchmarking von MLLMs | Autojudger: MLLMs 高效基准设定的代理驱动框架 2505.21389v1 |
Authors: Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To address this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model’s real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses; for example, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation on MMT-Bench.
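For readers unfamiliar with IRT, the sketch below shows the classic one-parameter (Rasch) model and information-based question selection. It illustrates the textbook idea only: AutoJudger delegates selection to an agent with retrieval and memory, and the abstract does not say which IRT variant it uses. All names and difficulty values here are hypothetical.

```python
import math
from typing import Dict, Set

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: probability that a model with the given ability
    answers a question of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def fisher_information(ability: float, difficulty: float) -> float:
    """Information a question carries about ability; it is largest when
    the question's difficulty is close to the current ability estimate."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)

def most_informative(ability: float, difficulties: Dict[str, float], asked: Set[str]) -> str:
    """Return the not-yet-asked question with maximal Fisher information."""
    candidates = [q for q in difficulties if q not in asked]
    return max(candidates, key=lambda q: fisher_information(ability, difficulties[q]))

# Made-up question difficulties on the usual logit scale.
difficulties = {"q1": -1.0, "q2": 0.4, "q3": 3.0}
first = most_informative(0.5, difficulties, asked=set())
second = most_informative(0.5, difficulties, asked={first})
print(first, second)
```

Selecting questions near the current ability estimate is why adaptive testing can rank models with a small fraction of a benchmark.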
Article 580
Title@2025-05-27 (2): VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Title: VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | VoxEval: Benchmarking des Wissensverständnisses Fähigkeiten von End-to-End gesprochenen Sprachmodellen | VoxEval:确定端至端口语语言模式知识理解能力基准 2501.04962v4 |
Authors: Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs’ knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs’ knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark can guide the advancement of more sophisticated and reliable SLMs. The VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
Article 581
Title@2025-05-27 (2): PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense
Title: PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense | PHISH in MESH: Koreanische Adversarial Phonetische Substitution und phonetisch-semantische Feature-Integration Verteidigung | MESH的PHISH:韩国反电话替代和音-声-声-声-声-声-声-声-地物融合国防 2505.21380v1 |
Authors: Byungjun Kim, Minju Kim, Hyeonchu Park, Bugeun Kim
As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector’s robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.
Article 582
Title@2025-05-27 (2): Analyzing values about gendered language reform in LLMs’ revisions
Title: Analyzing values about gendered language reform in LLMs’ revisions | Analysieren von Werten über die Reform der Geschlechtersprachen in LLM-Revisionen | 在LLLM女士的修订中分析关于性别语言改革的价值观 2505.21378v1 |
Authors: Jules Watson, Xi Wang, Raymond Liu, Suzanne Stevenson, Barend Beekhuizen
Within the common LLM use case of text revision, we study LLMs’ revision of gendered role nouns (e.g., outdoorsperson/woman/man) and their justifications of such revisions. We evaluate their alignment with feminist and trans-inclusive language reforms for English. Drawing on insight from sociolinguistics, we further assess if LLMs are sensitive to the same contextual effects in the application of such reforms as people are, finding broad evidence of such effects. We discuss implications for value alignment.
Article 583
Title@2025-05-27 (2): Path Pooling: Training-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation
Title: Path Pooling: Training-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation | Pfad-Pooling: Training-freie Struktur-Verbesserung für effizientes Wissen Graph Retrieval-Augmented Generation | 集路道路:为高效知识图检索-启动型一代加强培训-免费结构 2503.05203v2 |
Authors: Hairu Wang, Yuan Feng, Xike Xie, S Kevin Zhou
Although Large Language Models achieve strong success in many tasks, they still suffer from hallucinations and knowledge deficiencies in real-world applications. Many knowledge graph-based retrieval-augmented generation (KG-RAG) methods enhance the quality and credibility of LLMs by leveraging structure and semantic information in KGs as external knowledge bases. However, these methods struggle to effectively incorporate structure information, either incurring high computational costs or underutilizing available knowledge. Inspired by smoothing operations in graph representation learning, we propose path pooling, a simple, training-free strategy that introduces structure information through a novel path-centric pooling operation. It seamlessly integrates into existing KG-RAG methods in a plug-and-play manner, enabling richer structure information utilization. Extensive experiments demonstrate that incorporating the path pooling into the state-of-the-art KG-RAG method consistently improves performance across various settings while introducing negligible additional cost.
Article 584
Title@2025-05-27 (2): Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History
Title: Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History | LLM-Anpassung an soziodemographische Faktoren bewerten: Benutzerprofil vs. Dialoggeschichte | 评价LLLM适应社会人口因素:用户概况与对话历史 2505.21362v1 |
Authors: Qishuai Zhong, Zongmin Li, Siqi Fan, Aixin Sun
Effective engagement by large language models (LLMs) requires adapting responses to users’ sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs’ behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.
Article 585
Title@2025-05-27 (2): Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning
Title: Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning | Select2Reason: Effiziente Instruction-Tuning-Datenauswahl für Long-CoT-Reasoning | 选择2Reason: 用于长期成本计算理由的高效指令导出数据选择 2505.17266v2 |
Authors: Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Xiaojun Wu, Honghao Liu, Hui Xiong, Jian Guo
A practical approach to activate long chain-of-thought (long-CoT) reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate question difficulty and jointly incorporates a reasoning trace length-based heuristic through a weighted ranking scheme to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across data sizes, efficiency during inference, and adaptability to other instruction pools with minimal cost.
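The weighted ranking over difficulty and trace length might be sketched as below. The min-max normalization, the `weight` value, and the field names are assumptions chosen for illustration; the abstract does not give the exact scoring formula.

```python
from typing import Dict, List

def select_top(samples: List[Dict], weight: float = 0.5, fraction: float = 0.1) -> List[int]:
    """Rank samples by a weighted sum of normalized difficulty and
    normalized reasoning-trace length, and return the ids of the top fraction."""
    def minmax(values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    difficulty = minmax([s["difficulty"] for s in samples])
    length = minmax([float(s["trace_len"]) for s in samples])
    scored = sorted(
        zip(samples, difficulty, length),
        key=lambda t: weight * t[1] + (1.0 - weight) * t[2],
        reverse=True,
    )
    keep = max(1, int(len(samples) * fraction))
    return [s["id"] for s, _, _ in scored[:keep]]

# Ten made-up samples in which difficulty and trace length grow together.
pool = [{"id": i, "difficulty": i / 10, "trace_len": 100 * i} for i in range(10)]
selected = select_top(pool, weight=0.5, fraction=0.2)
print(selected)
```

The `weight` parameter trades off the two signals; setting it to 1.0 ranks by difficulty alone and 0.0 by trace length alone.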
Article 586
Title@2025-05-27 (2): Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers
Title: Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers | Häufigkeitsfragen: Modellierung unregelmäßiger morphologischer Muster auf Spanisch mit Transformern | 频率事项:用变换器模拟西班牙文的非正常形态模式 2410.21013v4 |
Authors: Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney
Over the past decade, various studies have addressed how speakers solve the so-called “Paradigm Cell Filling Problem” (PCFP) (Ackerman et al., 2009) across different languages. The PCFP addresses a fundamental question in morphological processing: how do speakers accurately generate inflected forms of words when presented with incomplete paradigms? This problem is particularly salient when modeling complex inflectional systems. We focus on Spanish verbal paradigms, where certain verbs follow an irregular L-shaped pattern in which the first-person singular present indicative stem matches the stem used throughout the present subjunctive mood. We formulate the problem as a morphological reinflection task. Specifically, we investigate the role of input frequency in the acquisition of regular versus irregular L-shaped patterns in transformer models. By systematically manipulating the input distributions and analyzing model behavior, we reveal four key findings: 1) Models perform better on L-shaped verbs compared to regular verbs, especially in uneven frequency conditions; 2) Robust primacy effects are observed, but no consistent recency effects; 3) Memorization becomes more prominent as the proportion of L-shaped verbs increases; 4) There is a tendency to regularize L-shaped verbs when their consonant alternation pairs are rare or absent in the training data.
Article 587
Title@2025-05-27 (2): Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning
Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning | Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung | 利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题 2505.21354v1 |
Authors: Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah
Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language’s low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.
Article 588
Title@2025-05-27 (2): The Multilingual Divide and Its Impact on Global AI Safety
Title: The Multilingual Divide and Its Impact on Global AI Safety | Die Mehrsprachigkeit und ihre Auswirkungen auf die globale KI-Sicherheit | 多语言鸿沟及其对全球独立国际协会安全的影响 2505.21344v1 |
Authors: Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker
Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the “language gap” in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to addressing these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.
Article 589
Title@2025-05-27 (2): Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts
Title: Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts | Nutzung großer Sprachmodelle und traditioneller Machine-Learning-Ensembles zur ADHD-Erkennung aus erzählerischen Transkripten | 利用大型语言模式和传统机器学习群群,从叙述性记录誊本中探测ADHD 2505.21324v1 |
Authors: Yuxin Zhu, Yuting Guo, Noah Marchuck, Abeed Sarker, Yun Wang
Despite rapid advances in large language models (LLMs), their integration with traditional supervised machine learning (ML) techniques that have proven applicability to medical data remains underexplored. This is particularly true for psychiatric applications, where narrative data often exhibit nuanced linguistic and contextual complexity, and can benefit from the combination of multiple models with differing characteristics. In this study, we introduce an ensemble framework for automatically classifying Attention-Deficit/Hyperactivity Disorder (ADHD) diagnosis (binary) using narrative transcripts. Our approach integrates three complementary models: LLaMA3, an open-source LLM that captures long-range semantic structure; RoBERTa, a pre-trained transformer model fine-tuned on labeled clinical narratives; and a Support Vector Machine (SVM) classifier trained using TF-IDF-based lexical features. These models are aggregated through a majority voting mechanism to enhance predictive robustness. The dataset comprises 441 instances: 352 for training and 89 for validation. Empirical results show that the ensemble outperforms individual models, achieving an F$_1$ score of 0.71 (95% CI: [0.60-0.80]). Compared to the best-performing individual model (SVM), the ensemble improved recall while maintaining competitive precision. This indicates the strong sensitivity of the ensemble in identifying ADHD-related linguistic cues. These findings demonstrate the promise of hybrid architectures that leverage the semantic richness of LLMs alongside the interpretability and pattern recognition capabilities of traditional supervised ML, offering a new direction for robust and generalizable psychiatric text classification.
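The majority voting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the tie-breaking rule, and the per-model predictions below are all hypothetical and chosen only to show how three binary classifiers might be combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most models.

    Ties (possible with an even number of voters) are broken in favor of
    the earliest-listed model; this rule is an assumption of the sketch.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical outputs for three transcripts (1 = ADHD, 0 = non-ADHD).
llm_preds = [1, 0, 1]      # stand-in for the LLM's labels
roberta_preds = [1, 1, 0]  # stand-in for the fine-tuned transformer's labels
svm_preds = [0, 0, 1]      # stand-in for the TF-IDF + SVM labels

# One vote per model for each transcript.
ensemble = [
    majority_vote(votes)
    for votes in zip(llm_preds, roberta_preds, svm_preds)
]
print(ensemble)
```

With three voters and two classes a tie cannot occur, so the tie-breaking rule only matters if the sketch is reused with an even number of models.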
Article 590
Title@2025-05-27 (2): Interlocking-free Selective Rationalization Through Genetic-based Learning
Title: Interlocking-free Selective Rationalization Through Genetic-based Learning | Interlocking-free Selektive Rationalisierung durch gentechnisch-basiertes Lernen | 通过基于遗传的学习实现互连、无互闭和无互换的选择性合理化 2412.10312v2 |
Authors: Federico Ruggeri, Gaetano Signorelli
A popular end-to-end architecture for selective rationalization is the select-then-predict pipeline, comprising a generator to extract highlights fed to a predictor. Such a cooperative system suffers from suboptimal equilibrium minima due to the dominance of one of the two modules, a phenomenon known as interlocking. While several contributions have aimed to address interlocking, they only mitigate its effect, often by introducing feature-based heuristics, sampling, and ad-hoc regularizations. We present GenSPP, the first interlocking-free architecture for selective rationalization that does not require any learning overhead of the kinds mentioned above. GenSPP avoids interlocking by performing disjoint training of the generator and predictor via genetic global search. Experiments on a synthetic and a real-world benchmark show that our model outperforms several state-of-the-art competitors.
Article 591
Title@2025-05-27 (2): Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants
Title: Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants | Optimierung der fMRI-Datenerfassung für die Dekodierung von Natural Speech mit begrenzten Teilnehmern | 优化FMRI数据获取,以便与有限参加者进行自然演讲 2505.21304v1 |
Authors: Louis Jalouzot, Alexis Thual, Yair Lakretz, Christophe Pallier, Bertrand Thirion
We investigate optimal strategies for decoding perceived natural speech from fMRI data acquired from a limited number of participants. Leveraging Lebel et al. (2023)’s dataset of 8 participants, we first demonstrate the effectiveness of training deep neural networks to predict LLM-derived text representations from fMRI activity. Then, in this data regime, we observe that multi-subject training does not improve decoding accuracy compared to a single-subject approach. Furthermore, training on similar or different stimuli across subjects has a negligible effect on decoding accuracy. Finally, we find that our decoders better model syntactic than semantic features, and that stories containing sentences with complex syntax or rich semantic content are more challenging to decode. While our results demonstrate the benefits of having extensive data per participant (deep phenotyping), they suggest that leveraging multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort.
Article 592
Title@2025-05-27 (2): How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian
Title: How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian | Wie Menschen und LLMs konzeptionelles Wissen organisieren: Untergeordnete Kategorien auf Italienisch erforschen | 人类和LLMs如何组织概念知识:探索意大利的次类 2505.21301v1 |
Authors: Andrea Pedrotti, Giulia Rambelli, Caterina Villani, Marianna Bolognesi
People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then use these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.
Article 593
Title@2025-05-27 (2): rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
Title: rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset | rStar-Coder: Scaling Competitive Code Reasoning mit einem Large-Scale Verifizierten Datensatz | rStar-Coder:扩大竞争守则,以大型核实数据集为依据 2505.21297v1 |
Authors: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
Article 594
Title@2025-05-27 (2): OR-Bench: An Over-Refusal Benchmark for Large Language Models
Title: OR-Bench: An Over-Refusal Benchmark for Large Language Models | OR-Bench: Ein überwiderlegbarer Benchmark für große Sprachmodelle | OR-Bench:大语言模式的过度拒绝基准 2405.20947v4 |
Authors: Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench. We hope this benchmark can help the community develop better safety-aligned models.
Article 595
Title@2025-05-27 (2): Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation
Title: Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation | Auf dem Weg zur Anpassung von Open Source großen Sprachmodellen für die Erstellung klinischer Notizen auf Expertenebene | 努力调整用于专家级临床笔记制作的开放源大语言模型 2405.00715v6 |
Authors: Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun
Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pretraining, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (92.8%) of individual evaluations rated the notes generated by LLaMA-Clinic as “acceptable” or higher across three criteria: real-world readiness, completeness, and accuracy. In the more challenging “Assessment and Plan” section, LLaMA-Clinic matched physician-authored notes in real-world readiness score. We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a “best practice” note format, rather than relying on LLMs to determine this for clinical practice.
Article 596
Title@2025-05-27 (2): MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Title: MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models | MMUnlearner: Reformulierung multimodaler Maschinenentlernen im Zeitalter multimodaler großer Sprachmodelle | MMULALINER:在多模式大语言模式时代重新推出多模式机器 2502.11051v4 |
Authors: Jiahao Huo, Yibo Yan, Xu Zheng, Yuanhuiyi Lyu, Xin Zou, Zhihua Wei, Xuming Hu
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient ascent method, MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that finetune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code can be found in this URL.
Article 597
Title@2025-05-27 (2): SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Title: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs | SoftCoT: Soft Chain-of-Thought für effizientes Nachdenken mit LLMs | SoftCot: 寻求与LLMs高效合理解释的软链 2502.12134v2 |
Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the LLM. Specifically, we employ a lightweight fixed assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM’s representation space via a trainable projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning. Source code is available at https://github.com/xuyige/SoftCoT.
Article 598
Title@2025-05-27 (2): Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs
Title: Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs | Feintuning auf unterschiedlichen aufschlussreichen Ketten treibt die Inferenz CoT-Verfeinerung in LLMs an | 对多种有理链条的精细调整 2407.03181v2 |
Authors: Haritz Puerto, Tilek Chubakov, Xiaodan Zhu, Harish Tayyar Madabushi, Iryna Gurevych
Requiring a large language model (LLM) to generate intermediary reasoning steps, known as Chain of Thought (CoT), has been shown to be an effective way of boosting performance. Previous approaches have focused on generating multiple independent CoTs, combining them through ensembling or other post-hoc strategies to enhance reasoning. In this work, we introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step, which is fundamentally different from prior work that primarily operates on parallel CoT generations. DCoT allows LLMs to gain the ability to perform within-inference refinement of reasoning chains without requiring external feedback. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales (1.3B to 70B). These improvements are particularly impactful for tasks with a large result state space, such as those involving numeric answers. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models’ ability to refine an initial reasoning chain by generating a second, improved chain within the same inference step, demonstrating previously elusive self-improvement. Our code and data are publicly available at https://github.com/UKPLab/acl2025-diverse-cot.
Article 599
Title@2025-05-27 (2): Multilingual Pretraining for Pixel Language Models
Title: Multilingual Pretraining for Pixel Language Models | Mehrsprachiges Vortraining für Pixel-Sprachenmodelle | 多语种像素语言模型的多语种预培训 2505.21265v1 |
Authors: Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
Article 600
Title@2025-05-27 (2): SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
Title: SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning | SoftCoT++: Testzeitskalierung mit Soft Chain-of-Thought-Reasoning | SoftCot++: 带有软思考链原因的测试时间缩放 2505.11484v2 |
Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model’s parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at https://github.com/xuyige/SoftCoT.
Article 601
Title@2025-05-27 (2): ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision
Title: ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision | ReSCORE: Labelfreies iteratives Retriever-Training für Multi-Hop-Fragebeantwortung mit Relevanz-Konsistenz-Überwachung | RESCO:无标签的与相关性-一致性监督多窗口问题解答培训的循环探索性探索性培训 2505.21250v1 |
Authors: Dosung Lee, Wonjun Oh, Boyoung Kim, Minyoung Kim, Joonsuk Park, Paul Hongsuck Seo
Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries (reformulated questions) throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document’s relevance to the question and consistency with the correct answer, and uses them to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval and, in turn, state-of-the-art MHQA performance. Our implementation is available at: https://leeds1219.github.io/ReSCORE.
Article 602
Title@2025-05-27 (2): Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings | Bewertung von LLMs in medizinischen Textzusammenfassungen: Die Rolle der Vokabelanpassung in hohen OOV-Einstellungen | 医学文本摘要:词汇适应在高OOV环境中的作用 2505.21242v1 |
Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation with multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
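The over-fragmentation issue mentioned in this abstract can be made concrete with a toy sketch. It assumes a greedy longest-match subword splitter and tiny hand-written vocabularies; real LLM tokenizers (including Llama-3.1's) work differently, and the names `greedy_subword_split` and `fertility` are hypothetical helpers, not part of the paper's codebase.

```python
def greedy_subword_split(word, vocab):
    """Split a word into pieces by greedy longest-match against vocab.

    Characters that match no vocabulary entry fall back to single-character
    pieces, mimicking how unknown medical terms get shredded.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one char long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def fertility(words, vocab):
    """Average number of subword pieces per word (higher = more fragmented)."""
    return sum(len(greedy_subword_split(w, vocab)) for w in words) / len(words)


# Toy vocabularies: the adapted one adds a single domain-specific word.
base_vocab = {"heart", "cardio", "my", "opathy"}
adapted_vocab = base_vocab | {"cardiomyopathy"}

words = ["heart", "cardiomyopathy"]
print(greedy_subword_split("cardiomyopathy", base_vocab))
print(fertility(words, base_vocab), fertility(words, adapted_vocab))
```

In this toy setting, adding one domain word to the vocabulary lowers the average pieces per word, which is the intuition behind vocabulary adaptation; it says nothing about the size of the effect on a real model.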
Article 603
Title@2025-05-27 (2): LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners
Title: LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners | LMCD: Sprachmodelle sind Nullshot Kognitive Diagnose Lernende | LMCD: 语言模型是零光认知诊断学生 2505.21239v1 |
Authors: Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students’ cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD
Article 604
Title@2025-05-27 (2): RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
Title: RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations | RASMALAI: Ressourcen für adaptive Sprachmodellierung in indischen Sprachen mit Akzenten und Intonationen | RASMAALAI:以印地安语言制作具有感应和感应的适应性演讲模型的资源 2505.18609v2 |
Authors: Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra
We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.
Article 605
Title@2025-05-27 (2): Language Models Surface the Unwritten Code of Science and Society
Title: Language Models Surface the Unwritten Code of Science and Society | Sprachenmodelle stellen den ungeschriebenen Kodex von Wissenschaft und Gesellschaft dar | 《不成文科学与社会守则》 2505.18942v2 |
Authors: Honglin Bao, Siyang Wu, Jiwoong Choi, Yingrong Mao, James A. Evans
This paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how these biases in LLMs can be leveraged to make society’s “unwritten code” - such as implicit stereotypes and heuristics - visible and accessible for critique. We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review - the factors that reviewers care about but rarely state explicitly due to normative scientific expectations. The idea of the framework is to push LLMs to articulate their heuristics by generating self-consistent hypotheses - why one paper appeared stronger in reviewer scoring - among paired papers submitted to 45 computer science conferences, while iteratively searching for deeper hypotheses among the remaining pairs that existing hypotheses cannot explain. We observed that LLMs’ normative priors about the internal characteristics of good science extracted from their self-talk, e.g. theoretical rigor, were systematically updated toward posteriors that emphasize storytelling about external connections, such as how the work is positioned and connected within and across literatures. This shift reveals the primacy of scientific myths about intrinsic properties driving scientific excellence rather than extrinsic contextualization and storytelling that influence conceptions of relevance and significance. Human reviewers tend to explicitly reward aspects that moderately align with LLMs’ normative priors (correlation = 0.49) but avoid articulating contextualization and storytelling posteriors in their review comments (correlation = -0.14), despite giving implicit reward to them with positive scores. We discuss the broad applicability of the framework, leveraging LLMs as diagnostic tools to surface the tacit codes underlying human society, enabling more precisely targeted responsible AI.
Article 606
Title@2025-05-27 (2): GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
Title: GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding | GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding | GALLa:改进源代码理解的通用大语言模型图 2409.04183v2 |
Authors: Ziyin Zhang, Hang Yu, Shijie Li, Peng Di, Jianguo Li, Rui Wang
Programming languages possess rich semantic information - such as data flow - that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Models. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with seven different baseline LLMs ranging in size from 350M to 14B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3 and Qwen2.5-Coder.
nan
Article 607
Title@2025-05-27 (2): PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems
Title: PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems | PSRB: Ein umfassender Benchmark für die Bewertung persischer ASR-Systeme | PSRB:波斯ASR系统评价综合基准 2505.21230v1 |
Authors: Nima Sedghiyeh, Sara Sadeghi, Reza Khodadadi, Farzin Kashani, Omid Aghdaei, Somayeh Rahimi, Mohammad Sadegh Safari
Although Automatic Speech Recognition (ASR) systems have become an integral part of modern technology, their evaluation remains challenging, particularly for low-resource languages such as Persian. This paper introduces the Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark designed to address this gap by incorporating diverse linguistic and acoustic conditions. We evaluate ten ASR systems, including state-of-the-art commercial and open-source models, to examine performance variations and inherent biases. Additionally, we conduct an in-depth analysis of Persian ASR transcriptions, identifying key error types and proposing a novel metric that weights substitution errors. This metric enhances evaluation robustness by reducing the impact of minor and partial errors, thereby improving the precision of performance assessment. Our findings indicate that while ASR models generally perform well on standard Persian, they struggle with regional accents, children’s speech, and specific linguistic challenges. These results highlight the necessity of fine-tuning and incorporating diverse, representative training datasets to mitigate biases and enhance overall ASR performance. PSRB provides a valuable resource for advancing ASR research in Persian and serves as a framework for developing benchmarks in other low-resource languages. A subset of the PSRB dataset is publicly available at https://huggingface.co/datasets/PartAI/PSRB.
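The idea of weighting substitution errors can be illustrated with a small sketch. The abstract does not give the PSRB metric's formula, so everything here is an assumption made for illustration: the function names `substitution_cost` and `weighted_wer` are hypothetical, and the substitution weight is taken to be one minus the character-level similarity of the two words, so that a near-miss costs less than a full word error.

```python
from difflib import SequenceMatcher


def substitution_cost(ref_word, hyp_word):
    """Cost in [0, 1]: 0.0 for identical words, close to 1.0 for unrelated ones.

    Assumption: character-level similarity stands in for whatever weighting
    the benchmark actually uses.
    """
    return 1.0 - SequenceMatcher(None, ref_word, hyp_word).ratio()


def weighted_wer(reference, hypothesis):
    """Word error rate in which substitutions carry a fractional cost."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal weighted edit cost between ref[:i] and hyp[:j]
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = float(i)  # i deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = float(j)  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # deletion
                dp[i][j - 1] + 1.0,  # insertion
                dp[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1]),
            )
    return dp[-1][-1] / max(len(ref), 1)
```

With this weighting, a hypothesis that turns "world" into "word" scores far below the 0.5 that plain WER would assign to a two-word reference, which is the kind of reduced penalty for minor and partial errors that the abstract describes.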
nan
Article 608
Title@2025-05-27 (2): A Representation Level Analysis of NMT Model Robustness to Grammatical Errors
Title: A Representation Level Analysis of NMT Model Robustness to Grammatical Errors | Eine Darstellungsebenenanalyse von NMT-Modell Robustheit zu grammatischen Fehlern | 对NMT模型模型对表面错误的强度代表级别分析 2505.21224v1 |
Authors: Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.
nan
Article 609
Title@2025-05-27 (2): Pretrained LLMs Learn Multiple Types of Uncertainty
Title: Pretrained LLMs Learn Multiple Types of Uncertainty | Pretrained LLMs lernen mehrere Arten von Unsicherheit | 事先培训的LLMs 学习多种不确定性 2505.21218v1 |
Authors: Roi Cohen, Omri Fahn, Gerard de Melo
Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we study how well LLMs capture uncertainty without being explicitly trained to do so. We show that, if uncertainty is treated as a linear concept in the model’s latent space, it may indeed be captured, even after only pretraining. We further show that, though unintuitive, LLMs appear to capture several different types of uncertainty, each of which can be useful to predict the correctness for a specific task or benchmark. Furthermore, we provide in-depth results such as demonstrating a correlation between our correctness prediction and the model’s ability to abstain from misinformation using words, and the lack of impact of model scaling for capturing uncertainty. Finally, we claim that unifying the uncertainty types as a single one using instruction-tuning or [IDK]-token tuning is helpful for the model in terms of correctness prediction.
nan
Article 610
Title@2025-05-27 (2): SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment
Title: SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment | SCIRGC: Multi-Granularitäts-Zitation Empfehlung und Zitation Sentence Preference Alignment | SCIRGC: 多岛屿引文建议和引文句次调整 2505.20103v2 |
Authors: Xiangyu Li, Jingqiang Chen
Citations are crucial in scientific research articles as they highlight the connection between the current study and prior work. However, this process is often time-consuming for researchers. In this study, we propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author’s citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences. We enhance citation recommendation accuracy in the citation article recommendation module by incorporating citation networks and sentiment intent, and generate reasoning-based citation sentences in the citation sentence generation module by using the original article abstract, local context, citation intent, and recommended articles as inputs. Additionally, we propose a new evaluation metric to fairly assess the quality of generated citation sentences. Through comparisons with baseline models and ablation experiments, the SciRGC framework not only improves the accuracy and relevance of citation recommendations but also ensures the appropriateness of the generated citation sentences in context, providing a valuable tool for interdisciplinary researchers.
nan
Article 611
Title@2025-05-27 (2): Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
Title: Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs | Universal Reasoner: Ein einfacher, komponierbarer Plug-and-Play-Reasoner für gefrorene LLMs | 通用理由:冻结长效LMs的单一、可合成插管和布局理由 2505.19075v2 |
Authors: Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
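The additive combination described in this abstract can be sketched over a toy vocabulary. This is not the authors' code: `combine_logits` and `softmax` are hypothetical helper names, the logit values are made up, and the optional `weights` argument is an assumption added for illustration (the abstract only describes plain summation).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(backbone_logits, module_logits_list, weights=None):
    """Add the logits of one or more reasoning modules to the backbone logits.

    weights is a hypothetical per-module scaling; the default of 1.0 for
    every module reproduces plain summation.
    """
    if weights is None:
        weights = [1.0] * len(module_logits_list)
    combined = list(backbone_logits)
    for weight, module_logits in zip(weights, module_logits_list):
        if len(module_logits) != len(combined):
            raise ValueError("all logit vectors must share one vocabulary")
        combined = [c + weight * m for c, m in zip(combined, module_logits)]
    return combined


# Toy example: the module shifts the preferred token from index 0 to index 1.
backbone = [2.0, 1.0, 0.0]
reasoning_module = [0.0, 2.0, 0.0]
next_token_probs = softmax(combine_logits(backbone, [reasoning_module]))
```

Because the combination is a sum, passing several module logit vectors in `module_logits_list` corresponds to the modular composition the abstract mentions; it does require that the modules and the backbone share a vocabulary.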
nan
Article 612
Title@2025-05-27 (2): Voting or Consensus? Decision-Making in Multi-Agent Debate
Title: Voting or Consensus? Decision-Making in Multi-Agent Debate | Abstimmung oder Konsens? Entscheidungsfindung in Multi-Agent-Debatte | 表决还是协商一致?多机构辩论中的决策 2502.19130v2 |
Authors: Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp
Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
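The difference between a voting protocol and a consensus protocol can be made concrete with a minimal sketch. The two functions below are illustrative assumptions rather than the paper's implementation: `majority_vote` accepts an answer backed by more than half of the agents, while `unanimity_consensus` accepts one only when all agents agree. A return value of `None` stands for "no decision yet", in which case a debate framework would typically run another discussion round.

```python
from collections import Counter


def majority_vote(answers):
    """Return the answer backed by more than half of the agents, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > len(answers) / 2 else None


def unanimity_consensus(answers):
    """Return the shared answer only if every agent agrees, else None."""
    if answers and len(set(answers)) == 1:
        return answers[0]
    return None


# Three agents, two of which agree: voting decides, consensus does not.
round_answers = ["A", "A", "B"]
voted = majority_vote(round_answers)           # "A"
agreed = unanimity_consensus(round_answers)    # None
```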
nan
Article 613
Title@2025-05-27 (2): Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM’s Instruction-Following Capabilities
Title: Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM’s Instruction-Following Capabilities | Enthüllen von instruction-spezifischen Neuronen & Experten: Ein analytischer Rahmen für die instruction-following Fähigkeiten von LLM | 具体未完成的指示性具体神经和专家:LLM教学-执行能力分析框架 2505.21191v1 |
Authors: Junyan Zhang, Yubo Gao, Yibo Yan, Jungang Li, Zhaorui Hou, Sicheng Tao, Shuliang Liu, Song Dai, Yonghua Hei, Junzhuo Li, Xuming Hu
The finetuning of Large Language Models (LLMs) has significantly advanced their instruction-following capabilities, yet the underlying computational mechanisms driving these improvements remain poorly understood. This study systematically examines how fine-tuning reconfigures LLM computations by isolating and analyzing instruction-specific sparse components, i.e., neurons in dense models and both neurons and experts in Mixture-of-Experts (MoE) architectures. In particular, we introduce HexaInst, a carefully curated and balanced instructional dataset spanning six distinct categories, and propose SPARCOM, a novel analytical framework comprising three key contributions: (1) a method for identifying these sparse components, (2) an evaluation of their functional generality and uniqueness, and (3) a systematic comparison of their alterations. Through experiments, we demonstrate functional generality, uniqueness, and the critical role of these components in instruction execution. By elucidating the relationship between fine-tuning-induced adaptations and sparse computational substrates, this work provides deeper insights into how LLMs internalize instruction-following behavior for the trustworthy LLM community.
nan
Article 614
Title@2025-05-27 (2): Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
Title: Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation | Lunguage: Ein Benchmark für strukturierte und sequentielle Chest-Röntgen-Interpretation | Lunguage:结构化和顺序式X射线X射线口译基准 2505.21190v1 |
Authors: Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi
Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
nan
Article 615
Title@2025-05-27 (2): Exploring the Latent Capacity of LLMs for One-Step Text Generation
Title: Exploring the Latent Capacity of LLMs for One-Step Text Generation | Erforschung der Latent-Kapazität von LLMs für die einstufige Textgenerierung | 探索单步制文本生成LLMs的原始能力 2505.21189v1 |
Authors: Gleb Mezentsev, Ivan Oseledets
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.
nan
Article 616
Title@2025-05-27 (2): PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
Title: PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | GiftSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | 毒物群:通过示范众包普及有害信息合成 2505.21184v1 |
Authors: Yu Yan, Sheng Sun, Zhifei Zheng, Ziji Hao, Teli Liu, Min Liu
To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.
nan
Article 617
Title@2025-05-27 (2): Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning
Title: Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning | Gehen Sie, bevor Sie laufen! Concise LLM Reasoning via Verstärkung Learning | 走在跑步前! 通过强化学习解密 LLM 教学 2505.21178v1 |
Authors: Mingyang Song, Mao Zheng
As test-time scaling becomes a pivotal research frontier in Large Language Models (LLMs) development, contemporary and advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, in this paper, we propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. Specifically, the first stage, using more training steps, aims to incentivize the model’s reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++), and the second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Significantly, ConciseR only optimizes response length once all rollouts of a sample are correct, following the “walk before you run” principle. Extensive experimental results demonstrate that our ConciseR model, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models with zero RL paradigm across AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
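The "walk before you run" rule can be sketched as a reward function over one group of rollouts. The abstract does not specify the reward used in L-GRPO, so this is a simplified assumption: correctness earns 1.0, and a length bonus of up to 1.0 is added only when every rollout in the group is correct. The name `group_rewards` and the linear length bonus are hypothetical.

```python
def group_rewards(rollouts, max_length):
    """Compute one reward per rollout for a single prompt.

    rollouts: list of (is_correct, num_tokens) pairs.
    max_length: token budget used to normalise the length bonus.
    """
    # Length is only optimised once the whole group answers correctly.
    all_correct = all(is_correct for is_correct, _ in rollouts)
    rewards = []
    for is_correct, num_tokens in rollouts:
        reward = 1.0 if is_correct else 0.0
        if all_correct:
            # Shorter responses earn a larger bonus, capped at the budget.
            reward += 1.0 - min(num_tokens, max_length) / max_length
        rewards.append(reward)
    return rewards


mixed = group_rewards([(True, 100), (False, 50)], max_length=200)
solved = group_rewards([(True, 100), (True, 50)], max_length=200)
```

In the mixed group the rewards depend on correctness alone, so the model is never pushed toward brevity on a prompt it cannot yet solve reliably; in the fully solved group the shorter rollout receives the higher reward.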
nan
Article 618
Title@2025-05-27 (2): TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment
Title: TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment | TAT-R1: Terminologie-Bewusste Übersetzung mit Verstärkungslernen und Wortausrichtung | TAT-R1:用强化学习和字词一致来翻译名词-软件 2505.21172v1 |
Authors: Zheng Li, Mao Zheng, Mingyang Song, Wenjie Yang
Recently, deep reasoning large language models (LLMs) like DeepSeek-R1 have made significant progress in tasks such as mathematics and coding. Inspired by this, several studies have employed reinforcement learning (RL) to enhance models’ deep reasoning capabilities and improve machine translation (MT) quality. However, terminology translation, an essential task in MT, remains unexplored in deep reasoning LLMs. In this paper, we propose TAT-R1, a terminology-aware translation model trained with reinforcement learning and word alignment. Specifically, we first extract the keyword translation pairs using a word alignment model. Then we carefully design three types of rule-based alignment rewards with the extracted alignment relationships. With those alignment rewards, the RL-trained translation model can learn to focus on the accurate translation of key information, including terminology in the source text. Experimental results show the effectiveness of TAT-R1. Our model significantly improves terminology translation accuracy compared to the baseline models while maintaining comparable performance on general translation tasks. In addition, we conduct detailed ablation studies of the DeepSeek-R1-like training paradigm for machine translation and reveal several key findings.
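A rule-based alignment reward of the kind described here could look like the following sketch. The abstract does not define the three reward types, so this shows just one plausible form, stated as an assumption: the fraction of extracted keyword pairs whose target-side term appears in the model's translation. The function name `terminology_reward` and the example pairs are hypothetical.

```python
def terminology_reward(keyword_pairs, translation):
    """Fraction of keyword pairs whose target term occurs in the translation.

    keyword_pairs: list of (source_term, target_term) tuples, assumed to come
    from a word alignment model run on a reference translation.
    """
    if not keyword_pairs:
        return 0.0
    text = translation.lower()
    hits = sum(1 for _, target_term in keyword_pairs if target_term.lower() in text)
    return hits / len(keyword_pairs)


pairs = [("神经网络", "neural network"), ("梯度", "gradient")]
reward = terminology_reward(pairs, "The neural network is trained.")
```

Simple substring matching ignores inflection and word boundaries; a practical reward would likely need lemmatisation or alignment-based matching.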
nan
Article 619
Title@2025-05-27 (2): M-Wanda: Improving One-Shot Pruning for Multilingual LLMs
Title: M-Wanda: Improving One-Shot Pruning for Multilingual LLMs | M-Wanda: Bessere One-Shot Pruning für mehrsprachige LLMs | M-Wanda:改进多语种LLM的单制环流 2505.21171v1 |
Authors: Rochelle Choenni, Ivan Titov
Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
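For context, the Wanda criterion that M-Wanda builds on scores each weight by its magnitude times the L2 norm of the corresponding input activation, and prunes the lowest-scoring weights within each output row. The sketch below shows that base criterion in plain Python. The `multilingual_norms` helper, which averages activation norms over per-language calibration sets, is only an assumption about how language-aware statistics might enter the score; the abstract does not give M-Wanda's actual formula.

```python
import math


def activation_norms(activations):
    """L2 norm of each input feature over a batch of activation rows."""
    n_features = len(activations[0])
    return [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_features)]


def multilingual_norms(per_language_activations):
    """Hypothetical aggregation: average the per-language feature norms."""
    per_language = [activation_norms(acts) for acts in per_language_activations.values()]
    return [sum(values) / len(values) for values in zip(*per_language)]


def prune_row(weights, norms, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest Wanda score."""
    scores = [abs(w) * n for w, n in zip(weights, norms)]
    n_prune = int(len(weights) * sparsity)
    lowest = sorted(range(len(weights)), key=lambda j: scores[j])[:n_prune]
    pruned = set(lowest)
    return [0.0 if j in pruned else w for j, w in enumerate(weights)]


# The large weight -2.0 is pruned because its input feature is rarely active.
row = prune_row([0.5, -2.0, 1.0, 0.1], norms=[4.0, 0.1, 1.0, 1.0], sparsity=0.5)
```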
nan
Article 620
Title@2025-05-27 (2): Leveraging GANs for citation intent classification and its impact on citation network analysis
Title: Leveraging GANs for citation intent classification and its impact on citation network analysis | Nutzung von GANs für die Klassifizierung von Zitierzielen und deren Auswirkungen auf die Analyse von Zitiernetzwerken | 利用GANs利用GANs进行引用意图分类及其对引用网络分析的影响 2505.21162v1 |
Authors: Davi A. Bezerra, Filipi N. Silva, Diego R. Amancio
Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in the intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks. Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined - degree, PageRank, closeness, and betweenness - were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.
nan
Article 621
Title@2025-05-27 (2): Behavioral Analysis of Information Salience in Large Language Models
Title: Behavioral Analysis of Information Salience in Large Language Models | Verhaltensanalyse des Informationsgehalts in großen Sprachmodellen | 对大语言模式信息价值的行为分析 2502.14613v2 |
Authors: Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert
Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.
nan
Article 622
Title@2025-05-27 (2): Assessment of L2 Oral Proficiency using Speech Large Language Models
Title: Assessment of L2 Oral Proficiency using Speech Large Language Models | Bewertung der oralen Sprachkenntnisse von L2 anhand von sprachgroßen Sprachmodellen | 使用语言大语言模式评估L2口语能力 2505.21148v1 |
Authors: Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J. F. Gales
The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.
nan
Article 623
Title@2025-05-27 (2): Adaptive Deep Reasoning: Triggering Deep Thinking When Needed
Title: Adaptive Deep Reasoning: Triggering Deep Thinking When Needed | Adaptive Deep Reasoning: Tief denken auslösen, wenn nötig | 适应性深层理性:需要时触发深思考 2505.20101v2 |
Authors: Yunhao Wang, Yuhao Zhang, Tinghao Yu, Can Xu, Feng Zhang, Fengzong Lian
Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long CoT. In this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip it with both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model’s initial token choice, thereby guiding the selection of the reasoning type. Evaluations on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.
nan
Article 624
Title@2025-05-27 (2): Hallucinations are inevitable but can be made statistically negligible. The “innate” inevitability of hallucinations cannot explain practical LLM issues
Title: Hallucinations are inevitable but can be made statistically negligible. The “innate” inevitability of hallucinations cannot explain practical LLM issues | Halluzinationen sind unvermeidlich, können aber statistisch vernachlässigbar gemacht werden. Die “angeborene” Unvermeidbarkeit von Halluzinationen kann praktische LLM-Probleme nicht erklären | 幻觉的“内在”不可避免性无法解释实际的LLM问题。 2502.12187v2 |
Authors: Atsushi Suzuki, Yulan He, Feng Tian, Zhongyuan Wang
Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, recent studies established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance from a practical viewpoint has remained unclear. This paper claims that those “innate” inevitability results from computability theory and diagonal arguments, in principle, cannot explain practical issues of LLMs. We demonstrate this claim by presenting a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.
nan
Article 625
Title@2025-05-27 (2): Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis
Title: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis | Leveraging LLM und selbstüberwachte Trainingsmodelle für die Spracherkennung in chinesischen Dialekten: Eine vergleichende Analyse | 利用LLM和中国语语音识别自驾培训模式:比较分析 2505.21138v1 |
Authors: Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei
Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.
Article 626
Title@2025-05-27 (2): Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction
Title: Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction | Scaling und Prompting für eine verbesserte Korrektur von End-to-End-Spoken-grammatischen Fehlern | 缩放和提示改进端至端口语语语法错误校正 2505.21137v1 |
Authors: Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J. F. Gales
Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.
Article 627
Title@2025-05-27 (2): Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling
Title: Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling | Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung | 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v3 |
Authors: Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu
Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000 times the training length.
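The chunk-and-retrieve step described above can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity over fixed chunk vectors is an assumed stand-in for the retriever that GCA learns end-to-end, and the names `split_into_chunks`, `cosine`, and `retrieve_past_chunks` are hypothetical.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_past_chunks(chunk_vectors, current_index, k):
    """Return indices of the k past chunks most similar to the current chunk.

    Only chunks strictly before current_index are candidates, which keeps
    retrieval causal. Assumption: similarity is plain cosine similarity,
    whereas the paper trains the retriever jointly with the language model.
    """
    query = chunk_vectors[current_index]
    scored = [(cosine(query, chunk_vectors[j]), j) for j in range(current_index)]
    # Highest similarity first; ties are broken by the earlier chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]
```

Because only `k` retrieved chunks are attended to for each current chunk, the attention window in this sketch stays fixed regardless of how long the full sequence is.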
Article 628
Title@2025-05-27 (2): Creativity in LLM-based Multi-Agent Systems: A Survey
Title: Creativity in LLM-based Multi-Agent Systems: A Survey | Kreativität in LLM-basierten Multi-Agent-Systemen: Eine Umfrage | 以LLM为基础的多种机构系统中的创造性:调查 2505.21116v1 |
Authors: Yi-Cheng Lin, Kang-Chieh Chen, Zhe-Yan Li, Tzu-Heng Wu, Tzu-Hsuan Wu, Kuan-Yu Chen, Hung-yi Lee, Yun-Nung Chen
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.
Article 629
Title@2025-05-27 (2): Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Title: Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA | Wird es morgen noch wahr sein? Mehrsprachige Evergreen-Frageklassifikation zur Verbesserung des Vertrauenswürdigen QA | 提高可信赖的质量保证的多语种长青问题分类 2505.21115v1 |
Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions – whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
Article 630
Title@2025-05-27 (2): Does quantization affect models’ performance on long-context tasks?
Title: Does quantization affect models’ performance on long-context tasks? | Beeinflusst die Quantisierung die Performance von Modellen bei langen Kontextaufgaben? | 量化是否影响模型在长期任务方面的绩效? 2505.20276v2 |
Authors: Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, Mohit Iyyer
Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
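As a rough intuition for why 4-bit methods can lose more than 8-bit ones, the sketch below compares the rounding error of a simple symmetric round-to-nearest quantizer at both bit widths. This is an assumed toy scheme, not FP8, GPTQ, AWQ, or BNB-nf4, and the helper names are hypothetical; it says nothing about downstream task accuracy.

```python
def quantize_dequantize(values, bits):
    """Symmetric round-to-nearest quantization followed by dequantization.

    Assumption: one scale per list of values, chosen from the largest
    absolute value. Real methods use finer-grained or calibrated schemes.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # clamp to the integer grid
        restored.append(q * scale)
    return restored


def mean_abs_error(values, bits):
    """Mean absolute difference between the values and their quantized copies."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Toy weights, invented for illustration only.
weights = [0.013, -0.52, 0.27, 0.9, -0.08, 0.44]
error_8bit = mean_abs_error(weights, 8)
error_4bit = mean_abs_error(weights, 4)
```

On these toy weights the 4-bit grid has 15 levels against 255 for 8 bits, so `error_4bit` comes out larger than `error_8bit`.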
Article 631
Title@2025-05-27 (2): A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction
Title: A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction | Ein leichtes Multi-Expert Generatives Sprachmodellsystem für Engineering Information and Knowledge Extraction | 工程信息和知识采掘轻量多专家生成语言示范系统 2505.21109v1 |
Authors: Bogdan Bogachov, Yaoyao Fiona Zhao
Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study have shown that SLG was able to surpass conventional fine-tuning methods on the Exact Match metric by 3 times. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially diverting the global need for expensive centralized compute clusters.
Article 632
Title@2025-05-27 (2): Thinker: Learning to Think Fast and Slow
Title: Thinker: Learning to Think Fast and Slow | Denker: Schnell und langsam denken lernen | 思考者:学会快速和缓慢思考 2505.21097v1 |
Authors: Stephen Chung, Wenyu Du, Jie Fu
Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.
Article 633
Title@2025-05-27 (2): BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
Title: BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge | BLUCK: Ein Benchmark-Datensatz für Bengalische Sprachkenntnisse und kulturelles Wissen | BLUK:孟加拉语言理解和文化知识基准数据集 2505.21092v1 |
Authors: Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat, Adnan Sadik, Arian Ahmed, Eunsu Kim, Alice Oh
In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job-level examinations and spans 23 categories covering knowledge on Bangladesh’s culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs’ performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali’s status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.
Article 634
Title@2025-05-27 (2): Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
Title: Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) | Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) | 位置是电源:系统提示作为大语言模型比阿语机制(LLMs) 2505.21091v1 |
Authors: Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh
System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others’ additions, and this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative and potentially other biases and downstream harms beyond the user’s ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.
Article 635
Title@2025-05-27 (2): Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch
Title: Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch | Lösende LLM-Vernunftfähigkeit durch skalierbare Fragesynthese von Scratch | 从 Scratch 通过可缩放问题合成解排 LLM 解排功能性LLM 2410.18693v2 |
Authors: Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Zhaopeng Tu, Qiaoming Zhu, Min Zhang
Improving the mathematical reasoning capabilities of Large Language Models (LLMs) is critical for advancing artificial intelligence. However, access to extensive, diverse, and high-quality reasoning datasets remains a significant challenge, particularly for the open-source community. In this paper, we propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method that enables the generation of large-scale mathematical reasoning datasets using lightweight 7B-scale models. ScaleQuest introduces a two-stage question-tuning process comprising Question Fine-Tuning (QFT) and Question Preference Optimization (QPO) to unlock the question generation capabilities of problem-solving models. By generating diverse questions from scratch – without relying on powerful proprietary models or seed data – we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform those trained on existing open-source datasets in both in-domain and out-of-domain evaluations. Furthermore, our approach shows continued performance improvement as the volume of training data increases, highlighting its potential for ongoing data scaling. The extensive improvements observed in code reasoning tasks demonstrate the generalization capabilities of our proposed method. Our work provides the open-source community with a practical solution to enhance the mathematical reasoning abilities of LLMs.
Article 636
Title@2025-05-27 (2): Predicting Implicit Arguments in Procedural Video Instructions
Title: Predicting Implicit Arguments in Procedural Video Instructions | Implizite Argumente in verfahrenstechnischen Video-Anweisungen voraussagen | 程序性录像教学中预测隐含的论据 2505.21068v1 |
Authors: Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like {verb,what,where/with}. Procedural instructions are highly elliptic: for instance, given (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step’s where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models’ contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and 14.7% for where/with-implicit semantic roles over GPT-4o.
Article 637
Title@2025-05-27 (2): Plan2Align: Predictive Planning Based Test-Time Preference Alignment for Large Language Models
Title: Plan2Align: Predictive Planning Based Test-Time Preference Alignment for Large Language Models | Plan2Align: Predictive Planning Based Test-Time Preference Alignment für große Sprachmodelle | 计划2对等:以预测规划为基础的大语言模型试验时间首选比对齐 2502.20795v2 |
Authors: Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Aligning Large Language Models with Preference Fine-Tuning is often resource-intensive. Test-time alignment techniques that do not modify the underlying models, such as prompting and guided decoding, offer a lightweight alternative. However, existing test-time alignment methods primarily improve short responses and fail to ensure coherence over extended contexts due to the myopic nature of token-level alignment. Moreover, these methods often incur a slowdown during inference. To address these challenges, we propose Plan2Align, a test-time alignment framework that formulates text generation as a predictive planning problem. Plan2Align adapts Model Predictive Control (MPC) to iteratively refine output by rolling out multiple complete responses and optimizing each segment. To more rigorously evaluate the effectiveness and efficiency, we focus on the more challenging task of long-text generation. Experiments on the long-form response subset of the HH-RLHF dataset and the WMT’24 Discourse-Level Literary Translation demonstrate that Plan2Align significantly enhances the performance of base LLMs. Compared to existing training-time and test-time alignment methods on LLaMA-3.1 8B, Plan2Align achieves comparable or superior results, while also delivering improved inference efficiency relative to prior test-time alignment approaches.
Article 638
Title@2025-05-27 (2): Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
Title: Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction | Visuelle Queues verbessern vorausschauende Wende-Taking für zwei-Partei menschliche Interaktion | 提高两党人互动的预测转向 2505.21043v1 |
Authors: Sam O’Connor Russell, Naomi Harte
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
Article 639
Title@2025-05-27 (2): How Private are Language Models in Abstractive Summarization?
Title: How Private are Language Models in Abstractive Summarization? | Wie privat sind Sprachmodelle in abstrakter Zusammenfassung? | 私人语言模式在抽象总结中如何? 2412.12040v2 |
Authors: Anthony Hughes, Ning Ma, Nikolaos Aletras
In sensitive domains such as medical and legal, protecting sensitive information is critical, with protective laws strictly prohibiting the disclosure of personal data. This poses challenges for sharing valuable data such as medical reports and legal case summaries. While language models (LMs) have shown strong performance in text summarization, it is still an open question to what extent they can provide privacy-preserving summaries from non-private source documents. In this paper, we perform a comprehensive study of privacy risks in LM-based summarization across two closed- and four open-weight models of different sizes and families. We experiment with both prompting and fine-tuning strategies for privacy-preservation across a range of summarization datasets including medical and legal domains. Our quantitative and qualitative analysis, including human evaluation, shows that LMs frequently leak personally identifiable information in their summaries, in contrast to human-generated privacy-preserving summaries, which demonstrate significantly higher privacy protection levels. These findings highlight a substantial gap between current LM capabilities and human expert performance in privacy-sensitive summarization tasks.
Article 640
Title@2025-05-27 (2): Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models
Title: Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models | Debate-to-Detect: Neuformulieren von Fehlinformationserkennung als Real-World-Debatte mit großen Sprachmodellen | 辩论至检测:重拟错误信息探测作为有大语言模式的现实世界辩论 2505.18596v2 |
Authors: Chen Han, Wenzhen Zheng, Xijin Tang
The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D’s capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.
Article 641
Title@2025-05-27 (2): Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models
Title: Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models | Optimierung des Case-Based-Reasoning-Systems für die Generierung funktionaler Testskripte mit großen Sprachmodellen | 为具有大语言模型的功能测试脚本生成优化基于个案的理由说明系统 2503.20576v3 |
Authors: Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, Jun Wang
In this work, we explore the potential of large language models (LLMs) for generating functional test scripts, which necessitates understanding the dynamically evolving code structure of the target software. To achieve this, we propose a case-based reasoning (CBR) system utilizing a 4R cycle (i.e., retrieve, reuse, revise, and retain), which maintains and leverages a case bank of test intent descriptions and corresponding test scripts to facilitate LLMs for test script generation. To improve user experience further, we introduce Re4, an optimization method for the CBR system, comprising reranking-based retrieval finetuning and reinforced reuse finetuning. Specifically, we first identify positive examples with high semantic and script similarity, providing reliable pseudo-labels for finetuning the retriever model without costly labeling. Then, we apply supervised finetuning, followed by a reinforcement learning finetuning stage, to align LLMs with our production scenarios, ensuring the faithful reuse of retrieved cases. Extensive experimental results on two product development units from Huawei Datacom demonstrate the superiority of the proposed CBR+Re4. Notably, we also show that the proposed Re4 method can help alleviate the repetitive generation issues with LLMs.
Article 642
Title@2025-05-27 (2): Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation
Title: Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation | Def-DTS: Deduktive Begründung für Open-Domain Dialog Themensegmentierung | Def-DTS: 公开对话的削减理由 2505.21033v1 |
Authors: Seungmin Lee, Yongsang Yoo, Minhwa Jung, Min Song
Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and incremental complexity of recently proposed solutions. Meanwhile, despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case studies using intermediate results. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose a generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 error. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.
Article 643
Title@2025-05-27 (2): Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers
Title: Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers | Pause Tokens erhöhen streng die Expressivität der konstant-tiefen Transformer | 严格提高常数面变换器的表达性 2505.21024v1 |
Authors: Charles London, Varun Kanade
Pause tokens, simple filler symbols such as “…”, consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.
Article 644
Title@2025-05-27 (2): Can Community Notes Replace Professional Fact-Checkers?
Title: Can Community Notes Replace Professional Fact-Checkers? | Können Community Notes professionelle Fact-Checker ersetzen? | 社区说明能否取代专业实况调查人? 2502.14132v2 |
Authors: Nadav Borenstein, Greta Warren, Desmond Elliott, Isabelle Augenstein
Two commonly employed strategies to combat the rise of misinformation on social media are (i) fact-checking by professional organisations and (ii) community moderation by platform users. Policy changes by Twitter/X and, more recently, Meta, signal a shift away from partnerships with fact-checking organisations and towards an increased reliance on crowdsourced community notes. However, the extent and nature of dependencies between fact-checking and helpful community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. Our analysis reveals that community notes cite fact-checking sources up to five times more than previously reported. Fact-checking is especially crucial for notes on posts linked to broader narratives, which are twice as likely to reference fact-checking sources compared to other sources. Our results show that successful community moderation relies on professional fact-checking and highlight how citizen and professional fact-checking are deeply intertwined.
Article 645
Title@2025-05-27 (2): LLMs are Frequency Pattern Learners in Natural Language Inference
Title: LLMs are Frequency Pattern Learners in Natural Language Inference | LLMs sind Frequency Pattern Learners in Natural Language Inferenz | LLMs是自然语言推断的频率模式学习者。 2505.21011v1 |
Authors: Liang Cheng, Zhaowei Wang, Mark Steedman
While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments to investigate what LLMs actually learn during fine-tuning. We begin by analyzing predicate frequencies in premises and hypotheses across NLI datasets and identify a consistent frequency bias, where predicates in hypotheses occur more frequently than those in premises for positive instances. To assess the impact of this bias, we evaluate both standard and NLI fine-tuned LLMs on bias-consistent and bias-adversarial cases. We find that LLMs exploit frequency bias for inference and perform poorly on adversarial instances. Furthermore, fine-tuned LLMs exhibit significantly increased reliance on this bias, suggesting that they are learning these frequency patterns from datasets. Finally, we compute the frequencies of hyponyms and their corresponding hypernyms from WordNet, revealing a correlation between frequency bias and textual entailment. These findings help explain why learning frequency patterns can enhance model performance on inference tasks.
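The bias-consistent versus bias-adversarial split can be sketched as follows. This is a simplified illustration: it assumes predicates have already been extracted, the function names are hypothetical, and the treatment of negative instances (mirroring the rule for positive ones) is an assumption rather than the authors' stated procedure.

```python
from collections import Counter
from typing import Iterable


def predicate_frequencies(predicates: Iterable[str]) -> Counter:
    """Count occurrences of each (pre-extracted) predicate in a reference corpus."""
    return Counter(predicates)


def case_type(premise_pred: str, hypothesis_pred: str, entails: bool, freq: Counter) -> str:
    """Label an NLI instance by whether the frequency cue agrees with the gold label."""
    hyp_more_frequent = freq[hypothesis_pred] > freq[premise_pred]
    # Assumption: the cue "hypothesis predicate is more frequent" predicts entailment.
    return "bias-consistent" if hyp_more_frequent == entails else "bias-adversarial"


freq = predicate_frequencies(["buy", "own", "own", "own", "purchase"])
consistent = case_type("buy", "own", entails=True, freq=freq)
adversarial = case_type("own", "buy", entails=True, freq=freq)
```

A model that relies on the frequency cue would be expected to do well on the first kind of instance and poorly on the second.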
Article 646
Title@2025-05-27 (2): Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods
Title: Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods | Kompromisse zwischen Ausrichtung und Hilfsbereitschaft in Sprachmodellen mit Lenkungsmethoden | 使用指导方法的语文模式的平衡兼顾和利弊取舍 2401.16332v5 |
Authors: Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Language model alignment has become an important component of AI safety, allowing safe interactions between humans and language models, by enhancing desired behaviors and inhibiting undesired ones. It is often done by tuning the model or inserting preset aligning prompts. Recently, representation engineering, a method which alters the model’s behavior via changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. We propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. First, we find that under the conditions of our framework, alignment can be guaranteed with representation engineering, and at the same time that helpfulness is harmed in the process. Second, we show that helpfulness is harmed quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries to the usefulness of representation engineering for alignment.
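The linear-versus-quadratic scaling described above can be written as a small worked relation. This is a sketch only: the symbol $r$ for the norm of the steering vector and the positive constants $a$ and $b$ are illustrative assumptions, not quantities taken from the paper.

```latex
\Delta_{\text{align}}(r) \approx a\,r,
\qquad
\Delta_{\text{help}}(r) \approx -b\,r^{2},
\qquad
\frac{\lvert \Delta_{\text{help}}(r) \rvert}{\Delta_{\text{align}}(r)} \approx \frac{b}{a}\,r .
```

Under these assumptions, the helpfulness cost per unit of alignment gain shrinks as $r \to 0$, which is one way to read the regime in which small steering vectors are efficient.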
Article 647
Title@2025-05-27 (2): Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?
Title: Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models? | Ungewissheit unverhüllt: Kann die Exposition gegenüber mehr In-Kontext-Beispielen Ungewissheit bei großen Sprachmodellen erhöhen? | 不确定性未消除:接触更多内置实例能减轻大语言模型的不确定性吗? 2505.21003v1 |
Authors: Yifei Wang, Yu Sheng, Linjing Li, Daniel Zeng
Recent advances in handling long sequences have facilitated the exploration of long-context in-context learning (ICL). While much of the existing research emphasizes performance improvements driven by additional in-context examples, the influence on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty, an essential aspect in trustworthiness. We begin by systematically quantifying the uncertainty of ICL with varying shot counts, analyzing the impact of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, with a focus on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we explore the evolution of internal confidence across layers, unveiling the mechanisms driving the reduction in uncertainty.
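One common way to decompose predictive uncertainty is the entropy-based split sketched below, where total uncertainty is the entropy of the averaged prediction and epistemic uncertainty is what remains after subtracting the average per-sample entropy. Whether the paper uses exactly this decomposition is an assumption; the function names and the idea of obtaining several predictions (for example from different demonstration sets) are illustrative.

```python
import math
from typing import List, Tuple


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(preds: List[List[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for a list of predictive distributions."""
    n, k = len(preds), len(preds[0])
    mean = [sum(p[i] for p in preds) / n for i in range(k)]
    total = entropy(mean)                            # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in preds) / n   # average per-sample entropy
    return total, aleatoric, total - aleatoric       # remainder is the epistemic part


agree = decompose([[0.5, 0.5], [0.5, 0.5]])
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

When all predictions agree, the epistemic term is zero even if each prediction is itself uncertain; when confident predictions disagree, all of the uncertainty is epistemic.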
Article 648
Title@2025-05-27 (2): RvLLM: LLM Runtime Verification with Domain Knowledge
Title: RvLLM: LLM Runtime Verification with Domain Knowledge | RvLLM: LLM Laufzeitverifizierung mit Domänenwissen | RvLLM: LLM 使用域知识运行时间校验 2505.18585v2 |
Authors: Yedi Zhang, Sun Yi Emma, Annabelle Lee Jia En, Jin Song Dong
Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against the Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
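As a rough illustration of runtime verification for the numerical comparison task, the sketch below checks a comparison claim extracted from model output against exact decimal arithmetic. It is written in plain Python and does not use or imitate ESL syntax; the function name and the accepted claim format are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional

# Accepts claims of the form "<number> <op> <number>" with op in {<, >, =}.
_CLAIM = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*(<|>|=)\s*(-?\d+(?:\.\d+)?)\s*")


def check_comparison_claim(text: str) -> Optional[bool]:
    """Return True/False if the claim is a checkable comparison, else None."""
    m = _CLAIM.fullmatch(text)
    if m is None:
        return None  # not a claim this checker understands
    a, op, b = Decimal(m.group(1)), m.group(2), Decimal(m.group(3))
    return {"<": a < b, ">": a > b, "=": a == b}[op]


verdict = check_comparison_claim("9.11 > 9.9")
```

Here `verdict` flags the claim as false, the kind of low-level error a domain-specific check can catch regardless of which model produced the output.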
Article 649
Title@2025-05-27 (2): Articulatory strategy in vowel production as a basis for speaker discrimination
Title: Articulatory strategy in vowel production as a basis for speaker discrimination | Artikulatorische Strategie in der Vokalproduktion als Grundlage für die Diskriminierung von Sprechern | 元音制作的交替战略,作为议长歧视的基础 2505.20995v1 |
Authors: Justin J. H. Lo, Patrycja Strycharczuk, Sam Kirkham
The way speakers articulate is well known to be variable across individuals while at the same time subject to anatomical and biomechanical constraints. In this study, we ask whether articulatory strategy in vowel production can be sufficiently speaker-specific to form the basis for speaker discrimination. We conducted Generalised Procrustes Analyses of tongue shape data from 40 English speakers from the North West of England, and assessed the speaker-discriminatory potential of orthogonal tongue shape features within the framework of likelihood ratios. Tongue size emerged as the individual dimension with the strongest discriminatory power, while tongue shape variation in the more anterior part of the tongue generally outperformed tongue shape variation in the posterior part. When considered in combination, shape-only information may offer comparable levels of speaker specificity to size-and-shape information, but only when features do not exhibit speaker-level co-variation.
Article 650
Title@2025-05-27 (2): Who Reasons in the Large Language Models?
Title: Who Reasons in the Large Language Models? | Wer begründet in den großen Sprachmodellen? | 大语言模型中谁的理由? 2505.20993v1 |
Authors: Jie Shao, Jianxin Wu
Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities, such as mathematical reasoning, remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (oproj) in the Transformer’s multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that oproj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.
Article 651
Title@2025-05-27 (2): LLMs with Industrial Lens: Deciphering the Challenges and Prospects – A Survey
Title: LLMs with Industrial Lens: Deciphering the Challenges and Prospects – A Survey | LLMs mit Industrieobjektiv: Die Herausforderungen und Aussichten bestimmen – Eine Umfrage | 与工业镜头的LLM:挑战与前景的解析 – – 调查 2402.14558v2 |
Authors: Ashok Urlana, Charaka Vinayak Kumar, Ajeet Kumar Singh, Bala Mallikarjunarao Garlapati, Srinivasa Rao Chalamala, Rahul Mishra
Large language models (LLMs) have become the secret ingredient driving numerous industrial applications, showcasing their remarkable versatility across a diverse spectrum of tasks. From natural language processing and sentiment analysis to content generation and personalized recommendations, their unparalleled adaptability has facilitated widespread adoption across industries. This transformative shift driven by LLMs underscores the need to explore the underlying associated challenges and avenues for enhancement in their utilization. In this paper, our objective is to unravel and evaluate the obstacles and opportunities inherent in leveraging LLMs within an industrial context. To this end, we conduct a survey involving a group of industry practitioners, develop four research questions derived from the insights gathered, and examine 68 industry papers to address these questions and derive meaningful conclusions. We maintain a GitHub repository with the most recent papers in the field.
Article 652
Title@2025-05-27 (2): RefAV: Towards Planning-Centric Scenario Mining
Title: RefAV: Towards Planning-Centric Scenario Mining | RefAV: Auf dem Weg zum planerisch-zentralen Szenario Bergbau | RefAV: 走向规划中心情景采矿 2505.20981v1 |
Authors: Cainan Davidson, Deva Ramanan, Neehar Peri
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
Article 653
Title@2025-05-27 (2): Evaluating and Steering Modality Preferences in Multimodal Large Language Model
Title: Evaluating and Steering Modality Preferences in Multimodal Large Language Model | Bewertung und Steuerung von Modalitätseinstellungen im multimodalen Large Language Model | 评价和指导多式大语言模式模式模式模式模式模式模式的优惠 2505.20977v1 |
Authors: Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build an MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
Article 654
Title@2025-05-27 (2): Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing
Title: Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing | Kontrastives Lernen auf LLM Back Generation Treebank für Cross-Domain-Konstituenz Parsing | 在LLM 后一代植树库进行反向学习 2505.20976v1 |
Authors: Peiming Guo, Meishan Zhang, Jianling Li, Min Zhang, Yue Zhang
Cross-domain constituency parsing is still an unsolved challenge in computational linguistics since the available multi-domain constituency treebank is limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. The performance of LLMs on constituency parsing is poor; therefore, we propose a novel treebank generation method, LLM back generation, which is similar to the reverse process of constituency parsing. LLM back generation takes the incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills the missing words to generate the cross-domain constituency treebank. We also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
Article 655
Title@2025-05-27 (2): Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA
Title: Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA | Reason-Align-Respond: LLM-Reasoning mit Wissensgraphen für KGQA ausrichten | 合理对称:KGQA以知识图表对称LLM 2505.20971v1 |
Authors: Xiangqing Shen, Fanfan Wang, Rui Xia
LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit@1 scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
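For readers unfamiliar with the Hit@1 metric reported above, the sketch below shows one standard way to compute it: the fraction of questions whose top-ranked prediction is among the gold answers. The function name and data layout are illustrative assumptions, and the toy inputs are made up for the example, not drawn from WebQSP or CWQ.

```python
from typing import List, Set


def hit_at_1(ranked_predictions: List[List[str]], gold_answers: List[Set[str]]) -> float:
    """Fraction of questions whose first-ranked prediction is a gold answer."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align one-to-one")
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold  # an empty prediction list counts as a miss
    )
    return hits / len(gold_answers)


# Toy data for illustration only.
score = hit_at_1(
    [["Paris", "Lyon"], ["Berlin"], []],
    [{"Paris"}, {"Munich"}, {"Rome"}],
)
```

Only the first-ranked answer matters for this metric, so a correct answer in second position does not count as a hit.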
Article 656
Title@2025-05-27 (2): Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation
Title: Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation | Personalisierte Abfrage Auto-Completion für langfristige und kurzfristige Interessen mit adaptiver Entgiftung Generation | 适应性戒毒一代的长期和短期利益个人自问自动完成 2505.20966v1 |
Authors: Zhibo Wang, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li
Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users’ search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. Going a step further, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement for our product in nearly two years. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at https://github.com/JXZe/LaD.
Article 657
Title@2025-05-27 (2): HalluCounter: Reference-free LLM Hallucination Detection in the Wild!
Title: HalluCounter: Reference-free LLM Hallucination Detection in the Wild! | HalluCounter: Reference-free LLM Halluzination Detection in the Wild! | 万圣节:无参考的LLM 幻觉探测在野外! 2503.04615v2 |
Authors: Ashok Urlana, Gopichand Kanumolu, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, Rahul Mishra
Response consistency-based, reference-free hallucination detection (RFHD) methods do not depend on internal model states, such as generation probabilities or gradients, which grey-box methods typically rely on but which are inaccessible in closed-source LLMs. However, their inability to capture query-response alignment patterns often results in lower detection accuracy. Additionally, the lack of large-scale benchmark datasets spanning diverse domains remains a challenge, as most existing datasets are limited in size and scope. To this end, we propose HalluCounter, a novel reference-free hallucination detection method that utilizes both response-response and query-response consistency and alignment patterns. This enables the training of a classifier that detects hallucinations and provides a confidence score and an optimal response for user queries. Furthermore, we introduce HalluCounterEval, a benchmark dataset comprising both synthetically generated and human-curated samples across multiple domains. Our method outperforms state-of-the-art approaches by a significant margin, achieving over 90% average confidence in hallucination detection across datasets.
Article 658
Title@2025-05-27 (2): Context-Aware Content Moderation for German Newspaper Comments
Title: Context-Aware Content Moderation for German Newspaper Comments | Context-Aware Content Moderation für die deutsche Zeitung Kommentare | 德国报纸评论的背景资料内容调控 2505.20963v1 |
Authors: Felix Krejca, Tobias Kietreiber, Alexander Buchelt, Sebastian Neumaier
The increasing volume of online discussions requires advanced automatic content moderation to maintain responsible discourse. While hate speech detection on social media is well-studied, research on German-language newspaper forums remains limited. Existing studies often neglect platform-specific context, such as user history and article themes. This paper addresses this gap by developing and evaluating binary classification models for automatic content moderation in German newspaper forums, incorporating contextual information. Using LSTM, CNN, and ChatGPT-3.5 Turbo, and leveraging the One Million Posts Corpus from the Austrian newspaper Der Standard, we assess the impact of context-aware models. Results show that CNN and LSTM models benefit from contextual information and perform competitively with state-of-the-art approaches. In contrast, ChatGPT’s zero-shot classification does not improve with added context and underperforms.
Article 659
Title@2025-05-27 (2): Research Community Perspectives on “Intelligence” and Large Language Models
Title: Research Community Perspectives on “Intelligence” and Large Language Models | Forschungsgemeinschaftsperspektiven zu “Intelligenz” und großen Sprachmodellen | 关于“情报”和大语言模式的社区研究观点 2505.20959v1 |
Authors: Bertram Højer, Terne Sasha Thorn Jakobsen, Anna Rogers, Stefan Heinrich
Despite the widespread use of “artificial intelligence” (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by “intelligence”. To that end, we present the results of a survey on the notion of “intelligence” among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, and reasoning. Our results suggest that the perception of the current NLP systems as “intelligent” is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.
Article 660
Title@2025-05-27 (2): More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
Title: More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives | Mehr ist nicht immer besser? Viel-Shot-In-Context-Lernen mit differenzierten und neugewichtigen Zielen verbessern | 越多越好,越多越好?用差异化和再加权目标,加强多热化的内流学习 2501.04070v3 |
Authors: Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL (https://github.com/xiaoqzhwhu/DrICL).
Article 661
Title@2025-05-27 (2): QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
Title: QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization | QwenLong-CPRS: Auf dem Weg zu $\infty$-LLMs mit dynamischer Kontextoptimierung | 20Long-CPRS:争取以动态环境优化实现美元/美元-LLMs 2505.18092v2 |
Authors: Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the “lost in the middle” performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS’s threefold effectiveness: (1) consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59$\times$ context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.
Article 662
Title@2025-05-27 (2): QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
Title: QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning | QwenLong-L1: Auf dem Weg zu einem langen Kontext Große Vernunftmodelle mit Stärkungslernen | QuwenLong-L1:寻求具有强化学习作用的长期大型理由模型 2505.17667v2 |
Authors: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.
Article 663
Title@2025-05-27 (2): Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training
Title: Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training | Zwei Experten sind alles, was Sie zum Lenken Denken brauchen: Kognitive Bemühungen in MoE-Reasoning-Modellen ohne zusätzliches Training verstärken | 两位专家是指导思考所需要的两个专家:在没有额外培训的情况下加强教育部理由说明模式中的认知努力 2505.14681v2 |
Authors: Mengru Wang, Xingyu Chen, Yue Wang, Zhiwei He, Jiahao Xu, Tian Liang, Qiuzhi Liu, Yunzhi Yao, Wenxuan Wang, Ruotian Ma, Haitao Mi, Ningyu Zhang, Zhaopeng Tu, Xiaolong Li, Dong Yu
Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed ‘‘cognitive experts’’ that orchestrate meta-level reasoning operations characterized by tokens like ‘‘
Article 664
Title@2025-05-27 (2): Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles
Title: Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles | Conversational Code Generation: eine Fallstudie zur Konzeption eines Dialogsystems zur Generierung von Fahrszenarien für die Prüfung autonomer Fahrzeuge | 相互交流的代码生成:设计一个对话系统,为测试自用车辆创造驱动情景的对话系统案例研究 2410.09829v2 |
Authors: Rimvydas Rubavicius, Antonio Valerio Miceli-Barone, Alex Lascarides, Subramanian Ramamoorthy
Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than a generation without engaging in extended conversation.
Article 665
Title@2025-05-27 (2): On VLMs for Diverse Tasks in Multimodal Meme Classification
Title: On VLMs for Diverse Tasks in Multimodal Meme Classification | Auf VLMs für vielfältige Aufgaben in der multimodalen Meme-Klassifikation | 关于多式气象分类中多种任务VLMs 2505.20937v1 |
Authors: Deepesh Gavit, Debajyoti Mazumder, Samiran Das, Jasabanta Patro
In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on the textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach where detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. Combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.
Article 666
Title@2025-05-27 (2): EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Title: EPIC: Efficient Position-Independent Caching for Serving Large Language Models | EPIC: Effizientes positionsunabhängiges Caching für das Servieren großer Sprachmodelle | EPIC: 高效的、独立定位的为大语言模式服务的工作 2410.15332v3 |
Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate “attention sink” effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.
Article 667
Title@2025-05-27 (2): Information-Theoretic Complementary Prompts for Improved Continual Text Classification
Title: Information-Theoretic Complementary Prompts for Improved Continual Text Classification | Information-Theoretische Ergänzungsprompte für eine verbesserte fortlaufende Textklassifikation | 改进持续性文本分类信息理论补充提示 2505.20933v1 |
Authors: Duzhen Zhang, Yong Ren, Chenxing Li, Dong Yu, Tielin Zhang
Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems – the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences – we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.
Article 668
Title@2025-05-27 (2): Multi-objective Large Language Model Alignment with Hierarchical Experts
Title: Multi-objective Large Language Model Alignment with Hierarchical Experts | Multi-objektive großsprachige Modellausrichtung mit Hierarchischen Experten | 多目标大语言多目标模式,与等级专家相配合 2505.20925v1 |
Authors: Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.
Article 669
Title@2025-05-27 (2): “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models
Title: “Oh LLM, I’m Asking Thee, Please Give Me a Decision Tree”: Zero-Shot Decision Tree Induction and Embedding with Large Language Models | “Oh LLM, ich frage dich, bitte gib mir einen Entscheidungsbaum”: Nullschnelle Entscheidungsbauminduktion und Einbettung mit großen Sprachmodellen | “哦,LLM,我问你,请给我一棵决定树”: “零热决定树上演和嵌入大语言模型” 2409.18594v2 |
Authors: Ricardo Knauer, Mario Koddenbrock, Raphael Wallsberger, Nicholas M. Brisson, Georg N. Duda, Deborah Falla, David W. Evans, Erik Rodner
Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can even surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform better than data-driven tree-based embeddings on average. Our decision tree induction and embedding approaches can therefore serve as new knowledge-driven baselines for data-driven machine learning methods in the low-data regime. Furthermore, they offer ways to harness the rich world knowledge within LLMs for tabular machine learning tasks. Our code and results are available at https://github.com/ml-lab-htw/llm-trees.
Article 670
Title@2025-05-27 (2): Automated Privacy Information Annotation in Large Language Model Interactions
Title: Automated Privacy Information Annotation in Large Language Model Interactions | Automatisierte Datenschutzerklärung Annotation in Interaktionen mit großen Sprachmodellen | 大语言模式互动中自动隐私信息说明 2505.20910v1 |
Authors: Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Shaojie Tang, Guihai Chen
Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application scenarios, typically tagging personally identifiable information (PII) in anonymous content. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with cloud-based strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.
Article 671
Title@2025-05-27 (2): Towards Objective Fine-tuning: How LLMs’ Prior Knowledge Causes Potential Poor Calibration?
Title: Towards Objective Fine-tuning: How LLMs’ Prior Knowledge Causes Potential Poor Calibration? | Auf dem Weg zu einer objektiven Feinabstimmung: Wie verursacht LLMs’ vorheriges Wissen eine potenzielle schlechte Kalibrierung? | 目标微调:LLMS的先前知识原因如何造成潜在的不协调? 2505.20903v1 |
Authors: Ziming Wang, Zeyu Shi, Haoyi Zhou, Shiqi Gao, Qingyun Sun, Jianxin Li
Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs’ prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs’ prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs’ prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs’ encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model’s prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
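Illustrative sketch (not taken from the paper): the abstract reports its gains in ECE, the expected calibration error. The snippet below shows one common way this metric is computed, assuming 10 equal-width confidence bins; the function name, the bin count, and the binning convention are assumptions, and the paper may use a different variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     booleans, True where the prediction was right.
    n_bins:      number of equal-width bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin by its confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 90% confidence, 75% accuracy.
ece_example = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A lower value means confidence tracks accuracy more closely, so a relative reduction in ECE corresponds to the calibration improvement the abstract describes.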
Article 672
Title@2025-05-27 (2): A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models
Title: A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models | Eine Stereotyp-Inhaltsanalyse zu farbbezogenen sozialen Bias in großen Visions-Sprachmodellen | 关于大视觉语言模式中与肤色有关的社会偏见的定型内容分析 2505.20901v1 |
Authors: Junhyuk Choi, Minju Kim, Yeseon Hong, Bugeun Kim
As large vision language models (LVLMs) rapidly advance, concerns about their potential to learn and generate social biases and stereotypes are increasing. Previous studies on LVLMs’ stereotypes face two primary limitations: metrics that overlooked the importance of content words, and datasets that overlooked the effect of color. To address these limitations, this study introduces new evaluation metrics based on the Stereotype Content Model (SCM). We also propose BASIC, a benchmark for assessing gender, race, and color stereotypes. Using SCM metrics and BASIC, we conduct a study with eight LVLMs to discover stereotypes. We report three findings. (1) The SCM-based evaluation is effective in capturing stereotypes. (2) LVLMs exhibit color stereotypes in the output along with gender and race ones. (3) Interaction between model architecture and parameter sizes seems to affect stereotypes. We release BASIC publicly on [anonymized for review].
Article 673
Title@2025-05-27 (2): Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
Title: Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing | Dub-S2ST: Textlose Sprach-zu-Sprach-Übersetzung für nahtloses Synchronisieren | Dub-S2ST: 无缝Dubbing无文本语音翻译 2505.20899v1 |
Authors: Jeongsoo Choi, Jaehun Kim, Joon Son Chung
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech’s duration and speaking pace, while achieving competitive translation performance.
Article 674
Title@2025-05-27 (2): The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions
Title: The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Die versteckten Dimensionen der LLM-Ausrichtung: Eine mehrdimensionale Analyse der orthogonalen Sicherheitsanweisungen | LLM 对齐的隐藏面:对正交安全方向的多维分析 2502.09674v4 |
Authors: Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, Xiaohua Jia
Large Language Models’ safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model’s refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model’s refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
Article 675
Title@2025-05-27 (2): Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use
Title: Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use | Loquacious Set: 25.000 Stunden transkribierte und vielfältige englische Spracherkennungsdaten für Forschung und kommerzielle Nutzung | 便利的一套:25 000小时被分配和多样化的英语语音识别数据,供研究和商业使用 2505.21578v1 |
Authors: Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen
Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM, Libriheavy, or People’s Speech, suffer from major limitations, including licenses that researchers in the industry cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to work for academics and researchers in the industry to build ASR systems in real-world scenarios.
Article 676
Title@2025-05-27 (2): Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Kreuz von links nach rechts Gehirn: Adaptiver Texttraum für Vision-und-Sprachen-Navigation | 从左脑到右脑交叉:愿景和语言导航的适应性文本梦想者 2505.20897v1 |
Authors: Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics in language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
Article 677
Title@2025-05-27 (2): How Do Transformers Learn Variable Binding in Symbolic Programs?
Title: How Do Transformers Learn Variable Binding in Symbolic Programs? | Wie lernen Transformer variable Bindungen in Symbolischen Programmen? | 变换者如何在符号程序中学习变数绑定 ? 2505.20896v1 |
Authors: Yiwei Wu, Atticus Geiger, Raphaël Millière
Variable binding – the ability to associate variables with values – is fundamental to symbolic computation and cognition. Although classical architectures typically implement variable binding via addressable memory, it is not well understood how modern neural networks lacking built-in binding operations may acquire this capacity. We investigate this by training a Transformer to dereference queried variables in symbolic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing assignment chains. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, resulting in accurate dereferencing. Our results show how Transformer models can learn to implement systematic variable binding without explicit architectural support, bridging connectionist and symbolic approaches.
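Illustrative sketch (not taken from the paper): the task described above, following chains of variable assignments to a numerical constant while ignoring distractor chains, can be stated as a small reference function. The program representation (a list of `(name, value)` pairs), the single-assignment assumption, and the function name are hypothetical; the abstract does not specify the program syntax.

```python
def dereference(program, query, max_depth=4):
    """Follow a chain of assignments from `query` to a numerical constant.

    program:   list of (name, value) pairs, where value is either an int
               constant or the name of another variable (assumed format).
    query:     name of the variable whose value is requested.
    max_depth: maximum number of lookups; the abstract mentions chains
               up to four steps deep.
    """
    # Assumes each variable is assigned exactly once.
    bindings = dict(program)

    current = query
    for _ in range(max_depth):
        value = bindings[current]
        if isinstance(value, int):
            return value          # reached a constant: chain resolved
        current = value           # value names another variable: keep following
    raise ValueError(f"chain from {query!r} exceeds {max_depth} steps")


# One four-step chain (d -> c -> b -> a -> 5) plus a distractor chain (y -> x -> 9).
example_program = [
    ("a", 5), ("x", 9), ("b", "a"), ("y", "x"), ("c", "b"), ("d", "c"),
]
queried_value = dereference(example_program, "d")
```

This is the symbolic ground truth that the trained Transformer has to approximate; the paper's contribution concerns how the network learns to do so internally, which this sketch does not model.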
Article 678
Title@2025-05-27 (2): EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models
Title: EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models | EasyDistill: Ein umfassendes Toolkit für effektive Wissensdestillation von großen Sprachmodellen | 简易蒸馏:大语言模式有效知识蒸馏综合工具箱 2505.20888v1 |
Authors: Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud’s Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.
Article 679
Title@2025-05-27 (2): ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention
Title: ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention | ComplexEhemaliger: Disruptived Advance Transformer Inferenz-Fähigkeit über Head-Specific Complex Vector Achtung | 复杂形式:通过头部特定复杂矢量的注意,干扰推进变压器推断能力 2505.10222v2 |
Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct strategies for integrating semantic angle differences (ASmn,i) with relative positional encodings (Delta(Pmn),i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.
Article 680
Title@2025-05-27 (2): Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality
Title: Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality | Macht-Rechts-Dekay-Verlust für große Sprachmodell Finetuning: Fokussierung auf Informationssparsität zur Verbesserung der Generationsqualität | 大语言模型调整的功率法减退损失:侧重于信息平等以提高世代质量 2505.16900v3 |
Authors: Jintian Shao, Yiming Cheng, Hongyi Huang, Jiayi Wu, Beiwen Zhang, Zhiyu Wu, You Shan, Mingkai Zheng
During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
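Illustrative sketch (not taken from the paper): the re-weighting idea above can be written as a frequency-dependent weight applied to each token's cross-entropy term. The specific weight form `1 / (freq + eps) ** alpha`, the parameter names `alpha` and `eps`, and the use of relative corpus frequency are assumptions; the abstract only states that weights follow a power-law decay in token frequency.

```python
import math
from collections import Counter


def power_law_weights(corpus_tokens, alpha=1.0, eps=1e-8):
    """Per-token weights that decay as a power law of relative frequency.

    alpha controls how strongly frequent tokens are down-weighted;
    alpha = 0 gives every token weight 1 (plain cross-entropy).
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {
        token: 1.0 / ((count / total) + eps) ** alpha
        for token, count in counts.items()
    }


def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Mean of weight(token) * -log p(token) over a target sequence.

    target_probs holds the model's probability for each target token.
    Assumes every target token also occurs in the corpus used for weights.
    """
    losses = [
        weights[token] * -math.log(prob)
        for token, prob in zip(target_tokens, target_probs)
    ]
    return sum(losses) / len(losses)


# Toy corpus: "the" is frequent, "quark" is rare.
toy_corpus = ["the"] * 8 + ["quark"] * 2
toy_weights = power_law_weights(toy_corpus, alpha=1.0)
toy_loss = weighted_cross_entropy(["the", "quark"], [0.5, 0.5], toy_weights)
```

With equal model probabilities for both tokens, the rare token contributes more to the loss than the frequent one, which is the qualitative behaviour the abstract describes.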
Article 681
Title@2025-05-27 (2): Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective
Title: Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective | Auf dem Weg zur Analyse und dem Verständnis der Grenzen von VAPO: Eine theoretische Perspektive | 分析和理解VAPO的局限性:理论视角 2505.17997v2 |
Authors: Jintian Shao, Yiming Cheng, Hongyi Huang, Beiwen Zhang, Zhiyu Wu, You Shan, Mingkai Zheng
The VAPO framework has demonstrated significant empirical success in enhancing the efficiency and reliability of reinforcement learning for long chain-of-thought (CoT) reasoning tasks with large language models (LLMs). By systematically addressing challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals, VAPO achieves state-of-the-art performance. While its practical benefits are evident, a deeper theoretical understanding of its underlying mechanisms and potential limitations is crucial for guiding future advancements. This paper aims to initiate such a discussion by exploring VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged and where further investigation could yield more robust and generalizable reasoning agents. We delve into the intricacies of value function approximation in complex reasoning spaces, the optimality of adaptive advantage estimation, the impact of token-level optimization, and the enduring challenges of exploration and generalization.
Article 682
Title@2025-05-27 (2): Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning
Title: Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning | Vergessen im LLM-Fine-Tuning durch Low-Perplexity Token Learning verhindern | 减轻LLM 微调调整通过低重复调调调学习的忘却现象 2501.14315v3 |
Authors: Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Shao-Hua Sun, Hung-yi Lee
Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. This paper presents a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces non-target task degradation compared to fine-tuning with ground truth data. By analyzing the data sequences in tasks from various domains, we demonstrate that this enhancement of non-target task robustness stems from the reduction of high perplexity tokens found in LLM-generated sequences. Following our findings, we show that masking high perplexity tokens in ground truth training data achieves similar non-target task performance preservation, comparable to using LLM-generated data. Extensive experiments across different model families and scales, including Gemma 2 IT 2B, Llama 3 8B Instruct, and 3 additional models, agree with our findings. To the best of our knowledge, this is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning, offering valuable insights for developing more robust fine-tuning strategies.
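Illustrative sketch (not taken from the paper): masking high perplexity tokens can be expressed as a per-token filter on the training loss. Here the per-token perplexity is taken as `exp(-log p)` from log-probabilities supplied by the caller; the threshold value, the function names, and the choice of scoring model are assumptions that the abstract does not specify.

```python
import math


def low_perplexity_mask(token_logprobs, max_perplexity=2.5):
    """True for tokens to keep, False for high perplexity tokens to mask.

    token_logprobs: natural-log probabilities of each ground truth token
                    under some scoring model (assumed to be given).
    max_perplexity: hypothetical cut-off on exp(-log p).
    """
    return [math.exp(-lp) <= max_perplexity for lp in token_logprobs]


def masked_nll(token_logprobs, mask):
    """Mean negative log-likelihood over the unmasked tokens only."""
    kept = [-lp for lp, keep in zip(token_logprobs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Token probabilities 0.9, 0.05, 0.5 give perplexities of about 1.1, 20, 2.
toy_logprobs = [math.log(0.9), math.log(0.05), math.log(0.5)]
toy_mask = low_perplexity_mask(toy_logprobs)
toy_nll = masked_nll(toy_logprobs, toy_mask)
```

In a training loop, the mask would zero out the loss contribution of the surprising token, so gradient updates come only from tokens the scoring model already finds plausible.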
Article 683
Title@2025-05-27 (2): MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection
Title: MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection | MSA bei SemEval-2025 Task 3: Hochwertiges schwaches Etikettieren und LLM-Ensemble-Verifikation für Mehrsprachige Halluzinationserkennung | SemEval-2025 SMAS 任务3:高品质的差错标签和多种语言幻觉探测的LLM组合核查 2505.20880v1 |
Authors: Baraa Hikal, Ahmed Nasreldin, Ali Hamdi
This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, fuzzy matching refines span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
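A minimal sketch of probability-based voting over candidate spans, assuming each verifier returns a probability that a span is hallucinated. The function name, the mean-probability rule, and the 0.5 threshold are illustrative assumptions; the abstract does not specify the exact aggregation.

```python
def adjudicate_spans(candidate_spans, verifier_probs, threshold=0.5):
    """Keep spans whose mean verifier probability reaches the threshold.

    candidate_spans: spans proposed by the primary extraction model.
    verifier_probs: maps each span to the probabilities from the verifier LLMs.
    threshold: hypothetical acceptance cut-off.
    """
    kept = []
    for span in candidate_spans:
        probs = verifier_probs[span]
        if sum(probs) / len(probs) >= threshold:
            kept.append(span)
    return kept

spans = ["in 1990", "Paris"]
probs = {"in 1990": [0.9, 0.7, 0.4], "Paris": [0.2, 0.3, 0.6]}
accepted = adjudicate_spans(spans, probs)
```

Here the first span is kept (mean about 0.67) and the second is dropped (mean about 0.37).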
Article 684
Title@2025-05-27 (2): Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Title: Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties | Trans-EnV: Ein Rahmen zur Bewertung der sprachlichen Robustheit von LLMs gegen englische Sorten | Trans-EnV: 反英语多样性LLMs语言能力评价框架 2505.20875v1 |
Authors: Jiyoung Lee, Seungho Kim, Jieun Han, Jun-Min Lee, Kitaek Kim, Alice Oh, Edward Choi
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code (https://github.com/jiyounglee-0523/TransEnV) and datasets (https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1) are publicly available.
Article 685
Title@2025-05-27 (2): Can LLMs Learn to Map the World from Local Descriptions?
Title: Can LLMs Learn to Map the World from Local Descriptions? | Können LLMs lernen, die Welt aus lokalen Beschreibungen zu kartieren? | LLMs能够学习用当地描述绘制世界地图吗? 2505.20874v1 |
Authors: Sirui Xia, Aili Chen, Xintao Wang, Tinghui Zhu, Yikai Zhang, Jiangjie Chen, Yanghua Xiao
Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
Article 686
Title@2025-05-27 (2): Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG
Title: Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG | Divide-Then-Align: Ehrliche Ausrichtung auf Basis der Wissensgrenze der RAG | 分离后对齐:基于RAG知识界限的诚实一致 2505.20871v1 |
Authors: Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Yuehe Chen, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with “I don’t know” when the query is out of the knowledge boundary of both the retrieved passages and the model’s internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
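The quadrant split described above can be sketched as a two-by-two classification followed by preference-pair construction. This is only an illustration: the quadrant labels, the boolean inputs, and the choice of rejected response in the answerable quadrants are assumptions, not details confirmed by the abstract.

```python
IDK = "I don't know"

def knowledge_quadrant(model_knows, retrieval_has_answer):
    """Place a sample in one of four quadrants (labels are hypothetical)."""
    if model_knows and retrieval_has_answer:
        return "both"
    if model_knows:
        return "model_only"
    if retrieval_has_answer:
        return "retrieval_only"
    return "neither"

def build_preference_pair(gold_answer, model_answer, model_knows, retrieval_has_answer):
    """Build one chosen/rejected pair of the kind used for DPO training.

    Assumption: outside the knowledge boundary the abstention is preferred over
    the model's answer; elsewhere the gold answer is preferred over abstaining.
    """
    quadrant = knowledge_quadrant(model_knows, retrieval_has_answer)
    if quadrant == "neither":
        return {"quadrant": quadrant, "chosen": IDK, "rejected": model_answer}
    return {"quadrant": quadrant, "chosen": gold_answer, "rejected": IDK}

unknown = build_preference_pair("Canberra", "Sydney", False, False)
known = build_preference_pair("Canberra", "Canberra", False, True)
```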
Article 687
Title@2025-05-27 (2): AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection
Title: AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection | AmpleHate: Verstärkte Aufmerksamkeit für die Vielseitige Implizite Hate-Erkennung | 全面:扩大对易变性隐含仇恨侦测的注意 2505.19528v2 |
Authors: Yejin Lee, Joonghyuk Hahn, Hyeseon Ahn, Yo-Sub Han
Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which has been shown to be effective at distinguishing hate from non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these targets relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit targets using a pretrained Named Entity Recognition model and captures implicit target information via [CLS] tokens. It computes attention-based relationships between explicit targets, implicit targets, and sentence context, and then directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieving faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.
Article 688
Title@2025-05-27 (2): Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks
Title: Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks | Strukturierte Denkfragen: Verbesserung der LLM-Verallgemeinerung bei ursächlichen Folgeaufgaben | 结构思考事项:改进因果推断任务中的普遍化 2505.18034v2 |
Authors: Wentao Sun, João Paulo Nogueira, Alonso Silva
Despite remarkable advances in the field, LLMs remain unreliable in distinguishing causation from correlation. Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs, such as GPT-4 (F1 score: 29.08), only marginally outperform random baselines (Random Uniform, F1 score: 20.38), indicating limited capacity for generalization. To tackle this limitation, we propose a novel structured approach: rather than directly answering causal queries, we provide the model with the capability to structure its thinking by guiding it to build a structured knowledge graph that systematically encodes the provided correlational premises, and then to answer the causal queries from that graph. This intermediate representation significantly enhances the model’s causal capabilities. Experiments on the test subset of the Corr2Cause dataset benchmark with the Qwen3-32B reasoning model show substantial gains over standard direct prompting methods, improving F1 scores from 32.71 to 48.26 (over 47.5% relative increase), along with notable improvements in precision and recall. These results underscore the effectiveness of providing the model with the capability to structure its thinking and highlight its promising potential for broader generalization across diverse causal inference tasks.
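As a quick check of the reported relative increase, the arithmetic can be reproduced from the two F1 scores quoted in the abstract. The variable names are illustrative.

```python
# F1 scores quoted in the abstract (direct prompting vs. the structured approach).
baseline_f1 = 32.71
structured_f1 = 48.26

# Relative increase in percent: (new - old) / old * 100.
relative_gain = (structured_f1 - baseline_f1) / baseline_f1 * 100
print(f"relative F1 gain: {relative_gain:.2f}%")
```

The result is slightly above 47.5%, which matches the "over 47.5%" wording.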
Article 689
Title@2025-05-27 (2): SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Title: SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | SAFEPATH: Verhindern schädlicher Vernunft in der Kette der Gedanken durch frühzeitige Ausrichtung | SAFPATH:通过早期协调防止在研究链中产生有害理由 2505.14667v2 |
Authors: Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
Article 690
Title@2025-05-27 (2): SEPS: A Separability Measure for Robust Unlearning in LLMs
Title: SEPS: A Separability Measure for Robust Unlearning in LLMs | SEPS: Eine Separabilitätsmessung für robustes Lernen in LLMs | SEPS: LLMM 中强有力解学的分离措施 2505.14832v2 |
Authors: Wonje Jeung, Sangyeon Yoon, Albert No
Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model’s ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
Article 691
Title@2025-05-27 (2): DUSK: Do Not Unlearn Shared Knowledge
Title: DUSK: Do Not Unlearn Shared Knowledge | DUSK: Gemeinsames Wissen nicht entschärfen | DUSK: 不共享未读共享知识 2505.15209v2 |
Authors: Wonje Jeung, Sangyeon Yoon, Hyesoo Hong, Soeun Kim, Seungju Han, Youngjae Yu, Albert No
Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such ‘forget’ data while preserving utility and information from the ‘retain’ set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.
Article 692
Title@2025-05-27 (2): An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Title: An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks | Ein LLM-as-Judge Metric zur Überwindung der Lücke mit menschlicher Bewertung in SE-Aufgaben | 消除社会经济任务中与人的评价差距的法学硕士法官 2505.20854v1 |
Authors: Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, David Lo
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge’s potential as a scalable and reliable alternative to human evaluation.
Article 693
Title@2025-05-27 (2): Concealment of Intent: A Game-Theoretic Analysis
Title: Concealment of Intent: A Game-Theoretic Analysis | Concealment of Intent: Eine Game-Theoretische Analyse | 隐藏意图:游戏理论分析 2505.20841v1 |
Authors: Xinbo Wu, Abhishek Umrawal, Lav R. Varshney
As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack’s effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
Article 694
Title@2025-05-27 (2): Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Title: Tuning LLM Judge Design Decisions for 1/1000 of the Cost | Tuning LLM Richter Design Entscheidungen für 1/1000 der Kosten | 1 000美元费用1 000美元法官设计决定 2501.17178v4 |
Authors: David Salinas, Omar Swelam, Frank Hutter
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage a multi-objective multi-fidelity approach, which allows us to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at this repository https://github.com/geoalgo/judgetuning .
Article 695
Title@2025-05-27 (2): The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents
Title: The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents | Die Kraft der Persönlichkeit: Eine menschliche Simulationsperspektive zur Untersuchung von Large Language Model Agents | 个性力量:从人类模拟角度调查大语言示范物剂 2502.20859v2 |
Authors: Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang
Large language models (LLMs) excel in both closed tasks (including problem-solving and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real-world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of “human simulation”, addressing three core questions: (1) How do personality traits affect problem-solving in closed tasks? (2) How do traits shape creativity in open tasks? (3) How does single-agent performance influence multi-agent collaboration? By assigning Big Five personality traits to LLM agents and evaluating their performance in single- and multi-agent settings, we reveal that specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Furthermore, multi-agent systems exhibit collective intelligence distinct from individual capabilities, driven by distinguishing combinations of personalities.
Article 696
Title@2025-05-27 (2): Enhance Mobile Agents Thinking Process Via Iterative Preference Learning
Title: Enhance Mobile Agents Thinking Process Via Iterative Preference Learning | Mobile Agenten durch iteratives Preference-Lernen weiter denken | 加强移动媒介思考流程动态动态迭代性优先学习 2505.12299v2 |
Authors: Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using rule-based rewards, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS, achieving state-of-the-art performance and showing strong generalization to out-of-domain scenarios.
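The tree-scoring step above can be sketched as follows. This is a simplified illustration, assuming a node's value is the mean of its children's values and that the best and worst siblings form one preference pair; both rules, the dictionary layout, and the function names are assumptions rather than the paper's exact procedure.

```python
def backpropagate(node):
    """Fill in node['value'] bottom-up: leaves use their rule-based reward,
    internal nodes use the mean of their children's values (assumed rule)."""
    if not node["children"]:
        node["value"] = node["reward"]
    else:
        child_values = [backpropagate(child) for child in node["children"]]
        node["value"] = sum(child_values) / len(child_values)
    return node["value"]

def preference_pairs(node, pairs=None):
    """Collect (chosen, rejected) thought pairs from siblings with different values."""
    if pairs is None:
        pairs = []
    children = node["children"]
    if len(children) >= 2:
        best = max(children, key=lambda c: c["value"])
        worst = min(children, key=lambda c: c["value"])
        if best["value"] > worst["value"]:
            pairs.append((best["thought"], worst["thought"]))
    for child in children:
        preference_pairs(child, pairs)
    return pairs

# Toy tree: thought A leads to one successful and one failed leaf; thought B fails.
tree = {"thought": "root", "children": [
    {"thought": "A", "children": [
        {"thought": "A1", "children": [], "reward": 1.0},
        {"thought": "A2", "children": [], "reward": 0.0},
    ]},
    {"thought": "B", "children": [], "reward": 0.0},
]}
root_value = backpropagate(tree)
pairs = preference_pairs(tree)
```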
Article 697
Title@2025-05-27 (2): Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning
Title: Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning | Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning | 不要半听半听:在连续教学图示中获取关键部分信息 2403.10056v4 |
Authors: Yongquan He, Wenyuan Zhang, Xuancheng Huang, Peng Zhang, Lingxun Meng, Xiang Zhou, Ke Zeng, Xunliang Cai
Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
Article 698
Title@2025-05-27 (2): Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations
Title: Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations | Inklusive Systematische Bewertungen aktivieren: Einschließlich Preprint-Artikel mit großsprachigen modellgetriebenen Bewertungen | 促进包容性的系统审查:将预印条款纳入大语言模式示范评价 2503.13857v3 |
Authors: Rui Yang, Jiayi Tong, Haoyuan Wang, Hui Huang, Ziyang Hu, Peiyu Li, Nan Liu, Christopher J. Lindsell, Michael J. Pencina, Yong Chen, Chuan Hong
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate incorporation of preprint articles during the appraisal phase of systematic reviews, supporting researchers in more effective utilization of preprint resources.
Article 699
Title@2025-05-27 (2): WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Title: WizardCoder: Empowering Code Large Language Models with Evol-Instruct | WizardCoder: Empowering Code Große Sprachmodelle mit Evol-Instruct | 巫师编码器:授权使用电动制造器的守则大语言模型 2306.08568v2 |
Authors: Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic’s Claude and Google’s Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
Article 700
Title@2025-05-27 (2): MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving
Title: MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving | MA-LoT: Modell-Kollaboration Lean-based Long Chain-of-Thought Reasoning verbessert formalen Theorem Proving | MA-LOT:示范-协作:基于精液的探讨理由长期链加强正式理论证明 2503.03205v3 |
Authors: Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate a complete proof or perform tree search, but they fail to balance these tasks. We propose MA-LoT: Model-CollAboration Lean-based Long Chain-of-Thought, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this through structured interaction between the LLM and the Lean4 verifier in Long CoT. To implement the framework, we propose the novel LoT-Transfer Learning training-inference pipeline, which gives LLMs Long CoT thinking capability without special data annotation. Extensive experiments show that our framework achieves a 61.07% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.
Article 701
Title@2025-05-27 (2): R-TOFU: Unlearning in Large Reasoning Models
Title: R-TOFU: Unlearning in Large Reasoning Models | R-TOFU: Unlearning in großen Vernunftmodellen | R-TOFU:在大理由模型中重新学习 2505.15214v2 |
Authors: Sangyeon Yoon, Wonje Jeung, Albert No
Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.
Article 702
Title@2025-05-27 (2): AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset
Title: AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset | AdParaphrase v2.0: Attraktive Ad-Texte mit einem Präferenz-Annotierten Paraphrase-Datensatz generieren | AdParadhanv2.0:利用附加说明的首选参数句数据集生成有吸引力的附加文本 2505.20826v1 |
Authors: Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.
Article 703
Title@2025-05-27 (2): Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation
Title: Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation | Verstärkte Informativitätsoptimierung für die langformige Retrieval-Augmented Generation | 长期回收型后期人种最佳利用强化信息 2505.20825v1 |
Authors: Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/RUCAIBox/RioRAG.
Article 704
Title@2025-05-27 (2): Predicting drug-gene relations via analogy tasks with word embeddings
Title: Predicting drug-gene relations via analogy tasks with word embeddings | Vorhersage von Drogen-Gene-Beziehungen über Analogieaufgaben mit Worteinbettungen | 通过用词嵌入词词类比任务预测毒品与基因的关系 2406.00984v5 |
Authors: Hiroaki Yamagiwa, Ryoma Hashimoto, Kiwamu Arakane, Ken Murakami, Shou Soeda, Momose Oyama, Yihua Zhu, Mariko Okada, Hidetoshi Shimodaira
Natural language processing (NLP) is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For example, subtracting the vector for man from that of king and then adding the vector for woman yields a point that lies closer to queen in the embedding space. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year. Despite the simplicity of implementing analogy tasks as vector additions, our approach demonstrated performance comparable to that of large language models such as GPT-4 in predicting drug-gene relations.
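The analogy computation described above can be sketched in a few lines. This is a minimal illustration using invented two-dimensional toy vectors, not BioConceptVec embeddings; the `analogy` helper and the vocabulary are assumptions made for the example.

```python
import numpy as np

# Toy embeddings invented for illustration (axis 0: royalty, axis 1: maleness).
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, emb):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding the inputs."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman", emb))  # prints "queen"
```

For drug-gene prediction the same arithmetic would presumably add a relation vector, estimated from known drug-gene pairs, to a drug's embedding and rank genes by similarity; that setup is only inferred from the abstract.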
Article 705
Title@2025-05-27 (2): Tracing and Reversing Rank-One Model Edits
Title: Tracing and Reversing Rank-One Model Edits | Rank-One-Modellbearbeitungen verfolgen und umkehren | 追踪和校正一等一模式编辑 2505.20819v1 |
Authors: Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer
Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model’s original outputs with $\geq$ 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
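To make the term "rank-one edit" concrete, the sketch below applies an update of the form W + u vᵀ to a random matrix and checks that the change has rank one. The matrix sizes and vectors are arbitrary assumptions; how the paper locates, interprets, or reverses ROME edits is not specified in the abstract and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrix and edit vectors, chosen only for illustration.
W = rng.normal(size=(6, 4))
u = rng.normal(size=6)
v = rng.normal(size=4)

# A rank-one edit adds the outer product of two vectors to the weights.
W_edited = W + np.outer(u, v)

# The difference between edited and original weights has rank one.
delta = W_edited - W
rank = np.linalg.matrix_rank(delta, tol=1e-8)
print(rank)

# If u and v were known, subtracting the same outer product would undo the edit.
W_restored = W_edited - np.outer(u, v)
print(np.allclose(W_restored, W))
```

The low-rank structure of the change is one plausible reason an edit could leave a detectable signature in the weights, though the specific signals the authors use may differ.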
Article 706
Title@2025-05-27 (2): HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices
Title: HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices | HomeBench: Bewertung von LLMs in Smart Homes mit gültigen und ungültigen Anweisungen über einzelne und mehrere Geräte | HomeBench: 评估智能住宅中具有跨越单一和多种装置的无效和无效指令的智能住宅中LLMs 2505.19628v2 |
Authors: Silin Li, Yuhang Guo, Jiashu Yao, Zeming Liu, Haifeng Wang
Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions, and they must execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices. We report experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at https://github.com/BITHLP/HomeBench.
Article 707
Title@2025-05-27 (2): Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints
Title: Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints | Semantisches Parsing für große Sprachmodelle neu denken: LLM-Performance mit semantischen Hinweisen verbessern | 重新思考大语言模型的语义分解:用语义提示提高LLM性能 2409.14469v2 |
Authors: Kaikai An, Shuzheng Si, Helan Hu, Haozhe Zhao, Yuchi Wang, Qingyan Guo, Baobao Chang
Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs’ performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.
Article 708
Title@2025-05-27 (2): TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
Title: TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent | TrojanStego: Ihr Sprachmodell kann geheim ein Steganographic Privacy Leaking Agent sein | TrojanStego:您的语言模式可以秘密地隐秘地隐秘地渗漏剂。 2505.20118v2 |
Authors: Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
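The idea of encoding bits through a vocabulary partition, and of recovering them by majority voting over several generations, can be illustrated with a toy sketch. The parity-based partition, the candidate lists, and all function names below are assumptions for illustration; the abstract does not describe the actual partition or how the fine-tuned model realizes it.

```python
from collections import Counter

def encode(bits, candidates_per_step):
    """At each step, pick the first candidate token id whose parity equals the bit.

    Assumes the vocabulary is split into two buckets by token-id parity.
    """
    return [next(t for t in cands if t % 2 == bit)
            for bit, cands in zip(bits, candidates_per_step)]

def decode(token_ids):
    """Read one bit per token from the bucket (parity) it falls in."""
    return [t % 2 for t in token_ids]

def majority_vote(decodings):
    """Combine several noisy decodings position by position."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*decodings)]

secret = [1, 0, 1, 1]
# Hypothetical plausible next-token candidates at each generation step.
candidates = [[10, 11], [4, 7], [8, 3], [5, 6]]

tokens = encode(secret, candidates)
print(decode(tokens) == secret)

# Three decodings, each with one flipped bit at a different position.
noisy = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 0, 1]]
print(majority_vote(noisy) == secret)
```

With an odd number of generations and binary symbols, the vote never ties, which is one reason three generations is a convenient choice for this kind of error correction.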
Article 709
Title@2025-05-27 (2): Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective
Title: Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective | Rethinking Information Synthese in multimodalen Fragen Antwort auf eine multi-agente Perspektive | 以多机构视角回答多式联运问题 重新思考信息综述 2505.20816v1 |
Authors: Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta
Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, which ultimately limits both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
Article 710
Title@2025-05-27 (2): Exploring the Necessity of Reasoning in LLM-based Agent Scenarios
Title: Exploring the Necessity of Reasoning in LLM-based Agent Scenarios | Erforschung der Notwendigkeit der Vernunft in LLM-basierten Agent-Szenarien | 探讨基于LLM代理设想情况中合理理由的必要性 2503.11074v2 |
Authors: Xueyang Zhou, Guiyao Tie, Guowen Zhang, Weidong Wang, Zhigang Zuo, Di Wu, Duanfeng Chu, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
The rise of Large Reasoning Models (LRMs) signifies a paradigm shift toward advanced computational reasoning. Yet, this progress disrupts traditional agent frameworks, traditionally anchored by execution-oriented Large Language Models (LLMs). To explore this transformation, we propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving, assessed with three top LLMs (e.g., Claude3.5-sonnet) and five leading LRMs (e.g., DeepSeek-R1). Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes; LLMs excel in execution-driven tasks such as Tool Usage, prioritizing efficiency; hybrid LLM-LRM configurations, pairing LLMs as actors with LRMs as reflectors, optimize agent performance by blending execution speed with reasoning depth; and LRMs’ enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies. This study fosters deeper inquiry into LRMs’ balance of deep thinking and overthinking, laying a critical foundation for future agent design advancements.
Article 711
Title@2025-05-27 (2): CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis
Title: CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis | CulFiT: Ein feinkörniges Kulturbewusstsein LLM-Training Paradigma über Mehrsprachige Kritikdatensynthese | CulFIT:通过多种语言的克里端数据综合分析进行精美的有文化意识的LLM培训模型 2505.19484v2 |
Authors: Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, Shuo Shang
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse culture-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalCultureQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalCultureQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.
Article 712
Title@2025-05-27 (2): Improved Representation Steering for Language Models
Title: Improved Representation Steering for Language Models | Verbesserte Repräsentationssteuerung für Sprachmodelle | 改进语文模式代表性指导 2505.20809v1 |
Authors: Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, Christopher Potts
Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting – while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
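As background for what "representation steering" and "suppression" mean operationally, the sketch below shows the generic intervention of adding or removing a concept direction in a hidden state. It does not implement RePS or its preference-optimization objective; the vectors, the `alpha` strength, and the function names are assumptions for illustration.

```python
import numpy as np

def steer(h, v, alpha):
    """Push hidden state h along the unit concept direction by strength alpha."""
    v_unit = v / np.linalg.norm(v)
    return h + alpha * v_unit

def suppress(h, v):
    """Remove the component of h that lies along the concept direction."""
    v_unit = v / np.linalg.norm(v)
    return h - (h @ v_unit) * v_unit

# Hypothetical hidden state and concept direction.
h = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 0.0, 2.0])

print(steer(h, v, alpha=1.5))  # [1.  2.  4.5]
print(suppress(h, v))          # [1. 2. 0.]
```

In a method such as the one described, the direction (or a richer parameterization of the intervention) would be learned rather than fixed by hand.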
Article 713
Title@2025-05-27 (2): Sentiment Reasoning for Healthcare
Title: Sentiment Reasoning for Healthcare | Sentiment Reasoning für die Gesundheitsversorgung | 保健的情感理由 2407.21054v4 |
Authors: Khai-Nguyen Nguyen, Khai Le-Duc, Bach Phan Tat, Duy Le, Long Vo-Dang, Truong-Son Hy
Transparency in AI healthcare decision-making is crucial. By incorporating rationales that explain the reason for each predicted label, users can understand the reasoning of Large Language Models (LLMs) and make better decisions. In this work, we introduce a new task, Sentiment Reasoning, for both speech and text modalities, along with our proposed multimodal multitask framework and the world’s largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model both predicts the sentiment label and generates the rationale behind it based on the input transcript. Our study, conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts, shows that Sentiment Reasoning helps improve model transparency by providing rationales for model predictions with quality semantically comparable to humans, while also improving the model’s classification performance (+2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. In addition, we find no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (five languages: Vietnamese, English, Chinese, German, and French), and models are published online: https://github.com/leduckhai/Sentiment-Reasoning
Article 714
Title@2025-05-27 (2): A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models
Title: A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models | Eine Graphenperspektive zur Untersuchung struktureller Wissensmuster in großen Sprachmodellen | 《大语言模式知识结构模式研究图示展望》 2505.19286v2 |
Authors: Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang
Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover the knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.
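Knowledge homophily suggests that an entity's knowledgeability can be guessed from its neighbors. The authors use graph machine learning models for this; the sketch below substitutes a much simpler neighbor-average baseline to show the idea. The graph, the scores, and the function name are invented for illustration.

```python
def neighbor_estimate(graph, scores, node):
    """Estimate a node's knowledgeability as the mean score of its scored neighbors.

    Returns None when no neighbor has a known score.
    """
    known = [scores[n] for n in graph.get(node, []) if n in scores]
    if not known:
        return None
    return sum(known) / len(known)

# Hypothetical entity graph (adjacency lists) and per-entity knowledgeability scores.
graph = {
    "Paris": ["France", "Seine"],
    "France": ["Paris"],
    "Seine": ["Paris"],
    "Atlantis": [],
}
scores = {"France": 0.75, "Seine": 0.25}

print(neighbor_estimate(graph, scores, "Paris"))     # 0.5
print(neighbor_estimate(graph, scores, "Atlantis"))  # None
```

Entities with low estimated scores would be natural candidates when selecting lesser-known triplets for fine-tuning, in the spirit of the knowledge-checking step described above.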
Article 715
Title@2025-05-27 (2): WizardLM: Empowering large pre-trained language models to follow complex instructions
Title: WizardLM: Empowering large pre-trained language models to follow complex instructions | WizardLM: Ermächtigen von großen vortrainierten Sprachmodellen, komplexe Anweisungen zu befolgen | 巫灵LM:授权大型预先培训的语文模式遵循复杂的指令 2304.12244v3 |
Authors: Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang
Training large language models (LLMs) with open-domain instruction-following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna’s test set show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high-complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
Article 716
Title@2025-05-27 (2): MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability
Title: MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability | MaskSearch: Ein universelles Pre-Training-Framework, um Agentische Suchfähigkeit zu verbessern | 保护面具搜索:加强制剂搜索能力的普遍培训前框架 2505.20285v2 |
Authors: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large amount of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
Article 717
Title@2025-05-27 (2): SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences | SpecExtend: Ein Drop-in-Enhancement für spekulative Decoding von langen Sequenzen | 外观:对长期序列的投机性代谢的减少增强 2505.20776v1 |
Authors: Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models, reducing latency across all stages. To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy that uses the target model’s attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. The code is available at https://github.com/jycha98/SpecExtend .
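The context-selection idea behind Cross-model Retrieval, using the target model's attention scores to choose what the draft model sees, can be sketched as a top-k selection over chunks of the input. Splitting the context into fixed-size chunks and scoring each chunk by its mean attention are assumptions made here; the abstract does not give these details.

```python
import numpy as np

def select_chunks(attn_scores, chunk_size, top_k):
    """Return indices of the top_k context chunks ranked by mean attention score.

    attn_scores: 1-D array with one target-model attention score per context token.
    Trailing tokens that do not fill a whole chunk are ignored in this sketch.
    """
    n_chunks = len(attn_scores) // chunk_size
    usable = attn_scores[: n_chunks * chunk_size]
    chunk_scores = usable.reshape(n_chunks, chunk_size).mean(axis=1)
    top = np.argsort(chunk_scores)[-top_k:]
    return sorted(top.tolist())

# Hypothetical per-token attention scores over an 8-token context.
scores = np.array([0.1, 0.1, 0.9, 0.8, 0.0, 0.1, 0.5, 0.6])
print(select_chunks(scores, chunk_size=2, top_k=2))  # [1, 3]
```

Only the selected chunks would then be kept in the draft model's KV cache, which is how a smaller context could reduce drafting cost while preserving the tokens the target model attends to most.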
Article 718
Title@2025-05-27 (2): Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Title: Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs | Achten Sie auf Ihr Po! Messen und Abmildern von KI-Sicherheitsrisiken bei Rollenspielen Feintuning von LLMs | 当心你的阿宝! 衡量并减轻AI公司在角色扮演中的安全风险 微调LLMs 2502.20968v2 |
Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
Article 719
Title@2025-05-27 (2): ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools
Title: ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools | ChemHAS: Hierarchische Agenzien-Stacking zur Verbesserung von Chemiewerkzeugen | ChemHAS:加强化学工具的等级代理人 2505.21569v1 |
Authors: Zhucong Li, Bowei Zhang, Jin Xiao, Zhijian Zhou, Fenglei Cao, Jiaqing Liang, Yuan Qi
Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLM-based agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at https://anonymous.4open.science/r/ChemHAS-01E4/README.md.
Article 720
Title@2025-05-27 (2): Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey
Title: Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey | Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage | 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v3 |
Authors: Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Kuan Lu, Menghao Huo, Guangwu Qian, Keqin Li, Qiuwu Chen, Lewei He
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms – spanning policy gradients, actor-critic methods, human-feedback alignment, and preference-based optimization – and their adaptations to the unique challenges of code generation, such as sparse and delayed rewards. Next, we analyze key benchmarks, datasets, and evaluation metrics that drive progress in RL-augmented Code LLMs. Finally, we identify open problems, including the need for richer feedback sources, support for low-level and domain-specific languages, and methods to reduce computational overhead. By consolidating current insights and outlining future directions, this work aims to guide researchers and practitioners in leveraging RL to produce more robust, efficient, and human-aligned code generation systems.
Article 721
Title@2025-05-27 (2): Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains
Title: Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains | Denken Sie leise, denken Sie schnell: Dynamische Latent-Kompression von LLM-vernünftigen Ketten | 默默思考,快速思考:LLM 解释性链条的动态延迟压缩 2505.16552v3 |
Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Ruihua Song
Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next-compressed-embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head’s non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
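The embedding-merging step can be pictured as pooling every c consecutive token embeddings into one vector, with c drawn at random from a range during training. The sketch below uses mean pooling and a range of 1 to 4; both choices, and the function name, are assumptions, since the abstract does not state the merge operator or the range.

```python
import numpy as np

def compress(embeddings, c):
    """Merge each run of c consecutive token embeddings into one vector by averaging.

    embeddings: array of shape (num_tokens, dim). A shorter final group is
    averaged over the tokens it actually contains.
    """
    groups = [embeddings[i:i + c] for i in range(0, len(embeddings), c)]
    return np.stack([g.mean(axis=0) for g in groups])

# Hypothetical sequence of 6 token embeddings with dimension 2.
emb = np.arange(12, dtype=float).reshape(6, 2)

print(compress(emb, 3))  # rows [2. 3.] and [8. 9.]

# During training, the compression factor would be sampled per example.
rng = np.random.default_rng(0)
c = int(rng.integers(1, 5))  # assumed range 1..4
print(compress(emb, c).shape)
```

A larger c yields fewer latent steps for the same reasoning chain, which corresponds to the speed and length trade-off the abstract describes.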
Article 722
Title@2025-05-27 (2): No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models
Title: No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models | Kein LLM ist frei von Bias: Eine umfassende Studie der Bias-Bewertung in großen Sprachmodellen | No LLM “ 免于偏见:对大语言模式的偏见评价的全面研究 “ 。 2503.11985v2 |
Authors: Charaka Vinayak Kumar, Ashok Urlana, Gopichand Kanumolu, Bala Mallikarjunarao Garlapati, Pruthwik Mishra
Advancements in Large Language Models (LLMs) have increased the performance of different natural language understanding as well as generation tasks. Although LLMs have surpassed the previous state of the art in various tasks, they often reflect different forms of bias present in the training data. In light of this limitation, we provide a unified evaluation of benchmarks using a set of representative small and medium-sized LLMs that cover different forms of bias, ranging from physical characteristics to socio-economic categories. Moreover, we propose five prompting approaches to carry out the bias detection task across different aspects of bias. Further, we formulate three research questions to gain valuable insight into detecting biases in LLMs using different approaches and evaluation metrics across benchmarks. The results indicate that each of the selected LLMs suffers from one form of bias or another, with the Phi-3.5B model being the least biased. Finally, we conclude the paper with the identification of key challenges and possible future directions.
Article 723
Title@2025-05-27 (2): Systematic Generalization in Language Models Scales with Information Entropy
Title: Systematic Generalization in Language Models Scales with Information Entropy | Systematische Generalisierung in Sprachmodellen Skalen mit Informationsentropie | 语言模型中系统化的通用化( 带有信息信封的语言模型缩放) 2505.13089v2 |
Authors: Sondre Wold, Lucas Georges Gabriel Charpentier, Étienne Simon
Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
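As a rough illustration of the quantity the abstract refers to, the sketch below computes the Shannon entropy of an empirical distribution over component parts. The paper's actual framework for sequence-to-sequence tasks is not given in the abstract; the function name `shannon_entropy` and the toy inputs are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(components):
    """Shannon entropy (in bits) of the empirical distribution of `components`."""
    counts = Counter(components)
    total = sum(counts.values())
    # H = -sum p * log2(p) over the observed component types.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A training set whose components are spread evenly has high entropy, while one dominated by a single component has entropy near zero, which is the low-entropy regime the abstract suggests as a target for assessing robust systematic generalization.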
Article 724
Title@2025-05-27 (2): Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?
Title: Can Small Language Models Learn, Unlearn, and Retain Noise Patterns? | Können kleine Sprachmodelle Geräuschmuster lernen, nicht lernen und erhalten? | 小语言模型能够学习、不学习和保留噪音模式吗? 2407.00996v3 |
Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani
With the growing need for efficient language models in resource-constrained environments, Small Language Models (SLMs) have emerged as compact and practical alternatives to Large Language Models (LLMs). While studies have explored noise handling in LLMs, little is known about how SLMs handle noise, a critical factor for their reliable real-world deployment. This study investigates the ability of SLMs with parameters between 1 and 3 billion to learn, retain, and subsequently eliminate different types of noise (word flip, character flip, transliteration, irrelevant content, and contradictory information). Four pretrained SLMs (Olmo 1B, Qwen1.5 1.8B, Gemma1.1 2B, and Phi2 2.7B) were instruction-tuned on noise-free data and tested with in-context examples to assess noise learning. Subsequently, noise patterns were introduced in instruction tuning to assess their adaptability. The results revealed differences in how models handle noise, with smaller models like Olmo quickly adapting to noise patterns. Phi2’s carefully curated, structured, and high-quality pretraining data enabled resistance to character level, transliteration, and counterfactual noise, while Gemma adapted successfully to transliteration noise through its multilingual pretraining. Subsequent clean data training effectively mitigated noise effects. These findings provide practical strategies for developing robust SLMs for real-world applications.
Article 725
Title@2025-05-27 (2): Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator
Title: Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator | Schalldämpfer: Von der Entdeckung zur Eindämmung von Selbst-Bias im LLM-as-Benchmark-Generator | 沉默器:从发现到减少LLM-as-Bunchmark-Generator中的自我比亚 2505.20738v1 |
Authors: Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate a high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with a high-quality human-annotated benchmark), while also exhibiting strong generalizability.
Article 726
Title@2025-05-27 (2): BQA: Body Language Question Answering Dataset for Video Large Language Models
Title: BQA: Body Language Question Answering Dataset for Video Large Language Models | BQA: Körper Sprache Frage-Frage-Beantwortung Datensatz für Video Große Sprachmodelle | BQA:视频大语言模型的体语言问题解答数据集 2410.13206v2 |
Authors: Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent. To address this, we propose BQA, a body language question answering dataset, to validate whether the model can correctly interpret emotions from short video clips of body language annotated with 26 emotion labels. We evaluated various VideoLLMs on BQA and found that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs gave significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
Article 727
Title@2025-05-27 (2): SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
Title: SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution | SPA-RL: Verstärkung der LLM-Agenten durch schrittweise Fortschrittszuweisung | SPA-RL:通过逐步推进加强LLM代理 2505.20732v1 |
Authors: Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA-RL-Agent.
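The reward redistribution idea can be sketched in a few lines. This is only an illustration of the two ingredients named in the abstract, not the authors' implementation: the squared-error objective, the additive combination, the `weight` parameter, and both function names are assumptions.

```python
def progress_loss(step_contributions, final_reward):
    """Squared gap between accumulated per-step contributions and the final reward.

    A squared-error objective is an assumption; the abstract only says the
    accumulated contributions are trained to match task completion.
    """
    return (sum(step_contributions) - final_reward) ** 2


def intermediate_rewards(step_contributions, grounding_signals, weight=1.0):
    """Combine each step's estimated contribution with its grounding signal.

    The additive form and `weight` are hypothetical illustration choices.
    """
    return [c + weight * g for c, g in zip(step_contributions, grounding_signals)]
```

When the per-step contributions already sum to the delayed final reward, the loss is zero, and each action receives its own intermediate reward instead of waiting for the end of the trajectory.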
Article 728
Title@2025-05-27 (2): What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals
Title: What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals | Was LLMs in Empfehlungen vermissen: Die Lücke mit retrieval-Augmented Collaborative Signals überbrücken | 在建议中错过了什么的LLM女士:用检索增强的合作信号弥合差距 2505.20730v1 |
Authors: Shahrooz Pouryousef
User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs’ ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality, highlighting a promising direction for future LLM-based recommenders.
Article 729
Title@2025-05-27 (2): S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
Title: S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models | S1-Bench: Ein einfacher Benchmark für die Bewertung von System 1 Denkfähigkeit von Großmodellen | S1-区:评估系统1思考大理由模型的能力的简单基准 2504.10368v3 |
Authors: Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu
We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, there is no appropriate benchmark for evaluating LRMs’ system 1 thinking capabilities. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs’ performance on questions more suitable for system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.
Article 730
Title@2025-05-27 (2): Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection
Title: Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection | Effiziente und präzise Optimierung: Der Vorteil des Gedächtnisses in exemplargeführter Reflexion | 高效和准确的迅速优化:外光引导反射中内存的益处 2411.07446v2 |
Authors: Cilin Yan, Jingyun Wang, Lin Zhang, Ruihui Zhao, Xiaopu Wu, Kai Xiong, Qingsong Liu, Guoliang Kang, Yangyang Kang
Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedback generated from erroneous cases to guide prompt optimization. During inference, they may further retrieve several semantically related exemplars and concatenate them to the optimized prompts to improve performance. However, those works only utilize the feedback at the current step, ignoring historical and unselected feedback that is potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show our method surpasses previous state-of-the-art methods with fewer optimization steps, i.e., improving F1 score by 10.1 on the LIAR dataset, and reducing the number of optimization steps by half on ProTeGi.
Article 731
Title@2025-05-27 (2): Autoregressive Speech Synthesis without Vector Quantization
Title: Autoregressive Speech Synthesis without Vector Quantization | Autoregressive Sprachsynthese ohne Vector Quantization | 无矢量量化的自动递减语音合成 2407.08551v2 |
Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which is typically designed for audio compression and sacrifices fidelity compared to continuous representations. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling vector-quantized codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. The demos of our work are provided at https://aka.ms/melle.
Article 732
Title@2025-05-27 (2): ProgCo: Program Helps Self-Correction of Large Language Models
Title: ProgCo: Program Helps Self-Correction of Large Language Models | ProgCo: Programm hilft bei der Selbstkorrektur großer Sprachmodelle | ProgC:帮助大语言模式自我校正方案 2501.01264v2 |
Authors: Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and can further enhance performance when combined with real program tools. We release our code at https://github.com/songxiaoshuai/progco.
Article 733
Title@2025-05-27 (2): LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models
Title: LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models | LatentExplainer: Erklären von latenten Darstellungen in tiefgenerativen Modellen mit multimodalen großen Sprachmodellen | 前任Explainer:在多模式大语言模型的深创模型中解释前述表述 2406.14862v6 |
Authors: Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, Liang Zhao
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. Our approach perturbs latent variables, interpreting changes in generated data, and uses multimodal large language models (MLLMs) to produce human-understandable explanations. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations for latent variables. The results highlight the effectiveness of incorporating inductive biases and uncertainty quantification, significantly enhancing model interpretability.
Article 734
Title@2025-05-27 (2): Analyzing Biases in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework
Title: Analyzing Biases in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework | Analyse von Biasen im politischen Dialog: Tagging US-Präsidentschaftsdebatten mit einem erweiterten DAMSL-Rahmen | 分析政治对话中的偏见:美国总统辩论与扩展的DAMSL框架拖累美国总统辩论 2505.19515v2 |
Authors: Lavanya Prahallad, Radhika Mamidi
We present a critical discourse analysis of the 2024 U.S. presidential debates, examining Donald Trump’s rhetorical strategies in his interactions with Joe Biden and Kamala Harris. We introduce a novel annotation framework, BEADS (Bias Enriched Annotation for Dialogue Structure), which systematically extends the DAMSL framework to capture bias-driven and adversarial discourse features in political communication. BEADS includes a domain- and language-agnostic set of tags that model ideological framing, emotional appeals, and confrontational tactics. Our methodology compares detailed human annotation with zero-shot ChatGPT-assisted tagging on verified transcripts from the Trump and Biden (19,219 words) and Trump and Harris (18,123 words) debates. Our analysis shows that Trump consistently dominated in key categories: Challenge and Adversarial Exchanges, Selective Emphasis, Appeal to Fear, Political Bias, and Perceived Dismissiveness. These findings underscore his use of emotionally charged and adversarial rhetoric to control the narrative and influence audience perception. In this work, we establish BEADS as a scalable and reproducible framework for critical discourse analysis across languages, domains, and political contexts.
Article 735
Title@2025-05-27 (2): MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Title: MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | MUSEG: Verstärktes zeitliches Verständnis von Video über Timestamp-Aware Multi-Segment Erdung | MUSEG:通过Timestamp-Aware多部分定位加强视频时间理解 2505.20715v1 |
Authors: Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
Article 736
Title@2025-05-27 (2): GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement
Title: GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | GigaSpeech 2: Ein sich entwickelnder, großformatiger und multidomänischer ASR-Korpus für ressourcenarme Sprachen mit Automatisiertem Crawling, Transkription und Verfeinerung | GigaSpeech2:具有自动拖网、拖网、拖网和精炼功能的低资源语言不断演化、大型和多领域ASR公司 2406.11546v2 |
Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus’s high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to Whisper large-v3, with merely 10% of the model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We hope that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.
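For readers unfamiliar with the metric behind the reported 25% to 40% reductions, the sketch below computes word error rate as word-level edit distance divided by reference length. It is a generic illustration on whitespace-tokenized text, not the paper's evaluation script; the function name `word_error_rate` is an assumption, and real evaluations for Thai in particular need language-specific segmentation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A relative reduction is then computed between two systems' WER values on the same test set.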
Article 737
Title@2025-05-27 (2): Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective
Title: Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective | Physik-Aufklärung in kleinen Sprachmodellen: Eine multidimensionale Analyse aus pädagogischer Perspektive | 《小语言模型中的物理原因解剖:从教育角度的多层次分析》 2505.20707v1 |
Authors: Nicy Scaria, Silvester John Joseph Kennedy, Diksha Seth, Deepak Subramani
Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom’s Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google’s Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high ‘answer accuracy’ (85%), but ‘fully correct reasoning’ was substantially lower (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.
Article 738
Title@2025-05-27 (2): NeUQI: Near-Optimal Uniform Quantization Parameter Initialization
Title: NeUQI: Near-Optimal Uniform Quantization Parameter Initialization | NeUQI: Beinahe-optimale einheitliche Quantisierung Parameter Initialisierung | NeUQI: 近最佳统一量化参数初始化 2505.17595v2 |
Authors: Li Lin, Xinyu Hu, Xiaojun Wan
Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
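To make the baseline concrete, here is a minimal sketch of the conventional Min-Max initialization for asymmetric uniform quantization that the abstract calls suboptimal. It is not NeUQI, whose procedure the abstract does not describe; the function names and the per-tensor (rather than per-group) granularity are assumptions.

```python
def minmax_init(weights, bits):
    """Min-Max initialization: map [min, max] onto the integer grid 0 .. 2^bits - 1."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    zero_point = round(-w_min / scale)
    return scale, zero_point


def quantize(weights, scale, zero_point, bits):
    """Round each weight to its nearest grid code, clamped to the valid range."""
    levels = 2 ** bits - 1
    return [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real-valued weights."""
    return [(q - zero_point) * scale for q in codes]
```

Because the scale and zero point are fixed entirely by the two extreme values, a single outlier stretches the grid for every other weight, which is one common reason this initialization is regarded as a weak starting point.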
Article 739
Title@2025-05-27 (2): Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Title: Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases | Zwischen Circuits und Chomsky: Pre-Pretraining auf Formal Languages Imparts Linguistic Biases | 巡回巡回和乔姆斯基之间:正式语言语言语言预科培训 2502.19249v2 |
Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model’s performance on syntactic evaluations.
Article 740
Title@2025-05-27 (2): RaDeR: Reasoning-aware Dense Retrieval Models
Title: RaDeR: Reasoning-aware Dense Retrieval Models | RaDeR: Vernünftige Dense-Retrieval-Modelle | RaDER: 合理觉悟常量检索模型 2505.18405v2 |
Authors: Debrup Das, Sam O’Nuallain, Razieh Rahimi
We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall performance. Notably, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work REASONIR, highlighting the quality of our synthesized training data.
Article 741
Title@2025-05-27 (2): Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Title: Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing | Erhöhung der Messlatte: Ermittlung der Werte von großen Sprachmodellen durch Generative Evolving-Tests | 提高律师资格:通过创造演变测试调查大语言模式的价值 2406.14230v4 |
Authors: Han Jiang, Xiaoyuan Yi, Zhihua Wei, Ziang Xiao, Shu Wang, Xing Xie
Warning: Contains harmful model outputs. Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, those static benchmarks suffer from evaluation chronoeffect, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA’s evaluation results are more consistent with models’ performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
Article 742
Title@2025-05-27 (2): vCache: Verified Semantic Prompt Caching
Title: vCache: Verified Semantic Prompt Caching | vCache: Verifizierter semantischer Prompt-Caching | vCache: 校验语义快速缓冲 2502.03771v3 |
Authors: Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez
Semantic caches return cached LLM-generated responses for semantically similar prompts to reduce inference latency and cost. They embed cached prompts and store them alongside their response in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, can result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines. We release the vCache implementation and benchmarks to support future research.
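As a rough illustration of the idea in the abstract above, the following sketch stores one similarity threshold per cached prompt instead of a single global threshold. It is a minimal sketch under stated assumptions, not the vCache implementation: the class name `PerPromptThresholdCache`, the `tighten` update rule, and the toy 2-D embeddings are all hypothetical, and the simple "raise the threshold after a wrong hit" rule only stands in for the paper's online learning algorithm, which the abstract does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PerPromptThresholdCache:
    """Toy semantic cache where every entry carries its own threshold (hypothetical)."""

    def __init__(self, default_threshold=0.9):
        self.default_threshold = default_threshold
        self.entries = []  # each entry: embedding, response, threshold

    def add(self, embedding, response):
        self.entries.append({
            "embedding": embedding,
            "response": response,
            "threshold": self.default_threshold,
        })

    def lookup(self, embedding):
        """Return (response, entry_index, score); response is None on a miss."""
        if not self.entries:
            return None, None, 0.0
        scores = [cosine(embedding, e["embedding"]) for e in self.entries]
        idx = max(range(len(scores)), key=scores.__getitem__)
        entry = self.entries[idx]
        hit = scores[idx] >= entry["threshold"]
        return (entry["response"] if hit else None), idx, scores[idx]

    def tighten(self, idx, score, margin=0.01):
        """Assumed stand-in update: after a wrong hit at `score`, raise this
        entry's threshold just above that score (not the paper's algorithm)."""
        entry = self.entries[idx]
        entry["threshold"] = min(1.0, max(entry["threshold"], score + margin))
```

In this sketch a caller would invoke `tighten` only when a returned response is later judged incorrect, so entries that are often confused with nearby prompts become stricter while the others keep serving hits.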
Article 743
Title@2025-05-27 (2): Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration
Title: Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration | Beyond Templates: Dynamische Anpassung von Reasoning-Demonstrationen durch Machbarkeits-Bewusst-Exploration | 超越模板:通过可行性研究软件探索对说明理由的演示进行动态调整 2505.20700v1 |
Authors: Yong Wu, Weihang Pan, Ke Li, Chen Binhui, Ping Li, Binbin Lin
Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student’s capacity – signaled by an Imitation Gap – the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency over static fine-tuning. Our method enhances supervision quality by aligning training signals with the student’s reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.
Article 744
Title@2025-05-27 (2): Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
Title: Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models | Token-Level Akzeptieren oder ablehnen: Ein Micro Alignment-Ansatz für große Sprachmodelle | 接受或拒绝时肯级别:大语言模式微调整方法 2505.19743v2 |
Authors: Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, which is designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are “Accepted” or “Rejected” as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
Article 745
Title@2025-05-27 (2): Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages
Title: Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages | Phir Hera Fairy: Ein englisches Märchen ist ein starker Faker der fließenden Rede in Low-Resource indischen Sprachen | Phir Hera Fairy:英国仙女是印度低资源语言流利流利的有力名人 2505.20693v1 |
Authors: Praveen Srinivasa Varadhan, Srija Anand, Soma Siddhartha, Mitesh M. Khapra
What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective, and the resultant IN-F5 is a near-human polyglot that enables speakers of one language (e.g., Odia) to fluently speak in another (e.g., Hindi). Our results show English pretraining aids low-resource TTS in reaching human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute-optimal strategy. Finally, we show IN-F5 can synthesize unseen languages like Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.
Article 746
Title@2025-05-27 (2): Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions
Title: Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions | Können wir Debias Social Stereotype in KI-generierten Bildern? Prüfung von Text-to-Image-Ausgaben und Benutzerwahrnehmungen | 我们能否在AI-光化图像中贬低社会陈规定型观念?审查文本到图像的产出和用户的看法 2505.20692v1 |
Authors: Saharsh Barve, Andy Mao, Jiayue Melissa Shi, Prerna Juneja, Koustuv Saha
Recent advances in generative AI have enabled visual content creation through text-to-image (T2I) generation. However, despite their creative potential, T2I models often replicate and amplify societal stereotypes – particularly those related to gender, race, and culture – raising important ethical concerns. This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to systematically evaluate social biases in T2I outputs. We audited three major T2I model outputs – DALL-E-3, Midjourney-6.1, and Stability AI Core – using 100 queries across three categories – geocultural, occupational, and adjectival. Our analysis reveals that initial outputs are prone to include stereotypical visual cues, including gendered professions, cultural markers, and western beauty norms. To address this, we adopted our rubric to conduct targeted prompt refinement using LLMs, which significantly reduced bias – SSI dropped by 61% for geocultural, 69% for occupational, and 51% for adjectival queries. We complemented our quantitative analysis through a user study examining perceptions, awareness, and preferences around AI-generated biased imagery. Our findings reveal a key tension – although prompt refinement can mitigate stereotypes, it can limit contextual alignment. Interestingly, users often perceived stereotypical images to be more aligned with their expectations. We discuss the need to balance ethical debiasing with contextual relevance and call for T2I systems that support global diversity and inclusivity while not compromising the reflection of real-world social complexity.
Article 747
Title@2025-05-27 (2): A Survey of LLM $\times$ DATA
Title: A Survey of LLM $\times$ DATA | Eine Umfrage über LLM $\times$ DATEN | LLM $\times$ DATA 综述 2505.18458v2 |
Authors: Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu
The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, supplies LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, and discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data; and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
Article 748
Title@2025-05-27 (2): SELF-PERCEPT: Introspection Improves Large Language Models’ Detection of Multi-Person Mental Manipulation in Conversations
Title: SELF-PERCEPT: Introspection Improves Large Language Models’ Detection of Multi-Person Mental Manipulation in Conversations | SELF-PERCEPT: Introspection verbessert die Erkennung von Multi-Person-Gedankenmanipulation in Gesprächen durch große Sprachmodelle | SELF-PERCEPT: 调查改进大语言模型在对话中探测多人心理操纵 2505.20679v1 |
Authors: Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation’s nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept.
Article 749
Title@2025-05-27 (2): Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
Title: Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples | Flow of Reasoning: Schulung von LLMs für divergente Reasoning mit minimalen Beispielen | 理由流动:不同理由与最微小例子培训LLM 2406.05673v6 |
Authors: Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin
The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning improves reasoning quality but requires vast labeled data, while reward-maximizing reinforcement learning finds top-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample divergent paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), Rubik’s Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning). Code is available at https://github.com/Yu-Fangxu/FoR.
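The sampling target described in the abstract above can be written compactly. The notation below is an assumption introduced for illustration (states $s_0, \dots, s_n$ on the reasoning DAG, forward policy $P_F$, backward policy $P_B$, reward $R$, and partition estimate $Z_\theta$ do not appear in the abstract). The second line shows trajectory balance, one common GFlowNet training condition; the abstract does not say which GFlowNet objective FoR actually adapts.

```latex
% Target: terminal states are sampled in proportion to their unnormalized reward.
P_\theta(s_n) \;=\; \sum_{\tau \,:\, s_0 \to s_n} \; \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta) \;\propto\; R(s_n)

% Trajectory balance (one standard way to enforce the target along a path \tau):
Z_\theta \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta) \;=\; R(s_n) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1}; \theta)
```

In contrast, reward-maximizing reinforcement learning concentrates probability on $\arg\max_{s_n} R(s_n)$, which is why it tends to neglect solution diversity.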
Article 750
Title@2025-05-27 (2): Pretraining Language Models to Ponder in Continuous Space
Title: Pretraining Language Models to Ponder in Continuous Space | Vorschulung von Sprachmodellen im kontinuierlichen Raum | 连续空间Ponder语言模型培训前 2505.20674v1 |
Authors: Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
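The pondering step described above (a probability-weighted sum of token embeddings fed back into another forward pass) can be sketched in a few lines. This is a minimal sketch based only on the abstract: `forward` is a hypothetical stand-in for a model's forward pass that maps an input embedding to vocabulary logits, and details such as how the pondering embedding is combined with the original input are assumptions, since the abstract does not give them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pondering_embedding(logits, embedding_table):
    """Weighted sum of all token embeddings under the predicted distribution."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


def ponder(forward, input_embedding, embedding_table, steps):
    """Run `steps` extra forward passes, feeding back the pondering embedding.

    `forward` is a hypothetical callable: embedding -> vocabulary logits.
    With steps=0 this reduces to a single ordinary forward pass.
    """
    logits = forward(input_embedding)
    for _ in range(steps):
        fed_back = pondering_embedding(logits, embedding_table)
        logits = forward(fed_back)
    return logits
```

Because the fed-back vector is a differentiable function of the logits, no token has to be sampled during pondering, which is consistent with the abstract's claim that the behavior can be learned through self-supervised training.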
Article 751
Title@2025-05-27 (2): Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
Title: Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System | Viele Köpfe sind besser als eins: Verbesserte wissenschaftliche Idee-Generation durch ein LLM-basiertes Multi-Agent-System | 许多领导人比一个领导人好得多:由以LLM为基础的多种机构系统改进科学思想的一代 2410.09403v4 |
Authors: Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, Nanqing Dong
Rapid scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address these limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.
Article 752
Title@2025-05-27 (2): Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Title: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders | Enthüllen sprachspezifischer Funktionen in großen Sprachmodellen über Sparse Autoencoder | 通过 Sparse 自动编译器在大语言模型中未解析特定语言特征 2505.05111v2 |
Authors: Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces LLMs’ abilities in only one language, leaving the others almost unaffected. Interestingly, we find that some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating them individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs. The code is publicly available at https://github.com/Aatrox103/multilingual-llm-features.
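To make the "sparse linear combination" and "ablation" wording above concrete, here is a minimal sketch of reconstructing an activation from SAE feature activations and of zeroing out one feature. The function names, the decoder layout (one row per feature), and the toy numbers are assumptions for illustration; they are not taken from the paper's released code.

```python
def reconstruct(feature_acts, decoder, bias):
    """Activation ~= bias + sum_i feature_acts[i] * decoder[i].

    `decoder[i]` is the (assumed) decoder direction of feature i.
    """
    dim = len(bias)
    return [
        bias[d] + sum(a * decoder[i][d] for i, a in enumerate(feature_acts))
        for d in range(dim)
    ]


def ablate(feature_acts, index):
    """Return a copy of the feature activations with one feature set to zero."""
    return [0.0 if i == index else a for i, a in enumerate(feature_acts)]


# Toy example: three features in a 2-D activation space, one of them inactive.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0]
acts = [2.0, 0.0, 1.0]

full = reconstruct(acts, decoder, bias)
without_feature_2 = reconstruct(ablate(acts, 2), decoder, bias)
```

In this toy setup, removing feature 2 changes the reconstruction only along that feature's decoder direction, which is the kind of targeted intervention the abstract describes for language-specific features.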
Article 753
Title@2025-05-27 (2): DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
Title: DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models | DRP: Destillierte Reasoning Pruning mit skill-aware Schritt Zersetzung für effiziente große Reasoning Modelle | DRP: 以技能认知方式逐步分解高效大型理由解释模型 2505.13975v2 |
Authors: Yuxuan Jiang, Dawei Li, Frank Ferraro
While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student’s reasoning capacity is critical for effective knowledge transfer and performance gains.
Article 754
Title@2025-05-27 (2): An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment
Title: An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment | Eine eingehende Bewertung großer Sprachmodelle in der Satzvereinfachung mit fehlerbasierter Human Assessment | 深入评价以基于错误的人类评估为根据的简化刑期的大语言模式 2403.04963v3 |
Authors: Xuanxin Wu, Yuki Arase
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs’ simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models’ performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation’s reliability. To address these problems, this study provides in-depth insights into LLMs’ performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs’ simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4’s and Qwen2.5-72B’s struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
Article 755
Title@2025-05-27 (2): Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Title: Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework | Multi-Faceted-Evaluierung lernen: Ein einheitliches und robustes Framework | 学习如何调整多面评价:统一和强有力的框架 2502.18874v3 |
Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
Article 756
Title@2025-05-27 (2): Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing
Title: Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing | Subtile Fehler bei der Begründung: Präferenz-Lernen durch Error-injected Self-editing | 理由解释中的字幕错误:通过错误输入自编辑学习偏好 2410.06638v4 |
Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs’ full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
Article 757
Title@2025-05-27 (2): Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Title: Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond | Auf dem Weg zu LLM Unlearning Resilient to Relearning Attacks: Eine scharfsinnige Minimierungsperspektive und darüber hinaus | 走向LLM 学会学会学会学会重新学习攻击的不学习能力:锐化-尽量减少知识的视角及展望 2502.05374v4 |
Authors: Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to “relearning” the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal’s impact in robustifying LLM unlearning. Codes are available at https://github.com/OPTML-Group/Unlearn-Smooth.
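The analogy drawn above between SAM and adversarial training can be seen by placing the two standard min-max objectives side by side. The symbols are assumptions chosen for illustration ($\theta$ for model weights, $\ell_u$ for an unlearning loss, $\ell$ for a task loss, $\rho$ and $\epsilon$ for perturbation budgets); the abstract does not fix a notation, and the paper's exact formulation may differ.

```latex
% Sharpness-aware minimization: worst-case perturbation of the WEIGHTS.
\min_{\theta} \; \max_{\|\delta\|_2 \le \rho} \; \ell_u(\theta + \delta)

% Adversarial training: worst-case perturbation of the INPUT.
\min_{\theta} \; \mathbb{E}_{(x, y)} \Big[ \max_{\|\delta\|_p \le \epsilon} \; \ell(\theta; x + \delta, y) \Big]
```

Under this reading, a relearning attack plays the role of the weight perturbation $\delta$: a few fine-tuning steps on forget data move $\theta$ slightly, and a flatter loss landscape around the unlearned weights makes such small moves less effective.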
Article 758
Title@2025-05-27 (2): Shadow-FT: Tuning Instruct via Base
Title: Shadow-FT: Tuning Instruct via Base | Shadow-FT: Tuning Instruct via Base | 影子-FT:通过基地的调试指示 2505.12716v2 |
Authors: Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Ngai Wong, Yujiu Yang
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (i.e., less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Codes and weights are available at https://github.com/wutaiqiang/Shadow-FT.
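The grafting step described above amounts to simple weight arithmetic: compute the update learned on the BASE model and add it to the INSTRUCT weights. The sketch below shows this on plain Python lists; the function name `graft_updates` and the dictionary-of-lists weight layout are assumptions for illustration, not the released Shadow-FT code.

```python
def graft_updates(base, tuned_base, instruct):
    """Return instruct + (tuned_base - base), parameter by parameter.

    Each argument maps a parameter name to a flat list of floats; all three
    are assumed to share the same names and shapes.
    """
    grafted = {}
    for name, w_instruct in instruct.items():
        # Weight update learned while fine-tuning the BASE model.
        delta = [t - b for t, b in zip(tuned_base[name], base[name])]
        # Graft that update onto the INSTRUCT weights.
        grafted[name] = [w + d for w, d in zip(w_instruct, delta)]
    return grafted


# Toy example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
tuned_base = {"w": [1.5, 1.0]}
instruct = {"w": [1.25, 2.25]}
shadow = graft_updates(base, tuned_base, instruct)
```

Since only the difference between two BASE checkpoints is transferred, the grafted model has exactly the same parameter count as the original INSTRUCT model, matching the abstract's statement that no additional parameters are introduced.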
nan
Article 759
Title@2025-05-27 (2): ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
Title: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning | ReMA: Meta-Denken lernen für LLMs mit Multi-Agenten-Verstärkungs-Lernen | ReMA:学习多机构强化学习的LLMLM的元思维 2503.09501v3 |
Authors: Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen
Recent research on reasoning in Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking, which enables models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. We further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found at https://github.com/ziyuwan/ReMA-public
nan
Article 760
Title@2025-05-27 (2): Knowledge Boundary of Large Language Models: A Survey
Title: Knowledge Boundary of Large Language Models: A Survey | Wissensgrenze von großen Sprachmodellen: Eine Umfrage | 大语言模式的知识范围:调查 2412.12472v2 |
Authors: Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, Yang Deng
Although large language models (LLMs) store a vast amount of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.
nan
Article 761
Title@2025-05-27 (2): How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines
Title: How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines | Wie können neurale Netzwerke mit Skalierungsgesetzen ausgebaut werden? Eine Umfrage und praktische Leitlinien | 如何提升具有扩展法的神经网络? 2502.12051v3 |
Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset volume, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies have highlighted their limitations across architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from over 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, advocating for adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.
nan
Article 762
Title@2025-05-27 (2): Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning
Title: Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning | Self-Route: Automatische Mode-Umschaltung über Capability-Schätzung für effizientes Reasoning | 自操作: 通过能力估计法进行自动模式转换,以高效理由推理 2505.20664v1 |
Authors: Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model’s ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
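The mode-selection idea above can be pictured with a small sketch. It assumes the router is a logistic probe over a capability-aware embedding and that a fixed threshold decides the mode; the probe form, the weights, the threshold, and the function names are all hypothetical and are not described in the abstract.

```python
import math


def solve_probability(embedding, probe_weights, probe_bias):
    """Estimate the chance that the model solves the problem without long reasoning.

    A logistic probe is an assumption made for this sketch.
    """
    logit = sum(x * w for x, w in zip(embedding, probe_weights)) + probe_bias
    return 1.0 / (1.0 + math.exp(-logit))


def choose_mode(embedding, probe_weights, probe_bias, threshold=0.5):
    """Route to the cheap general mode when the estimated capability is high enough."""
    p = solve_probability(embedding, probe_weights, probe_bias)
    return "general" if p >= threshold else "reasoning"


# Toy probe and embeddings, purely illustrative.
weights, bias = [1.0, -1.0], 0.0
easy_mode = choose_mode([2.0, 0.0], weights, bias)
hard_mode = choose_mode([0.0, 2.0], weights, bias)
print(easy_mode, hard_mode)
```

Token savings in such a scheme come from the fraction of inputs sent to the general mode, so the threshold trades accuracy against cost.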
nan
Article 763
Title@2025-05-27 (2): TeroSeek: An AI-Powered Knowledge Base and Retrieval Generation Platform for Terpenoid Research
Title: TeroSeek: An AI-Powered Knowledge Base and Retrieval Generation Platform for Terpenoid Research | TeroSeek: Eine KI-Powered Knowledge Base und Plattform zur Retrieval-Generation für Terpenoidforschung | TeroSeek: AI-Prepenorids研究知识库和检索生成平台 2505.20663v1 |
Authors: Xu Kang, Siqi Jiang, Kangwei Xu, Jiahao Li, Ruibo Wu
Terpenoids are a crucial class of natural products that have been studied for over 150 years, but their interdisciplinary nature (spanning chemistry, pharmacology, and biology) complicates knowledge integration. To address this, the authors developed TeroSeek, a curated knowledge base (KB) built from two decades of terpenoid literature, coupled with an AI-powered question-answering chatbot and web service. Leveraging a retrieval-augmented generation (RAG) framework, TeroSeek provides structured, high-quality information and outperforms general-purpose large language models (LLMs) in terpenoid-related queries. It serves as a domain-specific expert tool for multidisciplinary research and is publicly available at http://teroseek.qmclab.com.
nan
Article 764
Title@2025-05-27 (2): TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
Title: TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization | TailorKV: Hybrides Framework für lange Kontext-Inferenz durch maßgeschneiderte KV-Cache-Optimierung | 定制 KV: 通过定制 KV Cache 优化实现长文本推断的混合框架 2505.19586v2 |
Authors: Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. In particular, Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.
nan
Article 765
Title@2025-05-27 (2): BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism
Title: BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism | BacktrackAgent: Verbesserung des GUI-Agenten mit Fehlererkennung und Backtracking-Mechanismus | 后向跟踪:加强有错误探测和回溯跟踪机制的图形界面代理 2505.20660v1 |
Authors: Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent’s performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent achieves performance improvements in both task success rate and step accuracy on the Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
nan
Article 766
Title@2025-05-27 (2): DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
Title: DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | DynamicKV: Task-Aware Adaptive KV Cache-Kompression für LLMs mit langem Kontext | DiriveKV: 长期LMS 任务- 软件适应 KV 缓存压缩 2412.14838v4 |
Authors: Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding
Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task’s unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85% of the Full KV cache performance on LongBench. Notably, even under extreme compression (0.9%), DynamicKV surpasses state-of-the-art (SOTA) methods by 11% in the Needle-in-a-Haystack test using Mistral-7B-Instruct-v0.2. The code will be released.
nan
Article 767
Title@2025-05-27 (2): Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge
Title: Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge | Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen | 利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v1 |
Authors: Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on the STL-DivEn and DeepSTL datasets.
nan
Article 768
Title@2025-05-27 (2): Chinese Cyberbullying Detection: Dataset, Method, and Validation
Title: Chinese Cyberbullying Detection: Dataset, Method, and Validation | Chinesische Cyberbully-Erkennung: Datensatz, Methode und Validierung | 中国网络欺凌探测:数据集、方法和校验 2505.20654v1 |
Authors: Yi Zhu, Xin Zou, Xindong Wu
Existing cyberbullying detection benchmarks are organized by the polarity of speech, such as “offensive” and “non-offensive”, which essentially reduces the task to hate speech detection. However, in the real world, cyberbullying often attracts widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that is organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, which consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanation generation as an ensemble method to generate pseudo labels, and then let human annotators judge these labels. We then propose evaluation criteria for validating whether a set of comments constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can serve as a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study of the Chinese cyberbullying incident detection task.
nan
Article 769
Title@2025-05-27 (2): Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning
Title: Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning | Enthüllen der wichtigsten Faktoren für die Destillierung Kette-of-Thought-Reasoning | 理据链的理据的理据 2502.18001v3 |
Authors: Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen
Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format, and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.
nan
Article 770
Title@2025-05-27 (2): When More is Less: Understanding Chain-of-Thought Length in LLMs
Title: When More is Less: Understanding Chain-of-Thought Length in LLMs | Wenn mehr weniger ist: Verstehst du die Kettenlänge in LLMs? | 越少越多: 了解LLM 中所寻求的链条长度 2502.07266v3 |
Authors: Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, Yisen Wang
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To better understand these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length’s scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimal-length CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the “overthinking” phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.
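A toy calculation can make the inverted U-shape concrete. The sketch below is not the paper's theoretical model: it assumes a task of difficulty `T` is split evenly into `n` steps, that each step fails with a fixed noise rate `eps` plus a term that grows with the per-step load `T / (n * M)` for a model of capability `M`, and that the final answer is correct only if every step succeeds. All names and constants are hypothetical.

```python
def accuracy(n_steps, difficulty, capability, eps=0.02):
    """Toy probability that an n-step chain is fully correct (assumed form)."""
    per_step_load = difficulty / (n_steps * capability)
    step_error = min(1.0, eps + per_step_load ** 2)
    return (1.0 - step_error) ** n_steps


def best_length(difficulty, capability, max_steps=100):
    """Chain length in 1..max_steps with the highest toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: accuracy(n, difficulty, capability))


n_ref = best_length(difficulty=8, capability=4)
n_harder = best_length(difficulty=16, capability=4)    # harder task
n_stronger = best_length(difficulty=8, capability=8)   # more capable model
print(n_ref, n_harder, n_stronger)
```

In this toy setting, too few steps overload each step and too many steps accumulate noise, so accuracy peaks at an intermediate length; the peak moves right for harder tasks and left for more capable models, matching the qualitative trends the abstract reports.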
nan
Article 771
Title@2025-05-27 (2): FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
Title: FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information | Fintagging: Ein LLM-fähiger Benchmark für die Gewinnung und Strukturierung von Finanzinformationen | 金融信息抽取和结构安排:LLM已准备就绪的金融信息提取和结构框架基准 2505.20650v1 |
Authors: Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie
We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.
nan
Article 772
Title@2025-05-27 (2): DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
Title: DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization | DRPruning: Effiziente großsprachige Modellprüfung durch distributiv robuste Optimierung | DRP 运行:通过分布式强力优化实现高效大语言模式 2411.14055v2 |
Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jing Li, Min Zhang, Zhaopeng Tu
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. Further analysis demonstrates the robustness of DRPruning towards various domains and distribution shifts. Furthermore, DRPruning can determine optimal reference losses and data ratios automatically, suggesting potential for broader applications. Code and scripts are available at https://github.com/hexuandeng/DRPruning.
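The dynamic adjustment of the data distribution can be sketched with a generic distributionally robust reweighting step. This is a standard exponentiated-gradient update of the kind used in group DRO, not the authors' exact procedure; the function name, the step size, and the use of fixed reference losses are assumptions.

```python
import math


def update_domain_weights(weights, losses, ref_losses, step_size=1.0):
    """Shift sampling weight toward domains whose loss exceeds its reference."""
    # Excess loss per domain; domains at or below their reference get no boost.
    excess = [max(loss - ref, 0.0) for loss, ref in zip(losses, ref_losses)]
    # Multiplicative (exponentiated-gradient) update, then renormalize.
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Three hypothetical domains with uniform starting weights.
new_weights = update_domain_weights(
    weights=[1 / 3, 1 / 3, 1 / 3],
    losses=[2.0, 3.0, 2.5],
    ref_losses=[2.0, 2.0, 2.0],
)
print(new_weights)
```

Repeating this update during training samples more data from the domains that degraded most after pruning, which is the mechanism that pushes performance back toward balance.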
nan
Article 773
Title@2025-05-27 (2): STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models
Title: STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models | STEER-BENCH: Benchmark für die Bewertung der Steerability von großen Sprachmodellen | STEER-BENCH:评估大语言模型可耐性的基准 2505.20645v1 |
Authors: Kai Chen, Zihao He, Taiwei Shi, Kristina Lerman
Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy, depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.
nan
Article 774
Title@2025-05-27 (2): Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations
Title: Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations | Prompt-basierte LLMs für Position Bias-Aware Reranking in personalisierten Empfehlungen | 个人化建议中位置比亚软件重新排位的即时即时全资 2505.04948v2 |
Authors: Md Aminul Islam, Ahmed Sayeed Faruk
Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs’ ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking.
nan
Article 775
Title@2025-05-27 (2): A-MEM: Agentic Memory for LLM Agents
Title: A-MEM: Agentic Memory for LLM Agents | A-MEM: Agentischer Speicher für LLM-Agenten | A-MEM: LLM 剂的剂内存 2502.12110v8 |
Authors: Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems’ fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution: as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show improvements over existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/AgenticMemory, while the source code of the agentic memory system is available at https://github.com/agiresearch/A-mem.
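The note creation, linking, and evolution steps above can be sketched in a few lines. The sketch assumes keyword overlap (Jaccard similarity) as the similarity signal and tag merging as the "evolution" step; the abstract does not specify either, and the `Note` class, `add_note` function, and threshold are hypothetical stand-ins for the agent-driven decisions in the real system.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    content: str
    keywords: set
    tags: set
    links: list = field(default_factory=list)  # indices of linked notes


def jaccard(a, b):
    """Keyword overlap in [0, 1]; an assumed stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_note(memory, note, min_similarity=0.3):
    """Link the new note to similar notes and let it update their tags."""
    new_index = len(memory)
    for idx, existing in enumerate(memory):
        if jaccard(note.keywords, existing.keywords) >= min_similarity:
            note.links.append(idx)
            existing.links.append(new_index)
            existing.tags |= note.tags  # toy "memory evolution"
    memory.append(note)
    return note


memory = []
add_note(memory, Note("KV cache compression notes", {"kv", "cache", "compression"}, {"efficiency"}))
add_note(memory, Note("KV cache offloading notes", {"kv", "cache", "offloading"}, {"hardware"}))
add_note(memory, Note("TTS evaluation notes", {"speech", "evaluation"}, {"audio"}))
print([n.links for n in memory])
```

The first two notes share two of four distinct keywords, so they link to each other and the older note inherits the newer note's tag; the third note stays unlinked.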
nan
Article 776
Title@2025-05-27 (2): Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation
Title: Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation | Rethinking MUSHRA: Bewältigung moderner Herausforderungen in der Text-zu-Speech-Bewertung | 重新思考MUSHRA:应对文本到语音评价中的现代挑战 2411.12719v3 |
Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M. Khapra
Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS’s pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.
nan
Article 777
Title@2025-05-27 (2): GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
Title: GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration | GMoE: Stärkung von LLMs Feinsteuerung über MoE Graph Collaboration | GMOE:通过教育部图表合作,赋予LLMs Fine-Turning女士权力 2412.16216v3 |
Authors: Ting Bai, Yue Yu, Le Huang, Zenan Xu, Zhe Zhao, Chuan Shi
The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) confronts an inherent issue of load imbalance arising from the simplistic linear router strategy, which ultimately causes instability and inefficient learning in LLMs. To address this challenge, we introduce a novel MoE graph-based framework, GMoE, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE: the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy, to further release the capacity of each expert and increase the model stability in the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, i.e., Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaborations of multiple experts in LLM fine-tuning. The code of the experimental implementation is available at https://github.com/BAI-LAB/GMoE
Article 778
Title@2025-05-27 (2): STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
Title: STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing | STEM-POM: Bewertung von Sprachmodellen Mathe-Symbol-Reasoning in Document Parsing | STEM-POM: 评估文档分析中的语言模型数学类比理由 2411.00387v2 |
Authors: Jiaru Zou, Qing Wang, Pratyush Thakur, Nickvash Kani
Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs’ reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our extensive experiments demonstrate that state-of-the-art LLMs achieve an average accuracy of 20-60% under in-context learning and 50-60% with fine-tuning, highlighting a substantial gap in their ability to classify mathematical symbols. By improving LLMs’ mathematical symbol classification, STEM-PoM further enhances models’ downstream mathematical reasoning capabilities. The code and data are available at https://github.com/jiaruzouu/STEM-PoM.
Article 779
Title@2025-05-27 (2): Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
Title: Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models | Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models | 以信息引导为导向,对不利于大语言模式的自治歧视大语种模式采取因果干预 2504.12898v3 |
Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (ICD) framework. To eliminate biases within the instruction-tuning dataset, it is essential to ensure that these biases do not provide any additional information to predict the answers, i.e., the information gain of these biases for predicting the answers needs to be 0. Under this guidance, this framework utilizes a causal intervention-based data rewriting method to automatically and autonomously balance the distribution of instruction-tuning dataset for reducing the information gain. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that ICD can effectively debias LLM to improve its generalizability across different tasks.
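The zero-information-gain criterion described above can be illustrated with a small calculation. This is a minimal sketch, not the authors' implementation: it assumes a single discrete bias feature and discrete answers, and the names `entropy`, `information_gain`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(bias_feature, answers):
    """IG(answer; bias) = H(answer) - H(answer | bias)."""
    total = len(answers)
    groups = {}
    for bias_value, answer in zip(bias_feature, answers):
        groups.setdefault(bias_value, []).append(answer)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(answers) - conditional


# Hypothetical toy data: the bias feature fully determines the answer.
biased_feature = [1, 1, 0, 0]
biased_answers = ["yes", "yes", "no", "no"]

# Hypothetical rebalanced data: each bias value co-occurs equally with both answers.
balanced_feature = [1, 1, 0, 0]
balanced_answers = ["yes", "no", "yes", "no"]

ig_biased = information_gain(biased_feature, biased_answers)
ig_balanced = information_gain(balanced_feature, balanced_answers)
```

In the first toy set the bias feature carries one full bit about the answer; in the rebalanced set it carries none, which is the condition the abstract asks a debiased dataset to satisfy.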
Article 780
Title@2025-05-27 (2): Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing
Title: Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | Benchmarking und Pushing der Multi-Bias Elimination Boundary von LLMs über Causal Effect Schätzung-geführte Debiasing | 通过因果关系估测-制导偏向性,确定和推动消除长效LLMs的多比消除边界 2505.16522v2 |
Authors: Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu
Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks are proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfying, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.
Article 781
Title@2025-05-27 (2): Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning
Title: Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning | Monocle: Hybride lokale und globale In-Context-Evaluierung für die Langtext-Generierung mit unsicherem aktivem Lernen | 单项:对具有不确定和积极学习能力的长篇和不确定的代代人进行地方-全球混合文 文 评价 2505.20195v2 |
Authors: Xiaorong Wang, Ting Yang, Zhu Zhang, Shuo Wang, Zihan Zhou, Liner Yang, Zhiyuan Liu, Maosong Sun
Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
Article 782
Title@2025-05-27 (2): Test-Time Learning for Large Language Models
Title: Test-Time Learning for Large Language Models | Test-Time Learning für große Sprachmodelle | 大语言模型试验时间学习 2505.20633v1 |
Authors: Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
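As a rough illustration of the quantity being minimized, perplexity can be computed from per-token log-probabilities. The sketch below assumes natural-log token probabilities are already available from some model; the threshold-based selection rule and the names `perplexity` and `select_high_perplexity` are assumptions for illustration, since the abstract does not specify how high-perplexity samples are chosen.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_high_perplexity(samples, threshold):
    """Return names of samples whose perplexity exceeds a (hypothetical) threshold.

    `samples` maps a sample name to its list of per-token log-probabilities.
    """
    return [name for name, logprobs in samples.items()
            if perplexity(logprobs) > threshold]


# Hypothetical log-probabilities for two unlabeled test inputs.
samples = {
    "in_domain": [math.log(0.5)] * 4,       # each token has probability 0.5
    "out_of_domain": [math.log(0.1)] * 4,   # each token has probability 0.1
}
selected = select_high_perplexity(samples, threshold=5.0)
```

Here the in-domain sample has perplexity of about 2 and the out-of-domain sample about 10, so only the latter would be kept for a test-time update under this assumed rule.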
Article 783
Title@2025-05-27 (2): SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis
Title: SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis | SV-TrustEval-C: Bewertung von Struktur und semantischer Vernunft in großen Sprachmodellen für die Analyse von Quellencode-Anfälligkeiten | SV-信任值-C:在源码脆弱性分析大语言模型中评估结构和语义理由 2505.20630v1 |
Authors: Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, Stephan Jou
As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs’ abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.
Article 784
Title@2025-05-27 (2): Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration | Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration | 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v1 |
Authors: Sibo Xiao, Zixin Lin, Wenyang Gao, Yue Zhang
Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. However, these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) a question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.
Article 785
Title@2025-05-27 (2): POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization
Title: POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization | POLAR: Benchmark für multilinguale, multikulturelle und multi-eventuelle Online-Polarisierung | POLAR: 多种语言、多文化和多种晚上在线极化的基准 2505.20624v1 |
Authors: Usman Naseem, Juan Ren, Saba Anwar, Sarah Kohail, Rudy Alexandro Garrido Veliz, Robert Geislinger, Aisha Jabr, Idris Abdulmumin, Laiba Qureshi, Aarushi Ajay Borkar, Maryam Ibrahim Mukhtar, Abinew Ali Ayele, Ibrahim Said Ahmad, Adem Ali, Martin Semmann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multievent dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes: presence, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they achieve substantially lower scores when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.
Article 786
Title@2025-05-27 (2): Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Title: Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages | Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen | 努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查 2505.14874v2 |
Authors: Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
Article 787
Title@2025-05-27 (2): SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation
Title: SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation | SeqPO-SiMT: Sequentielle Politikoptimierung für die gleichzeitige maschinelle Übersetzung | SeqPO-SIMT:同步机器翻译的序列政策优化 2505.20622v1 |
Authors: Ting Xu, Zhichao Huang, Jiankai Sun, Shanbo Cheng, Wai Lam
We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.
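For readers unfamiliar with the latency metric mentioned above, the following sketch computes Average Lagging under its commonly used definition from the simultaneous-translation literature: the mean, over target positions up to the first one emitted after the whole source has been read, of how far the reader is ahead of an ideal diagonal policy. Whether the paper uses exactly this variant, and in which units, is not stated in the abstract, so treat the function as an assumption.

```python
def average_lagging(read_counts, source_len):
    """Average Lagging for one sentence pair.

    read_counts[t-1] is g(t): the number of source tokens read before
    emitting target token t. source_len is |x|; len(read_counts) is |y|.
    """
    target_len = len(read_counts)
    rate = target_len / source_len  # target-to-source length ratio
    # tau: first target step at which the full source has been read.
    tau = next(
        (t for t, read in enumerate(read_counts, start=1) if read >= source_len),
        target_len,
    )
    total = sum(read_counts[t - 1] - (t - 1) / rate for t in range(1, tau + 1))
    return total / tau


# Hypothetical wait-2 policy on a 5-token source and 5-token target:
# read 2 tokens, then alternate one write and one read.
wait_2 = [2, 3, 4, 5, 5]
al_wait_2 = average_lagging(wait_2, source_len=5)

# Hypothetical offline policy: read all 5 source tokens before writing anything.
offline = [5, 5, 5, 5, 5]
al_offline = average_lagging(offline, source_len=5)
```

Under this definition a wait-k policy on equal-length sentences lags by k tokens, and an offline system lags by the full source length, which gives a sense of scale for the reduction the abstract reports.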
Article 788
Title@2025-05-27 (2): LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
Title: LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers | LLM-FE: Automatisiertes Feature Engineering für Tabellendaten mit LLMs als Evolutionsoptimierer | LLM-FE: 制表数据的自动地貌工程,LLMM作为进化优化器 2503.14434v2 |
Authors: Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
Article 789
Title@2025-05-27 (2): Retrospex: Language Agent Meets Offline Reinforcement Learning Critic
Title: Retrospex: Language Agent Meets Offline Reinforcement Learning Critic | Retrospex: Sprachagent trifft Offline-Verstärkung Lernkritik | Retrospex: 语言代理 与离线强化学习中心相会 2505.11807v2 |
Authors: Yufei Xiang, Yiqun Shen, Yeqin Zhang, Cam-Tu Nguyen
Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM’s context. Instead, it combines the LLM’s action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline “retrospection” process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
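One way to picture the rescoring idea is a weighted blend of the two signals whose weight shifts toward the critic as an episode gets longer. The sketch below is only an illustration under that assumption: the linear schedule, the function names, and the toy numbers are hypothetical and are not taken from the paper.

```python
def rescore_actions(llm_logprobs, critic_values, step, max_steps):
    """Blend LLM action log-likelihoods with critic action values.

    The weight on the critic grows linearly with the step count
    (an assumed schedule, chosen only for illustration).
    """
    weight = min(step / max_steps, 1.0)
    return {
        action: (1.0 - weight) * llm_logprobs[action] + weight * critic_values[action]
        for action in llm_logprobs
    }


def pick_action(llm_logprobs, critic_values, step, max_steps):
    """Return the action with the highest blended score."""
    scores = rescore_actions(llm_logprobs, critic_values, step, max_steps)
    return max(scores, key=scores.get)


# Hypothetical candidate actions with LLM log-likelihoods and critic values.
llm_logprobs = {"open door": -0.2, "look around": -1.5}
critic_values = {"open door": 0.1, "look around": 0.9}

early_choice = pick_action(llm_logprobs, critic_values, step=0, max_steps=10)
late_choice = pick_action(llm_logprobs, critic_values, step=10, max_steps=10)
```

Early in the episode the LLM's preference wins; after many interactions the critic's experience-based estimate dominates, matching the qualitative behaviour the abstract describes.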
Article 790
Title@2025-05-27 (2): REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Title: REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning | REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning | 实际检索: 数学理由的回收增量精液预言 2505.20613v1 |
Authors: Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover, using only supervised fine-tuning, achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
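The Pass@64 figures above count a problem as solved if at least one of 64 attempts succeeds. When more than k attempts are sampled per problem, a widely used unbiased estimator of pass@k is 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number of successes. The sketch below implements that estimator; whether the paper computes its numbers this way is an assumption, and the per-problem counts are made up.

```python
from math import comb


def pass_at_k(num_attempts, num_correct, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if k > num_attempts:
        raise ValueError("k must not exceed the number of attempts")
    return 1.0 - comb(num_attempts - num_correct, k) / comb(num_attempts, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (attempts, correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (attempts, successful proofs).
results = [(64, 0), (64, 3), (64, 64), (64, 0)]
rate = benchmark_pass_at_k(results, k=64)
```

With exactly k attempts per problem the estimator reduces to a solved/unsolved indicator, so the benchmark rate is simply the fraction of problems with at least one successful proof.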
Article 791
Title@2025-05-27 (2): Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models | Roboflow100-VL: Ein Multi-Domain-Objekterkennungs-Benchmark für Vision-Language-Modelle | 机器人流100-VL:愿景-语言模型多功能物体探测基准 2505.20612v1 |
Authors: Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
Article 792
Title@2025-05-27 (2): Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings
Title: Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings | Hierarchische Mamba trifft auf Hyperbolische Geometrie: Ein neues Paradigma für strukturierte Spracheinbettungen | 等级式 Mamba 相遇超双曲几何: 结构化语言嵌入的新范式 2505.18973v2 |
Authors: Sarang Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu
Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), integrating the efficient Mamba2 with the exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball (via tangent-based mapping) or Lorentzian manifold (via cosine and sine-based mapping) with “learnable” curvature, optimized with a combined hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning. This makes it well-suited for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated HiM on four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrated that: 1) Both HiM models effectively capture hierarchical relationships for four ontological datasets, surpassing Euclidean baselines. 2) HiM-Poincare captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings favoring robustness over detail.
Article 793
Title@2025-05-27 (2): Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients
Title: Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients | Vergleiche zwischen einer großsprachigen, auf Echtzeit-Compound-Diagnostik basierenden medizinischen KI-Schnittstelle und Ärzten für Fälle der gewöhnlichen inneren Medizin mit simulierten Patienten | 使用模拟病人的大型语言模型基于实时复合诊断器实时诊断模型的医学AI 接口和使用模拟病人的普通内科病人医生对普通内科病例的比较 2505.20609v1 |
Authors: Hyungjun Park, Chang-Yun Woo, Seungjo Lim, Seunghwan Lim, Keunho Kwak, Ju Young Jeong, Chong Hyun Suh
Objective: To develop an LLM-based real-time compound diagnostic medical AI interface and to conduct a clinical trial comparing this interface with physicians on common internal medicine cases based on United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) style exams. Methods: A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results: The accuracy of the physicians’ first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy rate of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The AI interface ($0.08) also reduced costs by 98.1% compared to the physicians’ average ($4.2). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion: An LLM-based real-time compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.
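The reported time and cost savings follow directly from the figures quoted in the abstract. The short check below recomputes both percentages; the helper name `percent_reduction` is hypothetical and only the four numbers come from the text.

```python
def percent_reduction(baseline, new_value):
    """Relative reduction of new_value versus baseline, in percent (one decimal)."""
    return round((baseline - new_value) / baseline * 100, 1)


# Figures quoted in the abstract.
time_saving = percent_reduction(baseline=1006, new_value=557)   # seconds
cost_saving = percent_reduction(baseline=4.2, new_value=0.08)   # US dollars
```

Both values agree with the abstract: about 44.6% less time and 98.1% lower cost for the AI interface.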
Article 794
Title@2025-05-27 (2): NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
Title: NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human | NAP^2: Ein Benchmark für Natürlichkeit und Datenschutz-Erhaltung Text-Rewriting durch Lernen vom Menschen | 国家行动纲领第2号: “ 从人类学习 “ 的自然和隐私保护文本改写基准 2406.03749v2 |
Authors: Shuo Huang, William MacLean, Xiaoxi Kang, Qiongkai Xu, Zhuang Li, Xingliang Yuan, Gholamreza Haffari, Lizhen Qu
The widespread use of cloud-based Large Language Models (LLMs) has heightened concerns over user privacy, as sensitive information may be inadvertently exposed during interactions with these services. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works on anonymization, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments. Researchers interested in accessing the dataset are encouraged to contact the first or corresponding author via email.
Article 795
Title@2025-05-27 (2): Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation
Title: Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation | Auf dem Weg zur Vorschulung Robustes ASR-Stiftungsmodell mit akustisch-bewusster Datenvergrößerung | ASR基金会样板,配有声-声-声-声数据增强数据增强模型 2505.20606v1 |
Authors: Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong
Whisper’s robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.
Article 796
Title@2025-05-27 (2): TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Title: TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis | TCSinger 2: Anpassbare Mehrsprachige Null-Shot-Singen-Stimme-Synthese | TCSinger 2:可定制的多语种零弹唱声合成 2505.14910v2 |
Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 includes three key modules: 1) the Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions; 2) the Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts; and 3) the Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks. Singing voice samples are available at https://aaronz345.github.io/TCSinger2Demo/.
Article 797
Title@2025-05-27 (2): Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations
Title: Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations | Gender and Positional Biases in LLM-based Hiring Entscheidungen: Belege aus vergleichenden CV/Résumé Bewertungen | 以LLM为基础的雇用决定中的性别与职位两重情况:比较 CV/摘要评价中的证据 2505.17049v2 |
Authors: David Rozado
This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered name cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers “Candidate A” and “Candidate B”, several models displayed a preference to select “Candidate A”. Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate’s name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.
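The name-swapping design described above can be sketched in a few lines. This is an illustrative sketch only, not the study's code: `make_trials`, `tally`, and the `choose` callback (a stand-in for a model call that returns the index of the selected candidate) are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

# One trial = two candidates in prompt order, each a (gender_label, cv_text) pair.
Candidate = Tuple[str, str]
Trial = Tuple[Candidate, Candidate]


def make_trials(cv_first: str, cv_second: str) -> List[Trial]:
    """Present the same CV pair twice, swapping which CV carries which gendered name."""
    return [
        (("male", cv_first), ("female", cv_second)),
        (("female", cv_first), ("male", cv_second)),
    ]


def tally(trials: List[Trial], choose: Callable[[Trial], int]) -> Dict[str, int]:
    """Count selections by gender label and by prompt position.

    `choose` is a hypothetical stand-in for the model: it returns 0 or 1,
    the index of the candidate it selects.
    """
    counts = {"male": 0, "female": 0, "first": 0, "second": 0}
    for trial in trials:
        idx = choose(trial)
        counts[trial[idx][0]] += 1
        counts["first" if idx == 0 else "second"] += 1
    return counts
```

Because each CV appears once under each name, a chooser that always picks the first-listed candidate yields a 1:1 gender split but a 2:0 position split, which keeps positional bias from being mistaken for gender preference.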
Article 798
Title@2025-05-26 (1): Effectiveness of Prompt Optimization in NL2SQL Systems
Title: Effectiveness of Prompt Optimization in NL2SQL Systems | Wirksamkeit der Prompt-Optimierung in NL2SQL-Systemen | NL2SQL系统迅速优化的效能 2505.20591v1 |
Authors: Sairam Gurajada, Eser Kandogan, Sajjadur Rahman
NL2SQL approaches have greatly benefited from the impressive capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with sufficient contextual information, such as schema details and translation demonstrations. However, building an accurate system still requires the rigorous task of selecting the right context for each query, including identifying relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the go-to approach for identifying such context. While effective, these methods introduce additional inference-time costs due to the retrieval process. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems, rather than simply high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, the careful selection of a static set of exemplars (capturing the intricacies of the query log, target database, SQL constructs, and execution latencies) plays a more crucial role than exemplar selection based solely on similarity. The key challenge, however, lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.
Article 799
Title@2025-05-26 (1): Training a Generally Curious Agent
Title: Training a Generally Curious Agent | Ein allgemein neugieriger Agent ausbilden | 训练一般好奇剂 2502.17543v3 |
Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach’s primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
Article 800
Title@2025-05-26 (1): Emotion Classification In-Context in Spanish
Title: Emotion Classification In-Context in Spanish | Emotion Classification In-Context auf Spanisch | 西班牙文《情感分类西班牙文内引文》 2505.20571v1 |
Authors: Bipul Thapa, Gabriel Cofre
Classifying customer feedback into distinct emotion categories is essential for understanding sentiment and improving customer experience. In this paper, we classify customer feedback in Spanish into three emotion categories (positive, neutral, and negative) using advanced NLP and ML techniques. Traditional methods translate feedback from widely spoken languages to less common ones, resulting in a loss of semantic integrity and contextual nuances inherent to the original language. To address this limitation, we propose a hybrid approach that combines TF-IDF with BERT embeddings, effectively transforming Spanish text into rich numerical representations that preserve the semantic depth of the original language by using a Custom Stacking Ensemble (CSE) approach. To evaluate emotion classification, we utilize a range of models, including Logistic Regression, KNN, Bagging classifier with LGBM, and AdaBoost. The CSE model combines these classifiers as base models and uses a one-vs-all Logistic Regression as the meta-model. Our experimental results demonstrate that CSE significantly outperforms the individual models and the BERT model, achieving a test accuracy of 93.3% on the native Spanish dataset, higher than the accuracy obtained from the translated version. These findings underscore the challenges of emotion classification in Spanish and highlight the advantages of combining vectorization techniques like TF-IDF with BERT for improved accuracy. Our results provide valuable insights for businesses seeking to leverage emotion classification to enhance customer feedback analysis and service improvements.
Article 801
Title@2025-05-26 (1): The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Title: The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages | Der NaijaVoices-Datensatz: Pflege von großformatigen, qualitativ hochwertigen, kulturell-richschen Sprachdaten für afrikanische Sprachen | NaijaVoices数据集:培养非洲语言的大型、高质量、文化-Rich语音数据 2505.20564v1 |
Authors: Chris Emezue, The NaijaVoices Community, Busayo Awobade, Abraham Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages – including our focus, Igbo, Hausa, and Yoruba – remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices’ potential to advance multilingual speech processing for African languages.
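For readers unfamiliar with the metric, word error rate (WER) and an improvement figure of the kind quoted above could be computed roughly as follows. This is a minimal sketch: the abstract does not say whether its improvements are relative or absolute, so the relative reading in `relative_wer_improvement` is an assumption, and both function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer
```

Under this reading, a model whose WER drops from 0.40 to 0.10 after finetuning shows a 75% relative improvement.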
Article 802
Title@2025-05-26 (1): Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
Title: Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning | Jenseits von Markovian: Reflektierende Exploration über Bayes-Adaptive RL für LLM-Reasoning | 马尔科维安之后:通过Bayes-Adapative RL进行反射勘探,用于LLM 理由分析 2505.20561v1 |
Authors: Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why it is beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.
Article 803
Title@2025-05-26 (1): Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text
Title: Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text | Task-informierte Anti-Kurriculum durch Masken verbessert Downstream-Performance auf Text | 通过遮罩改进文字下流业绩,以任务化的反文体 2502.12953v2 |
Authors: Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
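To make the idea of a cyclic decaying masking ratio concrete, here is one possible schedule. It is a sketch under stated assumptions, not the TIACBM formula: the linear decay within a cycle, the geometric shrinking of each cycle's peak, and every parameter name and default value are hypothetical choices made for illustration.

```python
def cyclic_decaying_ratio(
    step: int,
    cycle_length: int,
    max_ratio: float = 0.30,  # hypothetical starting masking ratio
    min_ratio: float = 0.05,  # hypothetical floor
    decay: float = 0.5,       # hypothetical per-cycle shrink factor for the peak
) -> float:
    """Masking ratio that restarts each cycle at a lower peak and decays to the floor.

    Within a cycle the ratio falls linearly from the cycle's peak towards
    `min_ratio` (hard to easy); each new cycle begins at a smaller peak.
    """
    if step < 0 or cycle_length <= 0:
        raise ValueError("step must be >= 0 and cycle_length must be > 0")
    cycle, position = divmod(step, cycle_length)
    peak = min_ratio + (max_ratio - min_ratio) * (decay ** cycle)
    fraction_done = position / cycle_length
    return peak - (peak - min_ratio) * fraction_done
```

With the defaults and `cycle_length=10`, the ratio starts at 0.30, falls towards 0.05 during the first cycle, and restarts at 0.175 at step 10.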
Article 804
Title@2025-05-26 (1): Predicting Through Generation: Why Generation Is Better for Prediction
Title: Predicting Through Generation: Why Generation Is Better for Prediction | Vorhersagen durch Generation: Warum Generation besser für Vorhersagen ist | 通过一代人预测:为什么一代人更有利于预测 2502.17817v2 |
Authors: Md Kowsher, Nusrat Jahan Prottasha, Prakash Bhat, Chun-Nam Yu, Mojtaba Soltanalian, Ivan Garibay, Ozlem Garibay, Chen Chen, Niloofar Yousefi
This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task's required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
Article 805
Title@2025-05-26 (1): MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Title: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | MMLongBench: Benchmarking von langkontexten Visions-Sprachenmodellen effektiv und gründlich | MMLongBench:有效和彻底地确定长长、长、长、长、远景-语言模式的基准 2505.10610v2 |
Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models’ vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
Article 806
Title@2025-05-26 (1): From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Title: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning | Von Tokens zu Gedanken: Wie LLMs und Menschen Kompression für Bedeutung traden | 从Tokens到思想:LLM和人类如何用贸易压缩来达到意义 2505.17117v2 |
Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
Article 807
Title@2025-05-26 (1): Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models
Title: Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models | Skalierung über Skalierung: Untersuchung von Test-Zeit-Skalierung Pareto in großen vernünftigen Modellen | 缩放过缩放: 探索大型理由模型中的测试时间缩放派 2505.20522v1 |
Authors: Jian Wang, Boyan Zhu, Chak Tou Leong, Yongqi Li, Wenjie Li
Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation become critical challenges. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
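The notion of a saturation point can be illustrated with a deliberately simplified model of parallel scaling. The sketch below is not the paper's TTSPM: it assumes N independent samples that each succeed with the same probability p, so that at least one succeeds with probability 1 - (1 - p)^N, and it defines saturation as the first budget at which one extra sample adds less than a chosen `min_gain`. All names and the threshold rule are assumptions.

```python
def parallel_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("p must be in [0, 1] and n must be >= 0")
    return 1.0 - (1.0 - p) ** n


def saturation_budget(p: float, min_gain: float, max_n: int = 10_000) -> int:
    """Smallest n at which adding one more sample gains less than `min_gain`.

    Returns `max_n` if the marginal gain never drops below the threshold
    within the search range.
    """
    for n in range(1, max_n + 1):
        gain = parallel_success(p, n + 1) - parallel_success(p, n)
        if gain < min_gain:
            return n
    return max_n
```

In this toy model the marginal gain is p(1 - p)^n, which shrinks geometrically, so a harder problem (smaller p) pushes the saturation point to a larger budget.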
Article 808
Title@2025-05-26 (1): Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting
Title: Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting | Projekt Riley: Multimodaler Multi-Agent LLM Zusammenarbeit mit emotionaler Vernunft und Abstimmung | 莱利项目:与情感原因和投票合作 2505.20521v1 |
Authors: Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luis Frazão, Nuno Costa, António Pereira
This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar’s Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.
Article 809
Title@2025-05-26 (1): Aggregation Artifacts in Subjective Tasks Collapse Large Language Models’ Posteriors
Title: Aggregation Artifacts in Subjective Tasks Collapse Large Language Models’ Posteriors | Aggregation Artefakte in subjektiven Aufgaben Zusammenklappen der Poster von großen Sprachmodellen | 在主观任务中聚合个体行为 折叠大语言模型的别墅 2410.13776v4 |
Authors: Georgios Chochlakis, Alexandros Potamianos, Kristina Lerman, Shrikanth Narayanan
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). The knowledge acquired during pre-training is crucial for this few-shot capability, providing the model with task priors. However, recent studies have shown that ICL predominantly relies on retrieving task priors rather than “learning” to perform tasks. This limitation is particularly evident in complex subjective domains such as emotion and morality, where priors significantly influence posterior predictions. In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Moreover, we evaluate the posterior bias towards certain annotators by grounding our study in appropriate, quantitative measures of LLM priors. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead. However, aggregation does not explain the entire gap between ICL and the state of the art, meaning other factors in such tasks also account for the observed phenomena. Finally, by rigorously studying annotator-level labels, we find that it is possible for minority annotators to both better align with LLMs and have their perspectives further amplified.
Article 810
Title@2025-05-26 (1): Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects
Title: Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects | Multimodale Emotionserkennung in Gesprächen: Eine Übersicht über Methoden, Trends, Herausforderungen und Perspektiven | 在对话中多时的情感认识:对方法、趋势、挑战和前景的调查 2505.20511v1 |
Authors: Chengyan Wu, Yiqiang Cai, Yang Liu, Pengxu Zhu, Yun Xue, Ziwei Gong, Julia Hirschberg, Bolei Ma
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.
Article 811
Title@2025-05-26 (1): ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
Title: ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis | ArVoice: Ein Multi-Sprecher-Datensatz für die arabische Sprachsynthese | ArVoice:用于阿拉伯语语音合成的多发言者数据集 2505.20506v1 |
Authors: Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki
We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus consists of a total of 83.52 hours of speech across 11 voices; around 10 hours consist of human voices from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.
Article 812
Title@2025-05-26 (1): Large Language Models for IT Automation Tasks: Are We There Yet?
Title: Large Language Models for IT Automation Tasks: Are We There Yet? | Große Sprachmodelle für IT-Automatisierungsaufgaben: Sind wir noch da? | 信息技术自动化任务大语言模型:我们是否还存在? 2505.20505v1 |
Authors: Md Mahadi Hassan, John Salvador, Akond Rahman, Santu Karmaker
LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs’ ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which achieves a pass@10 rate above 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state-reconciliation-related reasoning (44.87% combined from variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37% combined from attribute and parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs’ ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.
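The abstract reports pass@10 but does not say how it was estimated. A common choice in code-generation evaluation is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass. The sketch below implements that estimator as an assumption about the general metric, not as a description of ITAB's exact procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that passed, k: evaluation budget.
    Equals the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all tasks; with n = k = 10, a task counts as solved as soon as any one of the ten samples passes.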
Article 813
Title@2025-05-26 (1): Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Title: Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review | Verkörperte KI mit Basismodellen für mobile Serviceroboter: Ein Systematischer Test | 与 “ 移动服务机器人:系统审查 “ 基金会模型 2505.20503v1 |
Authors: Matthew Lisondra, Beno Benhabib, Goldie Nejat
Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interactions, robots can better understand, adapt to, and execute complex tasks in dynamic real-world environments. However, embodied AI in mobile service robots continues to face key challenges, including multimodal sensor fusion, real-time decision-making under uncertainty, task generalization, and effective human-robot interactions (HRI). In this paper, we present the first systematic review of the integration of foundation models in mobile service robotics, identifying key open challenges in embodied AI and examining how foundation models can address them. Namely, we explore the role of such models in enabling real-time sensor fusion, language-conditioned control, and adaptive task execution. Furthermore, we discuss real-world applications in the domestic assistance, healthcare, and service automation sectors, demonstrating the transformative impact of foundation models on service robotics. We also include potential future research directions, emphasizing the need for predictive scaling laws, autonomous long-term adaptation, and cross-embodiment generalization to enable scalable, efficient, and robust deployment of foundation models in human-centric robotic systems.
Article 814
Title@2025-05-26 (1): Gatsby Without the ‘E’: Crafting Lipograms with LLMs
Title: Gatsby Without the ‘E’: Crafting Lipograms with LLMs | Gatsby ohne das ‘E’: Lipogramme mit LLMs herstellen | Gatsby没有“E”:用LLMs制作乳胶 2505.20501v1 |
Authors: Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter ‘e’. In this study, we explore the power of modern large language models (LLMs) by transforming F. Scott Fitzgerald’s novel The Great Gatsby into a fully ‘e’-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter ‘u’) had minimal impact on the text’s meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.
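The lipogram constraint and the synonym-replacement baseline mentioned above can be illustrated with a small sketch. The function names and the tiny synonym table are hypothetical; the abstract does not describe the authors' implementation:

```python
import re

WORD = re.compile(r"[A-Za-z']+")


def lipogram_violations(text: str, banned: str = "e") -> list[str]:
    """Return the words in `text` that contain any banned letter (case-insensitive)."""
    banned_set = set(banned.lower())
    return [w for w in WORD.findall(text) if banned_set & set(w.lower())]


def replace_with_synonyms(text: str, synonyms: dict[str, list[str]], banned: str = "e") -> str:
    """Swap each violating word for the first listed synonym free of banned
    letters; words without a usable synonym are left unchanged.
    Capitalization of the original word is not preserved."""
    banned_set = set(banned.lower())

    def swap(match: re.Match) -> str:
        word = match.group(0)
        if not banned_set & set(word.lower()):
            return word
        for candidate in synonyms.get(word.lower(), []):
            if not banned_set & set(candidate.lower()):
                return candidate
        return word

    return WORD.sub(swap, text)


# Hypothetical synonym table for illustration only.
table = {"the": ["that"], "great": ["grand"], "house": ["home", "mansion"]}
rewritten = replace_with_synonyms("the great house", table)
print(rewritten, lipogram_violations(rewritten))
```

A word-level baseline like this leaves any word without an ‘e’-free synonym untouched, which is one reason a generative rewriting approach is needed for a fully ‘e’-less text.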
Article 815
Title@2025-05-26 (1): Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism
Title: Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism | Beyond Keywords: Bewertung großsprachiger Modellklassifikation von Nuanced Ableism | 超越关键词:评价大语言多语言可变性模式分类 2505.20500v1 |
Authors: Naba Rizvi, Harper Strickland, Saleha Ahmedi, Aekta Kallepalli, Isha Khirwadkar, William Wu, Imani N. S. Munyaka, Nedjma Ousidhoum
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.
Article 816
Title@2025-05-26 (1): Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification
Title: Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification | Erklären Sie: Evidenz-getriebene Vorhersagen für erklärbare Drogenziel-Identifikation | 寻求解释:对可解释药物目标识别的由证据驱动的预测 2402.04068v4 |
Authors: Ravi Patel, Angus Brayne, Rogier Hintzen, Daniel Jaroslawicz, Georgiana Neculae, Dane Corneil
Language models hold incredible promise for enabling scientific discovery by synthesizing massive research corpora. Many complex scientific research questions have multiple plausible answers, each supported by evidence of varying strength. However, existing language models lack the capability to quantitatively and faithfully compare answer plausibility in terms of supporting evidence. To address this, we introduce Retrieve to Explain (R2E), a retrieval-based model that scores and ranks all possible answers to a research question based on evidence retrieved from a document corpus. The architecture represents each answer only in terms of its supporting evidence, with the answer itself masked. This allows us to extend feature attribution methods, such as Shapley values, to transparently attribute answer scores to supporting evidence at inference time. The architecture also allows incorporation of new evidence without retraining, including non-textual data modalities templated into natural language. We developed R2E for the challenging scientific discovery task of drug target identification, a human-in-the-loop process where failures are extremely costly and explainability paramount. When predicting whether drug targets will subsequently be confirmed as efficacious in clinical trials, R2E not only matches non-explainable literature-based models but also surpasses a genetics-based target identification approach used throughout the pharmaceutical industry.
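To make the idea of attributing an answer score to individual pieces of evidence concrete, here is a minimal sketch of exact Shapley values over a small evidence set. The scoring function and evidence names are hypothetical stand-ins, not R2E's model, and exact enumeration is only feasible for a handful of items:

```python
from itertools import combinations
from math import factorial


def shapley_values(items, value):
    """Exact Shapley value of each item, where value(frozenset) -> float
    scores a coalition of evidence. Cost grows as 2**len(items)."""
    n = len(items)
    phi = {item: 0.0 for item in items}
    for item in items:
        others = [x for x in items if x != item]
        for r in range(n):
            # Weight of coalitions of size r that exclude `item`.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[item] += weight * (value(s | {item}) - value(s))
    return phi


# Hypothetical additive evidence strengths for one candidate answer.
strength = {"paper_a": 0.5, "paper_b": 0.3, "paper_c": 0.1}


def answer_score(evidence):
    return sum(strength[e] for e in evidence)


attributions = shapley_values(list(strength), answer_score)
print(attributions)
```

For an additive score like this one, each item's Shapley value equals its own strength, and in general the attributions sum to the score of the full evidence set minus the score of the empty set.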
Article 817
Title@2025-05-26 (1): CLEVRER-Humans: Describing Physical and Causal Events the Human Way
Title: CLEVRER-Humans: Describing Physical and Causal Events the Human Way | CLEVRER-Mensch: Physikalische und kausale Ereignisse auf menschliche Weise beschreiben | CLEVRER-人类:将自然和因果事件描述为人类道路 2310.03635v2 |
Authors: Jiayuan Mao, Xuelin Yang, Xikun Zhang, Noah D. Goodman, Jiajun Wu
Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually-defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. We convert the collected CEGs into questions and answers to be consistent with prior work. Finally, we study a collection of baseline approaches for CLEVRER-Humans question-answering, highlighting the great challenges set forth by our benchmark.
Article 818
Title@2025-05-26 (1): Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages
Title: Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages | Inceptive Transformers: Erweitern von kontextuellen Darstellungen durch Multi-Scale-Feature-Lernen über Domains und Sprachen hinweg | 感动变异器:通过跨领域和跨语言的多阶段专题学习,加强背景代表方式 2505.20496v1 |
Authors: Asif Shahriar, Rifat Shahriyar, M Saifur Rahman
Conventional transformer models typically compress the information from all tokens in a sequence into a single [CLS] token to represent global context, an approach that can lead to information loss in tasks requiring localized or hierarchical cues. In this work, we introduce Inceptive Transformer, a modular and lightweight architecture that enriches transformer-based token representations by integrating a multi-scale feature extraction module inspired by inception networks. Our model is designed to balance local and global dependencies by dynamically weighting tokens based on their relevance to a particular task. Evaluation across a diverse range of tasks including emotion recognition (both English and Bangla), irony detection, disease identification, and anti-COVID vaccine tweets classification shows that our models consistently outperform the baselines by 1% to 14% while maintaining efficiency. These findings highlight the versatility and cross-lingual applicability of our method for enriching transformer-based representations across diverse domains.
Article 819
Title@2025-05-26 (1): InFact: Informativeness Alignment for Improved LLM Factuality
Title: InFact: Informativeness Alignment for Improved LLM Factuality | InFact: Informatives Alignment für verbesserte LLM-Faktizität | 事实:改进LLM事实质量的信息协调 2505.20487v1 |
Authors: Roi Cohen, Russa Biswas, Gerard de Melo
Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the factual sentence “Barack Obama was born in the United States” is factually correct, though less informative than the factual sentence “Barack Obama was born in Honolulu, Hawaii, United States”. Despite the known fact that LLMs tend to hallucinate and generate factually incorrect text, they might also tend to choose to generate factual text that is indeed factually correct and yet less informative than other, more informative choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. A key finding of our work is that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.
Article 820
Title@2025-05-26 (1): The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph
Title: The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph | Das Beste aus beiden Welten: Qualität und Vielfalt bei der Datenauswahl mit zweiteiligem Graphen überbrücken | 《最佳世界和最佳世界:在数据选择中将质量和多样性与双部分图联系起来》 2410.12458v2 |
Authors: Minghao Wu, Thuy-Trang Vu, Lizhen Qu, Gholamreza Haffari
The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects sentences with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six widely-used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscores the value of instruction diversity, and provides insights into how quality and diversity interact with different subset sizes.
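The selection loop described above (bipartite sentence to n-gram graph, multiplicative priority, removal of covered n-grams, recomputation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the quality scores are hypothetical inputs, and diversity is approximated by the count of a sentence's still-uncovered n-grams, which the abstract does not specify:

```python
def ngrams(sentence: str, n: int = 2) -> set[tuple[str, ...]]:
    """Set of word n-grams in a whitespace-tokenized, lowercased sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def greedy_select(sentences: list[str], quality: list[float], budget: int, n: int = 2) -> list[int]:
    """Greedily pick up to `budget` sentence indices by quality * diversity."""
    # Bipartite graph as adjacency sets: sentence index -> its n-grams.
    edges = {i: ngrams(s, n) for i, s in enumerate(sentences)}
    selected: list[int] = []
    while edges and len(selected) < budget:
        # Assumed priority: quality times number of uncovered n-grams.
        # Ties are broken in favor of the lower index.
        best = max(edges, key=lambda i: (quality[i] * len(edges[i]), -i))
        if quality[best] * len(edges[best]) == 0:
            break  # nothing left that adds new n-grams
        covered = edges.pop(best)
        selected.append(best)
        # Remove covered n-grams so remaining priorities reflect what is new.
        for i in edges:
            edges[i] -= covered
    return selected


corpus = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "dogs bark loudly at night",
]
picked = greedy_select(corpus, quality=[0.9, 0.8, 0.5], budget=2)
print(picked)
```

In this toy corpus the second sentence has a higher quality score than the third but shares most of its bigrams with the first, so after the first pick its priority drops below that of the lower-quality but novel third sentence.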
Article 821
Title@2025-05-26 (1): Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding
Title: Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding | Gesprächs-Kernel: Ein flexibler Mechanismus, um relevante Kontexte für Online-Konversations-Verständnis zu lernen | 对话核心:学习在线对话理解相关背景的灵活机制 2505.20482v1 |
Authors: Vibhor Agarwal, Arjoo Gupta, Suparna De, Nishanth Sastry
Understanding online conversations has attracted research attention with the growth of social networks and online discussion forums. Content analysis of posts and replies in online conversations is difficult because each individual utterance is usually short and may implicitly refer to other posts within the same conversation. Thus, understanding individual posts requires capturing the conversational context and dependencies between different parts of a conversation tree and then encoding the context dependencies between posts and comments/replies into the language model. To this end, we propose a general-purpose mechanism to discover appropriate conversational context for various aspects about an online post in a conversation, such as whether it is informative, insightful, interesting or funny. Specifically, we design two families of Conversation Kernels, which explore different parts of the neighborhood of a post in the tree representing the conversation and through this, build relevant conversational context that is appropriate for each task being considered. We apply our developed method to conversations crawled from slashdot.org, which allows users to apply highly different labels to posts, such as ‘insightful’, ‘funny’, etc., and therefore provides an ideal experimental platform to study whether a framework such as Conversation Kernels is general-purpose and flexible enough to be adapted to disparately different conversation understanding tasks.
Article 822
Title@2025-05-26 (1): BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics
Title: BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics | BrainStratify: Grob-zu-Fein-Entwechslung von intrakranieller Neuraldynamik | 大脑分解: 神经内神经动力学的粗向法解析 2505.20480v1 |
Authors: Hui Zheng, Hai-Teng Wang, Yi-Tao Jing, Pei-Yang Lin, Han-Qing Zhao, Wei Chen, Peng-Hu Wei, Yong-Zhi Shan, Guo-Guang Zhao, Yun-Zhe Liu
Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research. In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG). These neural signals capture rich population-level activity but present key challenges: (i) task-relevant neural signals are sparsely distributed across sEEG electrodes, and (ii) they are often entangled with task-irrelevant neural signals in both sEEG and ECoG. To address these challenges, we introduce a unified Coarse-to-Fine neural disentanglement framework, BrainStratify, which includes (i) identifying functional groups through spatial-context-guided temporal-spatial modeling, and (ii) disentangling distinct neural dynamics within the target functional group using Decoupled Product Quantization (DPQ). We evaluate BrainStratify on two open-source sEEG datasets and one (epidural) ECoG dataset, spanning tasks like vocal production and speech perception. Extensive experiments show that BrainStratify, as a unified framework for decoding speech from intracranial neural signals, significantly outperforms previous decoding methods. Overall, by combining data-driven stratification with neuroscience-inspired modularity, BrainStratify offers a robust and interpretable solution for speech decoding from intracranial recordings.
Article 823
Title@2025-05-26 (1): Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Title: Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought | Bias-Augmented Consistency Training reduziert biased Reasoning in Chain-of-Thought | 避免和强化的一致培训减少在寻求的连锁努力中造成不利和 不利理由 2403.05518v2 |
Authors: James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin
Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. But CoT can also systematically misrepresent the factors influencing models’ behavior – for example, rationalizing answers in line with a user’s opinion. We first create a new dataset of 9 different biases that affect GPT-3.5-Turbo and Llama-8b models. These consist of spurious-few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias, without mentioning the effect of the bias in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where ground truth reasoning is unavailable.
Article 824
Title@2025-05-26 (1): Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models
Title: Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models | Gestern Nachrichten: Benchmarking Multi-Dimensional Out-of-Distribution Verallgemeinerung von Misinformation Detection Modelle | 昨天的新闻:对错误信息探测模型的多种不同传播通用进行基准衡量 2410.18122v2 |
Authors: Ivo Verhoeven, Pushkar Mishra, Ekaterina Shutova
This article introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalization. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation detectors need to be able to perform out-of-distribution generalization, an attribute they currently lack. Our benchmark uses distant labelling to enable simulating covariate shifts in misinformation content. We identify time, event, topic, publisher, political bias, and misinformation type as important axes for generalization, and we evaluate a common class of baseline models on each. Using article metadata, we show how this model fails desiderata, which is not necessarily obvious from classification metrics. Finally, we analyze properties of the data to ensure limited presence of modelling shortcuts. We make the dataset and accompanying code publicly available: https://github.com/ioverho/misinfo-general
Article 825
Title@2025-05-26 (1): The Impact of a Chatbot’s Ephemerality-Framing on Self-Disclosure Perceptions
Title: The Impact of a Chatbot’s Ephemerality-Framing on Self-Disclosure Perceptions | Der Einfluss des Ephemerality-Framing eines Chatbots auf die Wahrnehmung der Selbstoffenbarung | 查塔博特人的即时态度对自我披露感知的影响 2505.20464v1 |
Authors: Samuel Rhys Cox, Rune Møberg Jacobsen, Niels van Berkel
Self-disclosure, the sharing of one’s thoughts and feelings, is affected by the perceived relationship between individuals. While chatbots are increasingly used for self-disclosure, the impact of a chatbot’s framing on users’ self-disclosure remains under-explored. We investigated how a chatbot’s description of its relationship with users, particularly in terms of ephemerality, affects self-disclosure. Specifically, we compared a Familiar chatbot, presenting itself as a companion remembering past interactions, with a Stranger chatbot, presenting itself as a new, unacquainted entity in each conversation. In a mixed factorial design, participants engaged with either the Familiar or Stranger chatbot in two sessions across two days, with one conversation focusing on Emotional- and another Factual-disclosure. When Emotional-disclosure was sought in the first chatting session, Stranger-condition participants felt more comfortable self-disclosing. However, when Factual-disclosure was sought first, these differences were replaced by more enjoyment among Familiar-condition participants. Qualitative findings showed Stranger afforded anonymity and reduced judgement, whereas Familiar sometimes felt intrusive unless rapport was built via low-risk Factual-disclosure.
Article 826
Title@2025-05-26 (1): Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
Title: Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection | Skalierungsgesetze für das Vergessen beim Finetuning mit Vorschulungs-Dateninjektion | 调整前数据输入时遗忘法律的扩大范围 2502.06042v2 |
Authors: Louis Bethune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, Pierre Ablin
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
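The 1% injection takeaway can be pictured as a sampling rule over two data sources. A minimal sketch, assuming a simple per-example Bernoulli mix; the function name, document lists, and sampling scheme are illustrative assumptions and not the authors' training pipeline:

```python
import random
from itertools import islice


def mixed_stream(target, pretrain, inject_rate=0.01, seed=0):
    """Yield finetuning examples forever, drawing each one from the
    pretraining set with probability `inject_rate`, else from the target set."""
    rng = random.Random(seed)
    while True:
        source = pretrain if rng.random() < inject_rate else target
        yield rng.choice(source)


# Hypothetical stand-ins for tokenized documents.
target_docs = ["target-a", "target-b", "target-c"]
pretrain_docs = ["pretrain-x", "pretrain-y"]

sample = list(islice(mixed_stream(target_docs, pretrain_docs), 10_000))
share = sum(doc.startswith("pretrain") for doc in sample) / len(sample)
print(f"pretraining share in sample: {share:.3%}")
```

The observed share fluctuates around the requested rate; in practice the rate is a knob to sweep across target domains and model scales.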
Article 827
Title@2025-05-26 (1): Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Title: Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries | Amulett: Komplexe Multi-Turn-Gespräche mit LLM Jurys auf dem Stand | Anulet: 将复杂多发多发对话与LLM Juries 挂起立 2505.20451v1 |
Authors: Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter’s significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.
Article 828
Title@2025-05-26 (1): HAMburger: Accelerating LLM Inference via Token Smashing
Title: HAMburger: Accelerating LLM Inference via Token Smashing | HAMburger: Beschleunigung der LLM-Inferenz durch Token Smashing | HAMburger:通过Token打碎加速LLM推理 2505.20438v1 |
Authors: Jingyu Liu, Ce Zhang
The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2× and achieves up to 2× TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.
Article 829
Title@2025-05-26 (1): Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Title: Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages? | Hilft Synthetische Daten bei der Nennung der Entitätserkennung für Sprachen mit geringer Ressource? | 合成数据是否有助于为低资源语言命名实体识别? 2505.16814v2 |
Authors: Gaurav Kamath, Sowmya Vajjala
Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
Article 830
Title@2025-05-26 (1): The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
Title: The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project | The UD-NewsCrawl Treebank: Reflexionen und Herausforderungen aus einem groß angelegten Tagalog Syntactic Annotation Project | UD-News-Crawcrow Treebank:大型Tagalog聚合笔记项目反思和挑战 2505.20428v1 |
Authors: Angelina A. Aquino, Lester James V. Miranda, Elsie Marie T. Or
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
Article 831
Title@2025-05-26 (1): SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Title: SEMMA: A Semantic Aware Knowledge Graph Foundation Model | SEMMA: Ein semantisches Wissensdiagramm-Stiftungsmodell | SEMMA: 语义认知知识图基础模型 2505.20422v1 |
Authors: Arvindh Arun, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, Steffen Staab
Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
Article 832
Title@2025-05-26 (1): GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation | GraphGen: Verbessertes Supervised Fine-Tuning für LLMs mit wissensgetriebener Synthetischer Datengenerierung | 图图Gen:加强具有知识驱动合成合成数据生成的LMLMs的监管微调 2505.20416v1 |
Authors: Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
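The expected calibration error named above has a standard binned definition, sketched here. How GraphGen applies it to locate knowledge gaps is not described in the abstract, so the inputs below are hypothetical confidence and correctness values:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean over bins of |mean confidence - accuracy|.
    `confidences` are probabilities in [0, 1]; `correct` are 0/1 outcomes."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model confidences on four statements, three judged correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(f"ECE = {ece:.2f}")
```

A large gap between stated confidence and observed accuracy on a group of statements is one plausible signal that the underlying knowledge is poorly learned.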
Article 833
Title@2025-05-26 (1): Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision
Title: Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision | Verbesserung der logischen Vernunft in Sprachmodellen durch symbolisch geführte Monte-Carlo-Prozessüberwachung | 通过有符号指导的蒙特卡洛进程监督,加强语文模式的逻辑理由解释 2505.20415v1 |
Authors: Xingwei Tan, Marco Valentino, Mahmud Akhter, Maria Liakata, Nikolaos Aletras
Large language models (LLMs) have shown promising performance in mathematical and logical reasoning benchmarks. However, recent studies have pointed to memorization, rather than generalization, as one of the leading causes for such performance. LLMs, in fact, are susceptible to content variations, demonstrating a lack of robust symbolic abstractions supporting their reasoning process. To improve reliability, many attempts have been made to combine LLMs with symbolic methods. Nevertheless, existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms. In this paper, we propose to overcome such limitations by generating symbolic reasoning trajectories and selecting the high-quality ones using a process reward model automatically tuned based on Monte Carlo estimation. The trajectories are then employed via fine-tuning methods to improve logical reasoning and generalization. Our results on logical reasoning benchmarks such as FOLIO and LogicAsker show the effectiveness of the proposed method with large gains on frontier and open-weight models. Moreover, additional experiments on claim verification reveal that fine-tuning on the generated symbolic reasoning trajectories enhances out-of-domain generalizability, suggesting the potential impact of symbolically-guided process supervision in alleviating the effect of memorization on LLM reasoning.
nan
Article 834
Title@2025-05-26 (1): SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Title: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents | SWE-Rebench: Eine automatisierte Pipeline für die Task Collection und die dekontaminierte Evaluation von Software Engineering Agents | SWE-rebench:软件工程剂任务收集和除污评价自动管道 2505.20411v1 |
Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected using the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.
nan
Article 835
Title@2025-05-26 (1): What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Title: What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models | Was änderte sich? Instruktionsgeführte Bildbearbeitungen mit multimodalen großen Sprachmodellen erkennen und bewerten | 以多模式大语言模式对指导指导图像编辑进行检测和评估 2505.20405v1 |
Authors: Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara
Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.
nan
Article 836
Title@2025-05-26 (1): MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Title: MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | MangaVQA und MangaLMM: Ein Benchmark und Spezialmodell für multimodales Manga-Verständnis | MangaVQA和MangaLMM:多模式漫画理解基准和专门模式 2505.20298v1 |
Authors: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
nan
Article 837
Title@2025-05-26 (1): DiSA: Diffusion Step Annealing in Autoregressive Image Generation
Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation | DiSA: Diffusionsschritt Annealing in autoregressiver Bildgenerierung | DiSA: 自动递减图像生成中的传播步骤 2505.20297v1 |
Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To explain intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
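The annealing idea described above (many diffusion steps for early tokens, fewer for later ones) can be sketched as a step schedule. This is only an illustration: the linear interpolation, the function name, and the parameter names are assumptions, and the abstract does not state which schedule shape DiSA actually uses. The 50 and 5 defaults mirror the example figures given in the abstract.

```python
def annealed_diffusion_steps(
    tokens_generated: int,
    total_tokens: int,
    start_steps: int = 50,
    end_steps: int = 5,
) -> int:
    """Return the number of diffusion steps to use for the next token.

    Assumes a linear decay from start_steps to end_steps as generation
    progresses (a hypothetical schedule chosen for simplicity).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    # Fraction of the sequence already generated, clamped to [0, 1].
    progress = min(max(tokens_generated / total_tokens, 0.0), 1.0)
    steps = start_steps + (end_steps - start_steps) * progress
    # Never drop below the final step budget.
    return max(end_steps, round(steps))


if __name__ == "__main__":
    for done in (0, 64, 128, 192, 256):
        print(done, annealed_diffusion_steps(done, 256))
```

In an existing sampler, a function like this would replace a fixed per-token step count; how that hooks into MAR, Harmon, FlowAR, or xAR depends on each codebase and is not shown here.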
nan
Article 838
Title@2025-05-26 (1): Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? | Selbstreflektierende Unsicherheiten: Kennen LLMs ihre interne Antwortverteilung? | 自我反感的不确定性:LLMs知道他们的内部答案分布吗? 2505.20295v1 |
Authors: Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, there exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM’s internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences between candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. However, we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future work towards this universal form of LLM uncertainties.
nan
Article 839
Title@2025-05-26 (1): Reasoning LLMs are Wandering Solution Explorers
Title: Reasoning LLMs are Wandering Solution Explorers | Grundlegende LLMs sind wandernde Lösungs-Explorer | 理据LLMs是游荡的解决方案探索者 2505.20296v1 |
Authors: Jiahao Lu, Ziwei Xu, Mohan Kankanhalli
Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models’ performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
nan
Article 840
Title@2025-05-26 (1): Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery
Title: Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery | Verbesserung der Verständlichkeit von Texterklärungen durch unüberwachte Concept Discovery | 通过未受监督的概念发现提高通过不受监督的概念解释的可理解性 2505.20293v1 |
Authors: Yifan Sun, Danding Wang, Qiang Sheng, Juan Cao, Jintao Li
Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adoption in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose ECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpass current counterparts in comprehensibility.
nan
Article 841
Title@2025-05-26 (1): Visualized Text-to-Image Retrieval
Title: Visualized Text-to-Image Retrieval | Visualisierung von Text-zu-Bild-Retrieval | 可视化文本到图像检索 2505.20291v1 |
Authors: Di Wu, Yixin Wan, Kai-Wei Chang
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
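Since the retrieval gains above are reported in NDCG@10, a short sketch of that metric may help. It uses the common linear-gain form with a log2 rank discount; whether the paper uses this or the exponential-gain variant is not stated in the abstract, and the function names and toy relevance lists are assumptions. The ideal ranking here is built from the retrieved list only, which assumes that list contains every relevant item.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for two rankings of the same three images.
    print(ndcg_at_k([1, 0, 0]))  # relevant image ranked first
    print(ndcg_at_k([0, 0, 1]))  # relevant image ranked third
```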
nan
Article 842
Title@2025-05-26 (1): Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Title: Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding | Time-R1: Nach dem Training Großer Vision-Sprachenmodell für die zeitliche Videoerdung | 时间-R1:培训后用于实时视频定位的大型视觉语言模型 2503.13377v2 |
Authors: Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
nan
Article 843
Title@2025-05-26 (1): VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Title: VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | VLM-3R: Vision-Language-Modelle erweitert mit instruction-aligned 3D reconstruction | VLM-3R:通过指示统一3D重建增强的愿景-语言模型 2505.20279v1 |
Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.
nan
Article 844
Title@2025-05-26 (1): The Coverage Principle: A Framework for Understanding Compositional Generalization
Title: The Coverage Principle: A Framework for Understanding Compositional Generalization | Das Coverage-Prinzip: Ein Rahmen für das Verständnis der kompositorischen Verallgemeinerung | 覆盖范围原则:理解普遍组成框架 2505.20278v1 |
Authors: Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interpretability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a mechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.
nan
Article 845
Title@2025-05-26 (1): OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction | OmniCharacter: Auf dem Weg zu immersiven Rollenspiel-Agenten mit nahtloser Sprach-Persönlichkeits-Interaktion | OmniCharacter:争取用无缝无言语-语言个性交互作用来模拟角色扮演剂 2505.20277v1 |
Authors: Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role’s voice traits (e.g., voice style and emotions), which play a crucial part in interaction and tend to make experiences more immersive in realistic scenarios. Towards this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results show that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
nan
Article 846
Title@2025-05-26 (1): Bias and Volatility: A Statistical Framework for Evaluating Large Language Model’s Stereotypes and the Associated Generation Inconsistency
Title: Bias and Volatility: A Statistical Framework for Evaluating Large Language Model’s Stereotypes and the Associated Generation Inconsistency | Bias and Volatility: Ein statistischer Rahmen für die Bewertung der Stereotypen und der damit verbundenen Inkonsistenz der Generation | 偏见和不稳定:评价大语言模式定型观念和关联一代人不一致的统计框架 2402.15481v5 |
Authors: Yiran Liu, Ke Yang, Zehan Qi, Xiao Liu, Yang Yu, ChengXiang Zhai
We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current alignment evaluation metrics often overlook stereotypes’ randomness caused by LLMs’ inconsistent generative behavior. For instance, LLMs may display contradictory stereotypes, such as those related to gender or race, for identical professions in different contexts. Ignoring this inconsistency risks misleading conclusions in alignment assessments and undermines efforts to evaluate the potential of LLMs to perpetuate or amplify social biases and unfairness. To address this, we propose the Bias-Volatility Framework (BVF), which estimates the probability distribution of stereotypes in LLM outputs. By capturing the variation in generative behavior, BVF assesses both the likelihood and degree to which LLM outputs negatively impact vulnerable groups, enabling a quantification of aggregated discrimination risk. Additionally, we introduce a mathematical framework to decompose this risk into bias risk (from the mean of the stereotype distribution) and volatility risk (from its variation). Applying BVF to 12 widely used LLMs, we find: i) Bias risk is the dominant contributor to discrimination; ii) Most LLMs exhibit substantial pro-male stereotypes across nearly all professions; iii) Reinforcement learning from human feedback reduces bias but increases volatility; iv) Discrimination risk correlates with socio-economic factors, such as professional salaries. Finally, we highlight BVF’s broader applicability for assessing how generation inconsistencies in LLMs impact behavior beyond stereotypes.
nan
Article 847
Title@2025-05-26 (1): Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Title: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs | Fann oder Flop: Ein Multigenre, Multiera Benchmark für arabische Poesie in LLMs | Fann 或 Flop: 多种语言、阿拉伯语诗类理解多元基准 2505.18152v2 |
Authors: Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Arabic poetry is one of the richest and most culturally rooted forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce Fann or Flop, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in 12 historical eras, covering 14 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator for testing how well an LLM understands classical Arabic through Arabic poetry. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release “Fann or Flop” along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.
nan
Article 848
Title@2025-05-26 (1): Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
Title: Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models | Durch Täuschung sehen: Irreführende Schöpfer-Intent in multimodalen Nachrichten mit Vision-Sprache-Modellen entdecken | 通过欺骗观察:以视觉语言模式在多模式新闻中揭开错误领导创造者意图的隐蔽 2505.15489v2 |
Authors: Jiaying Wu, Fanxiao Li, Min-Yen Kan, Bryan Hooi
The real-world impact of misinformation stems from the underlying misleading narratives that creators seek to convey. As such, interpreting misleading creator intent is essential for multimodal misinformation detection (MMD) systems aimed at effective information governance. In this paper, we introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent through two components: the desired influence and the execution plan. Using this framework, we construct DeceptionDecoded, a large-scale benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. The dataset captures both misleading and non-misleading intents and spans manipulations across visual and textual modalities. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. Despite recent advances, we observe that current VLMs fall short in recognizing misleading intent, often relying on spurious cues such as superficial cross-modal consistency, stylistic signals, and heuristic authenticity hints. Our findings highlight the pressing need for intent-aware modeling in MMD and open new directions for developing systems capable of deeper reasoning about multimodal misinformation.
nan
Article 849
Title@2025-05-26 (1): We Need to Measure Data Diversity in NLP – Better and Broader
Title: We Need to Measure Data Diversity in NLP – Better and Broader | Wir müssen die Datenvielfalt in NLP messen – besser und breiter | 我们需要在《国家劳工政策》中衡量数据多样性 – – 更好和更广泛 2505.20264v1 |
Authors: Dong Nguyen, Esther Ploeger
Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.
nan
Article 850
Title@2025-05-26 (1): Lifelong Safety Alignment for Language Models
Title: Lifelong Safety Alignment for Language Models | Lebenslange Sicherheitsausrichtung für Sprachmodelle | 语言模型终身安全比对 2505.20259v1 |
Authors: Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker’s success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
nan
Article 851
Title@2025-05-26 (1): On the Compatibility of Generative AI and Generative Linguistics
Title: On the Compatibility of Generative AI and Generative Linguistics | Über die Vereinbarkeit generativer KI und generativer Linguistik | 关于 “ 创造性语言 “ 和 “ 创造性语言 “ 的兼容性 2411.10533v2 |
Authors: Eva Portelance, Masoud Jasbi
In the mid-20th century, the linguist Noam Chomsky established generative linguistics, and made significant contributions to linguistics, computer science, and cognitive science by developing the computational and philosophical foundations for a theory that defined language as a formal system, instantiated in human minds or artificial machines. These developments in turn ushered in a wave of research on symbolic Artificial Intelligence (AI). More recently, a new wave of non-symbolic AI has emerged with neural Language Models (LMs) that exhibit impressive linguistic performance, leading many to question the older approach and wonder about the compatibility of generative AI and generative linguistics. In this paper, we argue that generative AI is compatible with generative linguistics and reinforces its basic tenets in at least three ways. First, we argue that LMs are formal generative models as intended originally in Chomsky’s work on formal language theory. Second, LMs can help develop a program for discovery procedures as defined by Chomsky’s “Syntactic Structures”. Third, LMs can be a major asset for Chomsky’s minimalist approach to Universal Grammar and language acquisition. In turn, generative linguistics can provide the foundation for evaluating and improving LMs as well as other generative computational models of language.
nan
Article 852
Title@2025-05-26 (1): ARM: Adaptive Reasoning Model
Title: ARM: Adaptive Reasoning Model | ARM: Anpassungsorientiertes Modell | ARM:适应性理由说明模式 2505.20258v1 |
Authors: Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the “overthinking” problem – excessive and unnecessary reasoning – which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones – Direct Answer, Short CoT, and Code – as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens – ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
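The Consensus-Guided Mode described above (query the three efficient formats, fall back to Long CoT on disagreement) can be sketched as a small control-flow wrapper. The function and parameter names are hypothetical, the solvers are stand-in callables rather than ARM's real interface, and treating "agreement" as exact string equality across all efficient answers is an assumption; the abstract does not say how answers are compared.

```python
from typing import Callable, Sequence

Solver = Callable[[str], str]


def consensus_guided_answer(
    question: str,
    efficient_solvers: Sequence[Solver],
    long_cot_solver: Solver,
) -> str:
    """Return the shared answer of the efficient formats, else defer to Long CoT."""
    answers = [solve(question) for solve in efficient_solvers]
    # Assumption: consensus means every efficient format gives the same answer.
    if answers and all(answer == answers[0] for answer in answers):
        return answers[0]
    # Disagreement: spend more tokens on the elaborate format.
    return long_cot_solver(question)


if __name__ == "__main__":
    # Toy stand-ins for Direct Answer, Short CoT, and Code.
    direct = lambda q: "4"
    short_cot = lambda q: "4"
    code = lambda q: "5"
    long_cot = lambda q: "4 (checked step by step)"
    print(consensus_guided_answer("2 + 2 = ?", [direct, short_cot, code], long_cot))
```

The design trades tokens for reliability: the expensive format only runs when the cheap ones disagree, which matches the abstract's description of prioritizing performance at higher token usage.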
Article 853
Title@2025-05-26 (1): The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language
Title: The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language | Der Faetar-Benchmark: Spracherkennung in einer sehr unterbesetzten Sprache | Faetar基准:以资源非常不足的语言进行语音承认 2409.08103v4 |
Authors: Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.
Article 854
Title@2025-05-26 (1): Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Title: Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs | Position: Mechanische Dolmetschbarkeit sollte Feature-Konsistenz in SAEs priorisieren | 位置: 机械可解释性:应优先考虑高级专业环境评估中的地物一致性 2505.20254v1 |
Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs – the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
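To make the idea of a pairwise consistency score concrete, here is a rough sketch in the spirit of PW-MCC. It is not the paper's definition: it assumes each run yields an equal-sized dictionary of nonzero feature vectors, uses absolute cosine similarity as the per-feature score, and finds the best one-to-one matching by brute force, which is feasible only for tiny dictionaries. All function names are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def matched_mean_similarity(dict_a: List[Vector], dict_b: List[Vector]) -> float:
    """Mean |cosine| under the best one-to-one feature matching (brute force)."""
    n = len(dict_a)
    sims = [[abs(cosine(a, b)) for b in dict_b] for a in dict_a]
    best = max(sum(sims[i][p[i]] for i in range(n)) for p in permutations(range(n)))
    return best / n


def pairwise_consistency(dictionaries: List[List[Vector]]) -> float:
    """Average the matched similarity over all unordered pairs of runs."""
    pairs = [(i, j) for i in range(len(dictionaries)) for j in range(i + 1, len(dictionaries))]
    return sum(matched_mean_similarity(dictionaries[i], dictionaries[j]) for i, j in pairs) / len(pairs)
```

Two runs that learn the same features in a different order and at a different scale score 1.0 under this sketch; realistic dictionary sizes would need an assignment solver rather than enumeration of permutations.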
Article 855
Title@2025-05-26 (1): Learning Extrapolative Sequence Transformations from Markov Chains
Title: Learning Extrapolative Sequence Transformations from Markov Chains | Extrapolative Sequenztransformationen von Markov-Ketten lernen | 来自Markov 链条的学习外推序列变换 2505.20251v1 |
Authors: Sophia Hager, Aleem Khan, Andrew Wang, Nicholas Andrews
Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that extrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well-designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency.
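As a toy illustration of the pipeline sketched in this abstract, the code below runs a Metropolis chain over strings and then keeps a subset of states as candidate training data. The target density proportional to exp(score), the single-character substitution proposal, and the "keep states that set a new best score" selection rule are all assumptions made for illustration; the abstract does not say how states are selected.

```python
import math
import random
from typing import Callable, List


def metropolis_chain(init: str, score: Callable[[str], float], steps: int,
                     rng: random.Random, alphabet: str = "ab") -> List[str]:
    """Metropolis sampling with single-character substitutions (a symmetric proposal)."""
    chain = [init]
    current = init
    for _ in range(steps):
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(alphabet) + current[pos + 1:]
        delta = score(proposal) - score(current)
        # Accept with probability min(1, exp(delta)).
        if rng.random() < math.exp(min(0.0, delta)):
            current = proposal
        chain.append(current)
    return chain


def improving_states(chain: List[str], score: Callable[[str], float]) -> List[str]:
    """Keep only states that set a new best score (a hypothetical selection rule)."""
    kept, best = [], float("-inf")
    for state in chain:
        if score(state) > best:
            kept.append(state)
            best = score(state)
    return kept
```

The kept states form a short, monotonically improving trajectory, which is the kind of data a sequence model could be trained on to take larger steps than the chain itself.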
Article 856
Title@2025-05-26 (1): Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models
Title: Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models | Beyond the Tip of Efficiency: Enthüllen der untergetauchten Bedrohungen von Jailbreak Attacken in kleinen Sprachmodellen | 超越 “ 效率之便 “ :以小语言模式破狱袭击的潜伏威胁 2502.19883v3 |
Authors: Sibo Yi, Tianshuo Cong, Xinlei He, Qi Li, Jiaxing Song
Small language models (SLMs) have become increasingly prominent in deployment on edge devices due to their high efficiency and low computational cost. While researchers continue to advance the capabilities of SLMs through innovative training strategies and model compression techniques, the security risks of SLMs have received considerably less attention compared to large language models (LLMs). To fill this gap, we provide a comprehensive empirical study to evaluate the security performance of 13 state-of-the-art SLMs under various jailbreak attacks. Our experiments demonstrate that most SLMs are quite susceptible to existing jailbreak attacks, while some of them are even vulnerable to direct harmful prompts. To address the safety concerns, we evaluate several representative defense methods and demonstrate their effectiveness in enhancing the security of SLMs. We further analyze the potential security degradation caused by different SLM techniques including architecture compression, quantization, knowledge distillation, and so on. We expect that our research can highlight the security challenges of SLMs and provide valuable insights to future work in developing more robust and secure SLMs.
Article 857
Title@2025-05-26 (1): WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
Title: WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models | WXImpactBench: Ein disruptives Wetter-Impact-Verständnis Benchmark für die Bewertung großer Sprachmodelle | WXImpact Bennech:评估大语言模型的干扰天气影响理解基准 2505.20249v1 |
Authors: Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber
Climate change adaptation requires an understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted to and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a well-crafted four-stage construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
Article 858
Title@2025-05-26 (1): KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing
Title: KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing | KnowTrace: Bootstrapping iterative Retrieval-Augmented Generation mit strukturierter Wissensverfolgung | KnowTrace: 与结构化知识追踪相配套的 刺激性迭代回收- 启动型生成 2505.20245v1 |
Authors: Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM’s context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.
Article 859
Title@2025-05-26 (1): On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Title: On Path to Multimodal Historical Reasoning: HistBench and HistAgent | Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent | 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v1 |
Authors: Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Xiao Yao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI’s capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
Article 860
Title@2025-05-26 (1): It’s High Time: A Survey of Temporal Information Retrieval and Question Answering
Title: It’s High Time: A Survey of Temporal Information Retrieval and Question Answering | Es ist höchste Zeit: Eine Umfrage der zeitlichen Informationen Retrieval und Fragen beantworten | 《高时:时间信息检索和回答问题调查》 2505.20243v1 |
Authors: Bhawna Piryani, Abdelrahman Abdullah, Jamshid Mozafari, Avishek Anand, Adam Jatowt
Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.
Article 861
Title@2025-05-26 (1): Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models
Title: Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models | Vielfältig, nicht kurz: Ein längengesteuerter Selbstlernrahmen zur Verbesserung der Antwortvielfalt von Sprachmodellen | 多样性,不是短的:提高语文模式应对多样性的长期控制自学框架 2505.16245v2 |
Authors: Vijeta Deshpande, Debasmita Ghose, John D. Patterson, Roger Beaty, Anna Rumshisky
Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity, with minor reductions or gains in response quality, on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B and 13B) show that smaller models like Olmo-2-7B can serve as effective “diversity teachers” for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
Article 862
Title@2025-05-26 (1): MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Title: MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | MMLU-ProX: Mehrsprachiger Benchmark für eine erweiterte Bewertung von großen Sprachmodellen | MMLU-ProX:高级大语言模式评价多语种基准 2503.10497v2 |
Authors: Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, Felix Juefei-Xu, Foutse Khomh, Osamu Yoshie, Qingyu Chen, Douglas Teodoro, Nan Liu, Randy Goebel, Lei Ma, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs’ performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
Article 863
Title@2025-05-26 (1): RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Title: RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | RAGEN: Selbst-Evolution in LLM-Agenten durch Multi-Turn-Verstärkungs-Lernen verstehen | 通过多阶段强化学习了解LLM代理商的自我演变 2504.20073v2 |
Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring mode, Echo Trap, marked by reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find that the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
Article 864
Title@2025-05-26 (1): BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
Title: BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving | BPP-Suche: Verbesserung des Baumes der Gedanken Grund für mathematische Modellierung Problem Lösung | BPP-Search:为数学建模问题解决加强思考理由树 2411.17404v4 |
Authors: Teng Wang, Wing-Yin Yu, Zhenqi He, Zehua Liu, Hailei Gong, Han Wu, Xiongwei Han, Wei Shi, Ruifeng She, Fangzhou Zhu, Tao Zhong
LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in the operations research domain lack detailed annotations of the modeling process, such as variable definitions, and focus solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions. The StructuredOR dataset is available on Huggingface https://huggingface.co/datasets/LLM4OR/StructuredOR and GitHub https://github.com/LLM4OR/StructuredOR.
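The beam-search component can be sketched generically as follows. This is not BPP-Search itself: the `expand` and `reward` callables are hypothetical stand-ins for a step generator and a process reward model, and the pairwise preference step mentioned in the abstract is omitted.

```python
from typing import Callable, List, TypeVar

Node = TypeVar("Node")


def beam_search(root: Node, expand: Callable[[Node], List[Node]],
                reward: Callable[[Node], float], beam_width: int, depth: int) -> Node:
    """Expand the tree level by level, keeping only the top-scoring partial solutions."""
    beam = [root]
    for _ in range(depth):
        candidates = [child for node in beam for child in expand(node)]
        if not candidates:
            break
        # Score every candidate with the (assumed) process reward model and prune.
        beam = sorted(candidates, key=reward, reverse=True)[:beam_width]
    return max(beam, key=reward)
```

With a branching factor b, this scores at most `beam_width * b` candidates per level instead of the `b ** depth` leaves an exhaustive search would visit.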
Article 865
Title@2025-05-26 (1): Efficient Speech Translation through Model Compression and Knowledge Distillation
Title: Efficient Speech Translation through Model Compression and Knowledge Distillation | Effiziente Sprachübersetzung durch Modellkompression und Wissensdestillation | 通过模型压缩和知识蒸馏高效语音翻译 2505.20237v1 |
Authors: Yasmin Moslem
Efficient deployment of large audio-language models for speech translation remains challenging due to their significant computational requirements. In this paper, we address this challenge through our system submissions to the “Model Compression” track at the International Conference on Spoken Language Translation (IWSLT 2025). We experiment with a combination of approaches including iterative layer pruning based on layer importance evaluation, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. In our experiments, we use Qwen2-Audio-7B-Instruct for speech translation into German and Chinese. Our pruned (student) models achieve up to a 50% reduction in both model parameters and storage footprint, while retaining 97-100% of the translation quality of the in-domain (teacher) models.
Article 866
Title@2025-05-26 (1): Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue
Title: Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue | Überbrückung der langfristigen Lücke: Eine memory-aktive Politik für den aufgabenorientierten Dialog mit mehreren Sessions | 缩小长期差距:多会议着重任务的对话的记忆 - 积极政策 2505.20231v1 |
Authors: Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong
Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce MS-TOD, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for evaluating long-term memory in multi-session TOD. Based on this new dataset, we propose a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach. 1) Memory-Guided Dialogue Planning retrieves intent-aligned history, identifies key QA units via a memory judger, refines them by removing redundant questions, and generates responses based on the reconstructed memory. 2) Proactive Response Strategy detects and corrects errors or omissions, ensuring efficient and accurate task completion. We evaluate MAP on the MS-TOD dataset, focusing on response quality and the effectiveness of the proactive strategy. Experiments on MS-TOD demonstrate that MAP significantly improves task success and turn efficiency in multi-session scenarios, while maintaining competitive performance on conventional single-session tasks.
Article 867
Title@2025-05-26 (1): FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models | FLAME-MoE: Eine transparente End-to-End-Forschungsplattform für Mixture-of-Experts-Sprachmodelle | FLAME-MOE:混合专家语言模型透明端对端研究平台 2505.20225v1 |
Authors: Hao Kang, Zichun Yu, Chenyan Xiong
Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture (64 experts with top-8 gating and 2 shared experts) closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
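For readers unfamiliar with top-k gating, a minimal routing sketch follows. It assumes the router emits one logit per expert and that the weights of the selected experts are renormalized with a softmax over the selected logits; implementations differ on this detail, and the abstract does not say which variant FLAME-MoE uses. Shared experts, which would be applied to every token regardless of the router, are omitted.

```python
import math
from typing import List, Tuple


def top_k_gate(router_logits: List[float], k: int = 8) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]
```

With 64 routed experts and k = 8, only one eighth of the routed experts run for each token, which is the source of the efficiency trade-off the abstract mentions.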
Article 868
Title@2025-05-26 (1): Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Title: Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction | Rollen Sie die Würfel & Blick, bevor Sie springen: Gehen über die kreativen Grenzen der Next-Token-Vorhersage | 跳跃前的骰子滚动和看一看:超越了次声预测的创造性极限 2504.15266v2 |
Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (as in wordplay, drawing analogies, or research) or (b) constructs new patterns (as in designing math problems or new proteins). In these tasks, we empirically and conceptually argue that next-token learning is myopic and memorizes excessively; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available at https://github.com/chenwu98/algorithmic-creativity.
Article 869
Title@2025-05-26 (1): Dependency Parsing is More Parameter-Efficient with Normalization
Title: Dependency Parsing is More Parameter-Efficient with Normalization | Abhängigkeit Parsing ist mehr Parameter-Effizient mit Normalisierung | 依赖性剖析的参数比正常化的参数要高 2505.20215v1 |
Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on six datasets for semantic and syntactic dependency parsing using a one-hop parser. We train N-layer stacked BiLSTMs and evaluate the parser’s performance with and without normalizing biaffine scores. Normalizing allows us to beat the state of the art on two datasets, with fewer samples and trainable parameters. Code: https://anonymous.4open.science/r/EfficientSDP-70C1
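The effect of score normalization on softmax sharpness can be seen in a small sketch. It assumes a simplified biaffine score, head^T W dep with bias terms omitted, and borrows the Transformer-style 1/sqrt(d) scaling by analogy with the abstract's comparison; the paper's exact normalization may differ.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biaffine_scores(head: List[float], deps: List[List[float]],
                    W: List[List[float]], normalize: bool) -> List[float]:
    """Score one head vector against each dependent: head^T W dep, optionally scaled by 1/sqrt(d)."""
    d = len(head)
    hW = [sum(head[i] * W[i][j] for i in range(d)) for j in range(d)]
    scores = [sum(hW[j] * dep[j] for j in range(d)) for dep in deps]
    if normalize:
        scores = [s / math.sqrt(d) for s in scores]  # assumed Transformer-style scaling
    return scores
```

Scaling shrinks the gaps between scores, so the softmax over candidate heads is less peaked; the abstract argues that, without it, extra parameters are spent compensating for overly sharp outputs.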
Article 870
Title@2025-05-26 (1): How to Improve the Robustness of Closed-Source Models on NLI
Title: How to Improve the Robustness of Closed-Source Models on NLI | Wie man die Robustheit von Closed-Source-Modellen auf NLI verbessert | 如何改进封闭源码模式在非国家借贷方面的有效性 2505.20209v1 |
Authors: Joe Stacey, Lisa Alazraki, Aran Ubhi, Beyza Ermis, Aaron Mueller, Marek Rei
Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improve performance, but this often results in the models learning from dataset-specific heuristics that reduce their robustness on out-of-distribution (OOD) data. Existing methods to improve robustness either perform poorly or are inapplicable to closed-source models because they assume access to model internals or the ability to change the model’s training procedure. In this work, we investigate strategies to improve the robustness of closed-source LLMs through data-centric methods that do not require access to model internals. We find that the optimal strategy depends on the complexity of the OOD data. For highly complex OOD datasets, upsampling more challenging training examples can improve robustness by up to 1.5%. For less complex OOD datasets, replacing a portion of the training set with LLM-generated examples can improve robustness by 3.7%. More broadly, we find that large-scale closed-source autoregressive LLMs are substantially more robust than commonly used encoder models, and are a more appropriate choice of baseline going forward.
Article 871
Title@2025-05-26 (1): Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Title: Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking | Adaptive Klassifikator-freie Führung über Dynamisches Low-Confidence-Masking | 通过动态低信任面罩提供适应性分类无限制指导 2505.20199v1 |
Authors: Pengxiang Li, Shilin Yan, Joey Tsai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model’s instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG’s corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
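The two steps in this abstract, re-masking low-confidence tokens and combining conditional with unconditional predictions, can be sketched roughly as follows. This is an illustration under assumptions, not the paper's code: `MASK_ID`, the fixed re-masking `ratio`, and the function names are hypothetical, and the combination rule is the commonly used CFG form with a guidance scale.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token

def remask_low_confidence(tokens, confidence, ratio):
    # Replace the `ratio` fraction of tokens with the lowest confidence
    # by MASK_ID to build a localized "unconditional" input.
    tokens = np.asarray(tokens)
    k = int(round(ratio * len(tokens)))
    out = tokens.copy()
    if k > 0:
        lowest = np.argsort(confidence)[:k]
        out[lowest] = MASK_ID
    return out

def cfg_combine(cond_logits, uncond_logits, scale):
    # Common CFG form: move from the unconditional prediction toward the
    # conditional one. scale = 1 returns the conditional logits unchanged.
    cond_logits = np.asarray(cond_logits, dtype=float)
    uncond_logits = np.asarray(uncond_logits, dtype=float)
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

In a masked diffusion loop, the model would be run once on the current sequence and once on the output of `remask_low_confidence`, and the two sets of logits passed to `cfg_combine` at each step.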
Article 872
Title@2025-05-26 (1): CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts
Title: CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts | CodeTaxo: Erweiterung der Taxonomie mit begrenzten Beispielen über Code Language Prompts | 代码塔克斯:通过代码语言提示,以有限实例加强分类法的扩展 2408.09070v2 |
Authors: Qingkai Zeng, Yuyang Bai, Zhaoxuan Tan, Zhenyu Wu, Shangbin Feng, Meng Jiang
Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. The task of taxonomy expansion involves integrating emerging concepts into existing taxonomies by identifying appropriate parent concepts for these new query concepts. Previous approaches typically relied on self-supervised methods that generate annotation data from existing taxonomies. However, these methods are less effective when the existing taxonomy is small (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at https://github.com/QingkaiZeng/CodeTaxo-Pub.
Article 873
Title@2025-05-26 (1): SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs
Title: SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs | SHARP: Entsperren der interaktiven Halluzination durch Stance-Transfer in Rollenspiel-LLMs | SHARP:通过在角色扮演中转移角色来解锁互动幻觉 2411.07965v4 |
Authors: Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma
The advanced role-playing capabilities of Large Language Models (LLMs) have enabled rich interactive scenarios, yet existing research in social interactions neglects hallucination while struggling with poor generalizability and implicit character fidelity judgments. To bridge this gap, motivated by human behaviour, we introduce a generalizable and explicit paradigm for uncovering interactive patterns of LLMs across diverse worldviews. Specifically, we first define interactive hallucination through stance transfer, then construct SHARP, a benchmark built by extracting relations from commonsense knowledge graphs and utilizing LLMs’ inherent hallucination properties to simulate multi-role interactions. Extensive experiments confirm our paradigm’s effectiveness and stability, examine the factors that influence these metrics, and challenge conventional hallucination mitigation solutions. More broadly, our work reveals a fundamental limitation in popular post-training methods for role-playing LLMs: the tendency to obscure knowledge beneath style, resulting in monotonous yet human-like behaviors - interactive hallucination.
Article 874
Title@2025-05-26 (1): THiNK: Can Large Language Models Think-aloud?
Title: THiNK: Can Large Language Models Think-aloud? | THiNK: Können große Sprachmodelle denken? | 大语言模型能思考吗? 2505.20184v1 |
Authors: Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski
Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably well on lower-order categories, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science. The code is available at our GitHub repository.
Article 875
Title@2025-05-26 (1): “KAN you hear me?” Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding
Title: “KAN you hear me?” Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding | “KAN hörst du mich?” Kolmogorov-Arnold-Netzwerke für gesprochenes Sprachverständnis erkunden | 探索科尔莫戈洛夫-阿诺尔德语言理解网络 2505.20176v1 |
Authors: Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis
Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains underexplored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets with increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers differently attend to input regions of the raw waveforms.
Article 876
Title@2025-05-26 (1): From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Title: From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | Von der Ausrichtung zur Weiterentwicklung: Bootstrapping Audio-Language Alignment mit synthetischen Daten | 从对齐到推进: 用合成数据推动音频语言对齐 2505.20166v1 |
Authors: Chun-Yi Kuan, Hung-yi Lee
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where important textual capabilities such as instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about their reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making the process resource-intensive. To address these issues, we leverage the backbone LLMs from ALLMs to synthesize general-purpose caption-style alignment data. We refer to this process as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Building on BALSa, we introduce LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method designed to improve ALLMs’ ability to distinguish between present and absent sounds. We further extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption that describes them all, thereby enhancing audio-language alignment. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills. Moreover, incorporating multi-audio training further enhances the model’s comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to the development of ALLMs.
Article 877
Title@2025-05-26 (1): Visual Abstract Thinking Empowers Multimodal Reasoning
Title: Visual Abstract Thinking Empowers Multimodal Reasoning | Visuelles Abstraktes Denken macht multimodale Vernunft | 视觉抽象思考赋予多模式理由 2505.20164v1 |
Authors: Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, Yang Liu
Images usually convey richer detail than text, but often include redundant information that can degrade multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of the reasoning process by inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over the GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.
Article 878
Title@2025-05-26 (1): Exploring Generative Error Correction for Dysarthric Speech Recognition
Title: Exploring Generative Error Correction for Dysarthric Speech Recognition | Erforschung der Generativen Fehlerkorrektur bei der Erkennung von Dysarthric Speech | 探索为承认沙皇演说识别而产生错误校正的探索 2505.20163v1 |
Authors: Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi
Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
Article 879
Title@2025-05-26 (1): Capability-Based Scaling Laws for LLM Red-Teaming
Title: Capability-Based Scaling Laws for LLM Red-Teaming | Capability-Based Scaling-Gesetze für LLM Red-Teaming | LLM 红色团队合作以能力为基础的增强法律 2505.20162v1 |
Authors: Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target’s capability exceeds the attacker’s, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models’ persuasive and manipulative abilities to limit their effectiveness as attackers.
Article 880
Title@2025-05-26 (1): Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Title: Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning | Prismatische Synthese: Gradientenbasierte Datendiversifizierung steigert Generalisierung in LLM-Reasoning | 理论综合:基于逐步的数据多样化促进LLM理由说明的概括化 2505.20161v1 |
Authors: Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models – and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning – as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman’s $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data – not just on in-distribution tests but across unseen, out-of-distribution benchmarks – significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B – the same base model trained on proprietary data generated by 671B R1 – on 6 out of 7 challenging benchmarks.
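An entropy-based diversity score of the kind described here can be sketched with the generic Vendi Score computation applied to per-example gradient vectors. This is an assumption about the general shape of the metric, not the paper's exact definition of G-Vendi: the cosine-similarity kernel, the eigenvalue cutoff, and the name `vendi_score` are illustrative choices.

```python
import numpy as np

def vendi_score(features):
    # features: (n, d) array, e.g. one (projected) gradient vector per example.
    X = np.asarray(features, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    K = X @ X.T / X.shape[0]                          # kernel with trace 1
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]                            # drop numerical zeros
    entropy = -np.sum(eig * np.log(eig))              # Shannon entropy of the spectrum
    return float(np.exp(entropy))                     # "effective number" of distinct items
```

Under this sketch, a dataset whose gradients all point the same way scores about 1, while `n` mutually orthogonal gradients score about `n`.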
Article 881
Title@2025-05-26 (1): Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
Title: Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up | Reversal of Thought: Erweiterung von großen Sprachmodellen mit präference-guided Reverse Reasoning Warm-up | 思想的逆转:加强大语言模式,以优惠、有引导的反反反向理由暖化 2410.12323v3 |
Authors: Jiahao Yuan, Dehui Du, Hao Zhang, Zixiang Di, Usman Naseem
Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs’ logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a plug-and-play and cost-effective reasoning framework designed to enhance the logical reasoning abilities of LLMs during the warm-up phase prior to batch inference. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs’ cognitive preferences shaped by RLHF. Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs’ reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.
Article 882
Title@2025-05-26 (1): Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs
Title: Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs | Pangu Light: Gewichtswiederinitialisierung für das Beschneiden und Beschleunigen von LLMs | Pangu光: 灯光和加速LMLM的重量再启动 2505.20155v1 |
Authors: Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang
Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model’s post-pruning training accuracy. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this “missing piece”. Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like the Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B’s 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B’s 80.9 average score and 2225 tokens/s.
Article 883
Title@2025-05-26 (1): UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models
Title: UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models | UORA: Einheitliche Orthogonale Reinitialisierungsanpassung im Parameter-Effizient Feintuning großer Modelle | UORA:大型模型参数-有效精美设计中统一的正正正重新初始化适应 2505.20154v1 |
Authors: Xueyan Zhang, Jinman Zhao, Zhifei Yang, Yibo Zhong, Shuhao Guan, Linbo Cao, Yining Wang
This paper introduces Uniform Orthogonal Reinitialization Adaptation (UORA), a novel parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). UORA achieves state-of-the-art performance and parameter efficiency by leveraging a low-rank approximation method to reduce the number of trainable parameters. Unlike existing methods such as LoRA and VeRA, UORA employs an interpolation-based reparametrization mechanism that selectively reinitializes rows and columns in frozen projection matrices, guided by the vector magnitude heuristic. This results in substantially fewer trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UORA’s superiority in achieving competitive fine-tuning performance with negligible computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.
Article 884
Title@2025-05-26 (1): Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
Title: Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities | Gedachte politische Optimierung: Überwindung externer Leitlinien und interner Fähigkeiten | 优化政策:将外部指导和内部能力结合起来 2505.15692v2 |
Authors: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model’s output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance (“thought patterns”). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO’s potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
Article 885
Title@2025-05-26 (1): Polynomial, trigonometric, and tropical activations
Title: Polynomial, trigonometric, and tropical activations | Polynomische, trigonometrische und tropische Aktivierungen | 多边、三角和热带活性 2502.01247v2 |
Authors: Ismail Khalfaoui-Hassani, Stefan Kesselheim
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
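As a rough illustration of an activation built on an orthonormal polynomial basis, the sketch below combines probabilists' Hermite polynomials scaled by 1/sqrt(k!), which makes them orthonormal under a standard Gaussian. It uses NumPy only and does not reproduce the torchortho API; the function name `hermite_activation` and the example coefficients are hypothetical.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_activation(x, coeffs):
    # f(x) = sum_k coeffs[k] * He_k(x) / sqrt(k!)
    # He_k / sqrt(k!) is orthonormal w.r.t. the standard normal density,
    # so for x ~ N(0, 1) the second moment of f equals sum_k coeffs[k]**2.
    scaled = [c / math.sqrt(math.factorial(k)) for k, c in enumerate(coeffs)]
    return He.hermeval(x, scaled)

# Example coefficients with unit squared norm: a unit-variance Gaussian
# input then yields a unit-variance output.
coeffs = [0.0, 0.6, 0.8]

# Check the second moment with Gauss-Hermite quadrature (weight exp(-x^2 / 2)).
nodes, weights = He.hermegauss(20)
second_moment = np.sum(weights * hermite_activation(nodes, coeffs) ** 2) / math.sqrt(2 * math.pi)
```

Keeping the squared coefficients summing to one is one simple way to obtain the variance-preserving behavior mentioned in the abstract; the paper's actual initialization scheme may differ.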
Article 886
Title@2025-05-26 (1): Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Title: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models | Hart negatives Kontrastives Lernen für feinkörniges geometrisches Verständnis in großen multimodalen Modellen | 大型多模式模型中精细几何理解的硬反向硬学习 2505.20152v1 |
Authors: Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li
Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.
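Why hard negatives matter in contrastive training can be seen from a minimal InfoNCE-style loss for a single anchor. This is a generic sketch, not the MMCLIP training code: the function name `info_nce`, the temperature value, and the toy 2-D embeddings are assumptions made for illustration.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    # Contrastive loss for one anchor: the positive should score higher
    # (by cosine similarity) than every negative.
    cands = np.vstack([positive, negatives]).astype(float)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    a = np.asarray(anchor, dtype=float)
    a = a / np.linalg.norm(a)
    logits = cands @ a / temperature
    logits = logits - logits.max()  # numerical stability
    # Index 0 is the positive; return its negative log-softmax.
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
hard_negative = np.array([[0.9, 0.1]])   # nearly identical to the positive
easy_negative = np.array([[-1.0, 0.0]])  # trivially different

hard_loss = info_nce(anchor, positive, hard_negative)
easy_loss = info_nce(anchor, positive, easy_negative)
```

The easy negative contributes almost no loss and therefore almost no gradient, whereas the hard negative yields a large loss, which is the motivation for constructing negatives by small perturbations of diagrams or captions.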
Article 887
Title@2025-05-26 (1): RESTOR: Knowledge Recovery in Machine Unlearning
Title: RESTOR: Knowledge Recovery in Machine Unlearning | RESTOR: Wissensrückgewinnung in Maschinellem Lernen | 机械学习中的知识恢复 2411.00204v3 |
Authors: Keivan Rezaei, Khyathi Chandu, Soheil Feizi, Yejin Choi, Faeze Brahman, Abhilasha Ravichander
Large language models trained on web-scale corpora can memorize undesirable data containing misinformation, copyrighted material, or private or sensitive information. Recently, several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints from trained models – that is, to approximate a model that had never been trained on these datapoints in the first place. However, evaluating the effectiveness of unlearning algorithms remains an open challenge. Previous work has relied on heuristics – such as verifying that the model can no longer reproduce the specific information targeted for removal while maintaining accuracy on unrelated test data. These approaches inadequately capture the complete effect of reversing the influence of datapoints on a trained model. In this work, we propose the RESTOR framework for machine unlearning evaluation, which assesses the ability of unlearning algorithms for targeted data erasure, by evaluating the ability of models to forget the knowledge introduced in these datapoints, while simultaneously recovering the model’s knowledge state had it never encountered these datapoints. RESTOR helps uncover several novel insights about popular unlearning algorithms, and the mechanisms through which they operate – for instance, identifying that some algorithms merely emphasize forgetting but not recovering knowledge, and that localizing unlearning targets can enhance unlearning performance.
Article 888
Title@2025-05-26 (1): SeMe: Training-Free Language Model Merging via Semantic Alignment
Title: SeMe: Training-Free Language Model Merging via Semantic Alignment | SeMe: Training-freies Sprachmodell Zusammenführen über semantische Ausrichtung | SeME:通过语义一致合并的无培训语言模式 2505.20144v1 |
Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.
Article 889
Title@2025-05-26 (1): GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Title: GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models | GUARD: Rollenspiel zur Generierung von Jailbreakings in natürlicher Sprache zur Prüfung der Einhaltung der Leitlinie für große Sprachmodelle | GUARD: 利用《大语言模式遵守试验准则准则》创造以自然语言破门破门 2402.03299v5 |
Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
The discovery of “jailbreaks” that bypass the safety filters of Large Language Models (LLMs) and elicit harmful responses has encouraged the community to implement safety measures. One major safety measure is to proactively test LLMs with jailbreaks prior to release. Such testing requires a method that can generate jailbreaks at scale and efficiently. In this paper, we follow a novel yet intuitive strategy to generate jailbreaks in the style of human-written ones. We propose a role-playing system that assigns four different roles to the user LLMs to collaborate on new jailbreaks. Furthermore, we collect existing jailbreaks and split them, sentence by sentence, into independent characteristics using clustering frequency and semantic patterns. We organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. Our system of different roles leverages this knowledge graph to generate new jailbreaks, which have proved effective in inducing LLMs to generate unethical or guideline-violating responses. In addition, we pioneer a setting in our system that automatically follows government-issued guidelines to generate jailbreaks that test whether LLMs follow those guidelines. We refer to our system as GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have empirically validated the effectiveness of GUARD on three cutting-edge open-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a widely used commercial LLM (ChatGPT). Moreover, our work extends to the realm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing GUARD’s versatility and contributing valuable insights for the development of safer, more reliable LLM-based applications across diverse modalities.
Article 890
Title@2025-05-26 (1): StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs
Title: StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs | StructEval: Benchmarking der Kapazitäten von LLM zur Erzeugung struktureller Outputs | DructEval:将LLMs的能力与产生结构性产出挂钩 2505.20139v1 |
Authors: Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen
As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs’ capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
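As a rough illustration of what a format-adherence check for a non-renderable format could look like, the sketch below tests whether a model output parses as JSON and then counts how many expected key paths are present. The function name `json_structure_score` and the dotted-path convention are assumptions made for this example; the abstract does not specify StructEval's actual metrics.

```python
import json


def json_structure_score(output: str, required_paths: list[str]) -> float:
    """Hypothetical check: 0.0 if `output` is not valid JSON,
    otherwise the fraction of required dotted key paths that exist."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # output does not adhere to the format at all
    if not required_paths:
        return 1.0  # valid JSON and nothing further to check
    found = 0
    for path in required_paths:
        node = data
        present = True
        # Walk the nested dictionaries one key at a time.
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                present = False
                break
        found += present
    return found / len(required_paths)
```

Separating "does it parse" from "does it contain the expected structure" mirrors the abstract's distinction between format adherence and structural correctness, though the real benchmark may score both quite differently.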
Article 891
Title@2025-05-26 (1): P$^2$ Law: Scaling Law for Post-Training After Model Pruning
Title: P$^2$ Law: Scaling Law for Post-Training After Model Pruning | P$^2$ Gesetz: Skalierungsgesetz für Post-Training nach Modellprüfung | P$2美元 法律:示范 “ 谨慎 “ 后培训后培训后扩大法 2411.10272v3 |
Authors: Xiaodong Chen, Yuxuan Hu, Xiaokang Zhang, Yanling Wang, Cuiping Li, Hong Chen, Jing Zhang
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. While post-training benefits from larger datasets, once the dataset size is already substantial, increasing the training data provides only limited performance gains. To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling Law for Post-training after model Pruning, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model’s post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model’s loss before pruning. Moreover, P$^2$ Law can generalize to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.
Article 892
Title@2025-05-26 (1): AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings
Title: AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings | AweDist: Aufmerksamkeitsbewusste Einbettung Destillation für neue Eingabe-Token-Einbettungen | AweDist: 新的输入式嵌入式嵌入器的注意嵌入蒸馏 2505.20133v1 |
Authors: Konstantin Dobler, Desmond Elliott, Gerard de Melo
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist is able to outperform even strong baselines.
Article 893
Title@2025-05-26 (1): Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers
Title: Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers | Iterative Selbstanreizung macht große Sprachmodelle als Agent-Sucher aus | 迭代自我激励激励增强大语言模型作为代理搜索者的能力 2505.20128v1 |
Authors: Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, Zhaochun Ren
Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries and the presence of irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework in which the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To equip the LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, an extension of our method to broader scenarios, to facilitate future work.
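The E-step/M-step loop described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the abstract does not say how importance weights are computed, so a softmax over hypothetical per-trajectory scores is used here, and `importance_weights` and `reweighted_nll` are names invented for the example.

```python
import math


def importance_weights(scores, temperature=1.0):
    """E-step stand-in (assumption): turn per-trajectory scores into
    normalized weights with a numerically stable softmax."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_nll(log_probs, weights):
    """M-step objective: negative log-likelihood of each sampled
    trajectory, weighted by its importance weight."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Two sampled trajectories with equal scores receive equal weight.
weights = importance_weights([0.0, 0.0])
loss = reweighted_nll([-1.0, -3.0], weights)
```

In an actual training loop the log-probabilities would come from the LLM being trained and the loss would be minimized by gradient descent; the sketch only shows how the weights enter the objective.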
Article 894
Title@2025-05-26 (1): PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
Title: PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | PandaGuard: Systematische Bewertung der LLM-Sicherheit gegen Jailbreaking-Angriffe | PandaGuard:系统评估防止侵入监狱袭击的LLM安全性 2505.13862v3 |
Authors: Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng
Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration-driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.
Article 895
Title@2025-05-26 (1): Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Title: Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | Retrieval Modelle sind nicht Tool-Savvy: Benchmarking Tool Retrieval für große Sprachmodelle | 检索模型不是工具保存工具:大语言模型基准工具检索工具 2503.01763v2 |
Authors: Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
Article 896
Title@2025-05-26 (1): Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
Title: Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings | Crabs: Ressourcenverbrauch über Auto-Generation für LLM-DoS-Angriff unter Black-Box-Einstellungen | Crabs: 在黑盒设置下通过LLM-DoS攻击的自动生成来消耗资源 2412.13879v4 |
Authors: Yuanhe Zhang, Zhenhong Zhou, Wei Zhang, Xinyue Wang, Xiaojun Jia, Yang Liu, Sen Su
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks yet remain vulnerable to external threats, particularly LLM Denial-of-Service (LLM-DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, existing studies predominantly focus on white-box attacks, leaving black-box scenarios underexplored. In this paper, we introduce the Auto-Generation for LLM-DoS (AutoDoS) attack, an automated algorithm designed for black-box LLMs. AutoDoS constructs the DoS Attack Tree and expands the node coverage to achieve effectiveness under black-box conditions. Through transferability-driven iterative optimization, AutoDoS can work across different models with a single prompt. Furthermore, we reveal that embedding the Length Trojan allows AutoDoS to bypass existing defenses more effectively. Experimental results show that AutoDoS amplifies service response latency by over 250×, leading to severe resource consumption in terms of GPU utilization and memory usage. Our work provides a new perspective on LLM-DoS attacks and security defenses. Our code is available at https://github.com/shuita2333/AutoDoS.
Article 897
Title@2025-05-26 (1): Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi’s Zibaldone
Title: Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi’s Zibaldone | Genannte Entity Recognition in Historic Italian: Der Fall von Giacomo Leopardis Zibaldone | 在历史上意大利文中命名实体识别:Giacomo Leopardi的Zibaldone案 2505.20113v1 |
Authors: Cristian Santini, Laura Melosi, Emanuele Frontoni
The increased digitization of the world’s textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need for computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi’s Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.
Article 898
Title@2025-05-26 (1): ResSVD: Residual Compensated SVD for Large Language Model Compression
Title: ResSVD: Residual Compensated SVD for Large Language Model Compression | ResSVD: Residual Compensated SVD für großsprachliche Modellkompression | ResSVD: 大语言模型压缩剩余补偿SVD 2505.20112v1 |
Authors: Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang
Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
Article 899
Title@2025-05-26 (1): Language-Agnostic Suicidal Risk Detection Using Large Language Models
Title: Language-Agnostic Suicidal Risk Detection Using Large Language Models | Sprach-agnostische Suizidrisikoerkennung mit großen Sprachmodellen | 使用大语言模型进行语言不可知的自杀风险探测 2505.20109v1 |
Authors: June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang
Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.
Article 900
Title@2025-05-26 (1): Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities
Title: Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities | Große Sprachmodelle treffen auf Wissensgraphen für Fragenbeantwortung: Synthese und Chancen | 大语言模式满足回答问题的知识图表:综合与机遇 2505.20099v1 |
Authors: Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang
Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG’s role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.
Article 901
Title@2025-05-26 (1): S2LPP: Small-to-Large Prompt Prediction across LLMs
Title: S2LPP: Small-to-Large Prompt Prediction across LLMs | S2LPP: Kleine bis große Vorhersagen über LLMs | S2LPP: 小到大迅速预测 2505.20097v1 |
Authors: Liang Cheng, Tianyi LI, Zhaowei Wang, Mark Steedman
The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering and adding costs in terms of computing and human effort. In this study, we present experiments encompassing multiple LLM variants of varying sizes, aimed at probing their preferences among different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method to use a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching performance with optimal prompts among candidates. More importantly, our experiment shows the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness.
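The small-to-large idea can be illustrated with a short sketch: score each candidate template with the cheaper model on a small development set, then reuse the winner for the larger model. Everything here is an assumption for illustration, including the `select_template` name, the `{question}` placeholder, and the use of exact-match accuracy as the selection criterion; the abstract does not describe the paper's actual selection procedure.

```python
def select_template(templates, small_model, dev_set):
    """Return the template with the highest exact-match accuracy
    under `small_model`, a callable mapping a prompt to an answer."""
    def accuracy(template):
        correct = sum(
            small_model(template.format(question=question)) == answer
            for question, answer in dev_set
        )
        return correct / len(dev_set)

    return max(templates, key=accuracy)


# Toy stand-in for a small LLM: it only answers "yes" to "Q:" prompts.
def toy_small_model(prompt):
    return "yes" if prompt.startswith("Q:") else "no"


templates = ["Question: {question}", "Q: {question}"]
dev_set = [("Is water wet?", "yes"), ("Is fire hot?", "yes")]
best = select_template(templates, toy_small_model, dev_set)
```

The chosen template would then be passed to the larger model unchanged, so the expensive model is never queried during the search over templates.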
Article 902
Title@2025-05-26 (1): MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning
Title: MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning | MA-RAG: Multi-Agent Retrieval-Augmented Generation über kollaborative Chain-of-Thought-Reasoning | MA-RAG:通过协作研究链解释理由实现多权获取-提款人一代 2505.20096v1 |
Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai
We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.
Article 903
Title@2025-05-26 (1): Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models
Title: Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models | Sicherheit durch Vernunft: Eine empirische Studie zu vernünftigen Guardrail-Modellen | 安全理由:对护卫车模型说明理由的经验研究 2505.20087v1 |
Authors: Makesh Narsimhan Sreedhar, Traian Rebedea, Christopher Parisien
Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrail models in real-world systems.
Article 904
Title@2025-05-26 (1): Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Title: Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models | Enthüllen der Intrinsischen Ethischen Verletzlichkeit von ausgerichteten großen Sprachmodellen | 揭示统一大语言模式内在道德脆弱性 2504.05050v3 |
Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible “dark patterns” in LLMs’ parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local “safety regions” in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts–a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.
Article 905
Title@2025-05-26 (1): SAEs Are Good for Steering – If You Select the Right Features
Title: SAEs Are Good for Steering – If You Select the Right Features | SAEs sind gut für das Lenken – wenn Sie die richtigen Funktionen auswählen | SAEs 有利于指导 – – 如果您选择了正确的特性 2505.20063v1 |
Authors: Dana Arad, Aaron Mueller, Yonatan Belinkov
Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model’s latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model’s output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model’s input, and output features, which have a human-understandable effect on the model’s output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.
Article 906
Title@2025-05-26 (1): “Alexa, can you forget me?” Machine Unlearning Benchmark in Spoken Language Understanding
Title: “Alexa, can you forget me?” Machine Unlearning Benchmark in Spoken Language Understanding | „Alexa, kannst du mich vergessen?” Machine Unlearning Benchmark in Spoken Language Understanding | “亚历克斯,你能忘记我吗?” 2505.15700v2 |
Authors: Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis
Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential “right to be forgotten” requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.
Article 907
Title@2025-05-26 (1): Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion
Title: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion | Multimodale LLM-geführte semantische Korrektur in Text-zu-Bild-Diffusion | 文字到图像传播中多式LLM-指导的语义校正 2505.20053v1 |
Authors: Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu
Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only a few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD’s significant improvements.
Article 908
Title@2025-05-26 (1): MVP: Multi-source Voice Pathology detection
Title: MVP: Multi-source Voice Pathology detection | MVP: Multi-Source Sprachpathologie-Erkennung | MVP:多源语音病理检测 2505.20050v1 |
Authors: Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis
Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across the German, Portuguese, and Italian languages shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.
Article 909
Title@2025-05-26 (1): Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
Title: Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks | Grammatik der formalen Unsicherheit: Wann man LLMs bei automatisierten Aufgaben zur Begründung vertraut | 正式不确定性的语法:在自动说明理由任务中何时信任LLMs 2505.20047v1 |
Authors: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization’s domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
Article 910
Title@2025-05-26 (1): Bemba Speech Translation: Exploring a Low-Resource African Language
Title: Bemba Speech Translation: Exploring a Low-Resource African Language | Bemba Speech Translation: Erforschen einer ressourcenarmen afrikanischen Sprache | 本巴语言翻译:探索非洲低资源语言 2505.02518v2 |
Authors: Muhammad Hazim Al Farouq, Aman Kassahun Wassie, Yasmin Moslem
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2025), low-resource languages track, namely for Bemba-to-English speech translation. We built cascaded speech translation systems based on Whisper and NLLB-200, and employed data augmentation techniques, such as back-translation. We investigate the effect of using synthetic data and discuss our experimental setup.
Article 911
Title@2025-05-26 (1): REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Title: REARANK: Reasoning Re-ranking Agent via Reinforcement Learning | REARANK: Reasoning Re-Ranking Agent über Verstärkungs-Lernen | REARANK: 通过加强学习,为重新升级的代理提供理由 2505.20046v1 |
Authors: Le Zhang, Bo Wang, Xipeng Qiu, Siva Reddy, Aishwarya Agrawal
We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
Article 912
Title@2025-05-26 (1): Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs
Title: Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs | Unsichere Aufmerksamkeitsköpfe: Effiziente Unüberwachte Unsichere Quantifizierung für LLMs | 确定性 – – 警告注意头头:对LLMs进行高效率的、无监督的、不确定性的量化 2505.20045v1 |
Authors: Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, Artem Shelmanov
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as “hallucinations”. Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain “uncertainty-aware” heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (<1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.
Article 913
Title@2025-05-26 (1): The More Similar, the Better? Associations between Latent Semantic Similarity and Emotional Experiences Differ across Conversation Contexts
Title: The More Similar, the Better? Associations between Latent Semantic Similarity and Emotional Experiences Differ across Conversation Contexts | Je ähnlicher, desto besser? Assoziationen zwischen latenter semantischer Ähnlichkeit und emotionaler Erfahrung unterscheiden sich über Gesprächskontexte | 越相似越好?潜在语义相似性与情感体验之间的关联因对话情境而异 2309.12646v3 |
Authors: Chen-Wei Yu, Yun-Shiuan Chuang, Alexandros N. Lotsos, Tabea Meier, Claudia M. Haase
Latent semantic similarity (LSS) is a measure of the similarity of information exchanges in a conversation. Challenging the assumption that higher LSS bears more positive psychological meaning, we propose that this association might depend on the type of conversation people have. On the one hand, the share-mind perspective would predict that higher LSS should be associated with more positive emotional experiences across the board. The broaden-and-build theory, on the other hand, would predict that higher LSS should be inversely associated with more positive emotional experiences specifically in pleasant conversations. Linear mixed modeling based on conversations among 50 long-term married couples supported the latter prediction. That is, partners experienced greater positive emotions when their overall information exchanges were more dissimilar in pleasant (but not conflict) conversations. This work highlights the importance of context in understanding the emotional correlates of LSS and exemplifies how modern natural language processing tools can be used to evaluate competing theory-driven hypotheses in social psychology.
Article 914
Title@2025-05-26 (1): Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Title: Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation | Enthüllen der Macht der Quelle: Quelle-basierte Minimum Bayes Risiko-Dekodierung für neurale maschinelle Übersetzung | 资料来源:基于源的神经机器翻译最低贝ys风险代号。 2406.11632v5 |
Authors: Boxuan Lyu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura
Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. Inspired by Quality Estimation (QE) reranking, which uses the QE model as a ranker, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes quasi-sources (generated via paraphrasing or back-translation) as “support hypotheses” and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR outperforms QE reranking and standard MBR decoding. Our findings suggest that sMBR is a promising approach for NMT decoding.
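The selection rule behind MBR decoding can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `token_f1` and `mbr_decode` are hypothetical, and the token-overlap F1 utility is only a stand-in for a real translation metric. Standard MBR scores each candidate against the other candidates; in sMBR, as the abstract describes it, the support set would instead be quasi-sources and the utility a reference-free QE metric.

```python
from typing import Callable, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): token-overlap F1 between two strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    # Count tokens shared by both strings, respecting multiplicity.
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    candidates: Sequence[str],
    support: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest mean utility over the support set."""
    def expected_utility(candidate: str) -> float:
        return sum(utility(candidate, s) for s in support) / len(support)

    return max(candidates, key=expected_utility)


# Standard MBR: the candidate list doubles as the support set.
candidates = [
    "the cat sat on the mat",
    "a cat sits on a mat",
    "the dog ran away",
]
best = mbr_decode(candidates, candidates)
print(best)
```

Swapping in a different `support` list and `utility` function is the only change needed to move from this toy setup toward a source-based variant.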
Article 915
Title@2025-05-26 (1): Multi-modal brain encoding models for multi-modal stimuli
Title: Multi-modal brain encoding models for multi-modal stimuli | Multimodale Gehirnkodierungsmodelle für multimodale Reize | 多模式刺激多模式大脑编码模型 2505.20027v1 |
Authors: Subba Reddy Oota, Khushbu Pahwa, Mounika Marreddy, Maneesh Singh, Manish Gupta, Bapi S. Raju
Despite participants engaging in unimodal stimuli, such as watching images or silent videos, recent work has demonstrated that multi-modal Transformer models can predict visual brain activity impressively well, even with incongruent modality representations. This raises the question of how accurately these multi-modal models can predict brain activity when participants are engaged in multi-modal stimuli. As these models grow increasingly popular, their use in studying neural activity provides insights into how our brains respond to such multi-modal naturalistic stimuli, i.e., where the brain separates and integrates information across modalities through a hierarchy from early sensory regions to higher cognition. We investigate this question by using multiple unimodal and two types of multi-modal models (cross-modal and jointly pretrained) to determine which type of model is more relevant to fMRI brain activity when participants are engaged in watching movies. We observe that both types of multi-modal models show improved alignment in several language and visual regions. This study also helps in identifying which brain regions process unimodal versus multi-modal information. We further investigate the contribution of each modality to multi-modal alignment by carefully removing unimodal features one by one from multi-modal representations, and find that there is additional information beyond the unimodal embeddings that is processed in the visual and language regions. Based on this investigation, we find that for cross-modal models, brain alignment is partially attributed to the video modality, whereas for jointly pretrained models, it is partially attributed to both the video and audio modalities. This serves as a strong motivation for the neuroscience community to investigate the interpretability of these models for deepening our understanding of multi-modal information processing in the brain.
Article 916
Title@2025-05-26 (1): A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Title: A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? | Eine Umfrage über die Sicherheitsbedrohungen von Computer-Verwendern: JARVIS oder Ultron? | JARVIS还是ULTRON? 调查计算机用户的安全和安保威胁:JARVIS还是ULTRON? 2505.10924v2 |
Authors: Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
Article 917
Title@2025-05-26 (1): A Survey of LLM-based Agents in Medicine: How far are we from Baymax?
Title: A Survey of LLM-based Agents in Medicine: How far are we from Baymax? | Eine Umfrage von LLM-basierten Medikamenten in der Medizin: Wie weit sind wir von Baymax entfernt? | 对医学中以LLM为主的药剂的调查:我们离Baymax有多远? 2502.11211v2 |
Authors: Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, Yixuan Yuan
Large Language Models (LLMs) are transforming healthcare through the development of LLM-based agents that can understand, reason about, and assist with medical tasks. This survey provides a comprehensive review of LLM-based agents in medicine, examining their architectures, applications, and challenges. We analyze the key components of medical agent systems, including system profiles, clinical planning mechanisms, medical reasoning frameworks, and external capacity enhancement. The survey covers major application scenarios such as clinical decision support, medical documentation, training simulations, and healthcare service optimization. We discuss evaluation frameworks and metrics used to assess these agents’ performance in healthcare settings. While LLM-based agents show promise in enhancing healthcare delivery, several challenges remain, including hallucination management, multimodal integration, implementation barriers, and ethical considerations. The survey concludes by highlighting future research directions, including advances in medical reasoning inspired by recent developments in LLM architectures, integration with physical systems, and improvements in training simulations. This work provides researchers and practitioners with a structured overview of the current state and future prospects of LLM-based agents in medicine.
Article 918
Title@2025-05-26 (1): Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking
Title: Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking | Training von LLM-basierten Agenten mit synthetischen selbstreflektierten Trajektorien und partieller Maske | 具有合成自我反射轨迹和部分遮罩的以LLM为基础的代理人员培训 2505.20023v1 |
Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao
Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce a partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. When the open-source model LLaMA2-7B-Chat is trained on self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.
Article 919
Title@2025-05-26 (1): TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation
Title: TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation | TTPA: Token-Level Tool-use Preference Alignment Training Framework mit feinkörniger Bewertung | TTPA: 采用精细评价法的全方位工具使用优先调整培训框架 2505.20016v1 |
Authors: Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, Shuo Shang
Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.
Article 920
Title@2025-05-26 (1): On the class of coding optimality of human languages and the origins of Zipf’s law
Title: On the class of coding optimality of human languages and the origins of Zipf’s law | Über die Klasse der Kodierung der optimalen menschlichen Sprachen und die Ursprünge des Zippschen Gesetzes | 在人类语言最优化的编码和齐普夫法律的起源方面 2505.20015v1 |
Authors: Ramon Ferrer-i-Cancho
Here we present a new class of optimality for coding systems. Members of that class are separated linearly from optimal coding and thus exhibit Zipf’s law, namely a power-law distribution of frequency ranks. Within that class, Zipf’s law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf’s law are potential members of the class. In contrast, there are communication systems in other species that cannot be members of that class because they exhibit an exponential distribution instead, but dolphins and humpback whales might be. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are separated by a linear function whose slope is the exponent of Zipf’s law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. Our findings provide support for the hypothesis that Zipf’s law originates from compression.
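The "straight line in double logarithmic scale" can be made concrete with a small fit. The sketch below is only an illustration under stated assumptions: the function name `zipf_exponent` is hypothetical, the input frequencies are synthetic, and an ordinary least-squares fit of log frequency against log rank is used for simplicity (the abstract does not say how the exponent is estimated).

```python
import math
from typing import Iterable


def zipf_exponent(frequencies: Iterable[float]) -> float:
    """Estimate the Zipf exponent as the negated least-squares slope of
    log(frequency) versus log(rank). Needs at least two positive values."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic data (assumption): frequency exactly proportional to 1 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
alpha = zipf_exponent(ideal)
print(round(alpha, 6))
```

On real frequency counts the points rarely fall on a perfect line, so the fitted slope is a rough summary rather than evidence of class membership on its own.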
Article 921
Title@2025-05-26 (1): Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation
Title: Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation | Ist Rationale Qualität Materie? Verbesserung der psychischen Störung Detektion durch selektive Begründung Destillation | 理由质量是否重要? 通过选择性理由蒸馏加强精神失常检测 2505.20014v1 |
Authors: Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, Jong C. Park
The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their substantially large parameter size and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.
Article 922
Title@2025-05-26 (1): WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback
Title: WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback | WebCoT: Web-Agenten verbessern Begründung durch Rekonstruieren Kette-von-Gedanken in Reflexion, Verzweigung und Rollback | WebCot:通过在反射、分流和回滚中重新构建研究链,加强网络代理理由 2505.20013v1 |
Authors: Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, Irwin King
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent’s (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
Article 923
Title@2025-05-26 (1): ProcessBench: Identifying Process Errors in Mathematical Reasoning
Title: ProcessBench: Identifying Process Errors in Mathematical Reasoning | ProcessBench: Identifizierung von Prozessfehlern in mathematischer Reasoning | 进程快节: 识别数学原因中的进程错误 2412.06559v4 |
Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
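The task format (predict the earliest erroneous step, or report that every step is correct) lends itself to a simple scoring sketch. Everything below is an assumption made for illustration and is not stated in the abstract: the function name `score_predictions`, the use of `-1` to mean "no error", and the choice to summarize with the harmonic mean of the accuracies on erroneous and on fully correct solutions.

```python
from typing import Sequence, Tuple

NO_ERROR = -1  # Assumed label for "all steps are correct".


def score_predictions(
    predictions: Sequence[int], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (accuracy on erroneous cases, accuracy on correct cases,
    harmonic mean of the two). A prediction counts only if it names
    exactly the annotated earliest error step (or NO_ERROR)."""
    pairs = list(zip(predictions, labels))
    erroneous = [(p, l) for p, l in pairs if l != NO_ERROR]
    correct = [(p, l) for p, l in pairs if l == NO_ERROR]

    def accuracy(subset) -> float:
        return sum(p == l for p, l in subset) / len(subset) if subset else 0.0

    acc_err, acc_ok = accuracy(erroneous), accuracy(correct)
    total = acc_err + acc_ok
    harmonic = 2 * acc_err * acc_ok / total if total else 0.0
    return acc_err, acc_ok, harmonic


# Hypothetical labels: earliest error at step 2, none, step 0, none.
labels = [2, NO_ERROR, 0, NO_ERROR]
predictions = [2, NO_ERROR, 1, 0]
result = score_predictions(predictions, labels)
print(result)
```

Scoring the two subsets separately keeps a model from looking strong merely by always answering "no error" or by always flagging some step.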
Article 924
Title@2025-05-26 (1): Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition
Title: Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition | Mischung von LoRA-Experten für die automatische Spracherkennung mit geringem Ressourcenbedarf | LoRA 低资源多中心自动语音识别专家混合 2505.20006v1 |
Authors: Raphaël Bagat, Irina Illina, Emmanuel Vincent
We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. We introduce Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent. This method can be used when the accent is known or unknown at inference time, without the need to fine-tune the model again. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate significant improvements in Word Error Rate compared to regular LoRA and full fine-tuning when the accent is unknown. When the accent is known, the results further improve. Furthermore, MAS-LoRA shows less catastrophic forgetting than the other fine-tuning methods. To the best of our knowledge, this is the first use of a mixture of LoRA experts for non-native multi-accent ASR.
Article 925
Title@2025-05-26 (1): Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents
Title: Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents | Unvollkommenheit: Simulieren von Studenten mit unterschiedlichen kognitiven Ebenen mit LLM-basierten Agenten | 普及缺陷:利用基于LLM的代理物模拟具有不同认知水平的学生 2505.19997v1 |
Authors: Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu
Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as “helpful assistants”, aim at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving a 100% improvement in simulation accuracy.
Article 926
Title@2025-05-26 (1): How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation
Title: How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation | Wie gut übersetzen große Begründungsmodelle? Eine umfassende Bewertung für Multi-Domain maschinelle Übersetzung | 大理由模型如何翻译?多功能机器翻译的全面评价 2505.19987v1 |
Authors: Yongshi Ye, Biao Fu, Chongxuan Huang, Yidong Chen, Xiaodong Shi
Large language models (LLMs) have demonstrated strong performance in general-purpose machine translation, but their effectiveness in complex, domain-sensitive translation tasks remains underexplored. Recent advancements in Large Reasoning Models (LRMs) raise the question of whether structured reasoning can enhance translation quality across diverse domains. In this work, we compare the performance of LRMs with traditional LLMs across 15 representative domains and four translation directions. Our evaluation considers various factors, including task difficulty, input length, and terminology density. We use a combination of automatic metrics and an enhanced MQM-based evaluation hierarchy to assess translation quality. Our findings show that LRMs consistently outperform traditional LLMs in semantically complex domains, especially in long-text and high-difficulty translation scenarios. Moreover, domain-adaptive prompting strategies further improve performance by better leveraging the reasoning capabilities of LRMs. These results highlight the potential of structured reasoning in multi-domain machine translation (MDMT) tasks and provide valuable insights for optimizing translation systems in domain-sensitive contexts.
Article 927
Title@2025-05-26 (1): What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs
Title: What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs | Was bedeutet Neuro für Cardio? Untersuchung der Rolle klinischer Spezialdaten in medizinischen LLMs | ” 神经中度 “ 与 “ 心脏病 “ 有何关系? 调查临床特殊数据在医疗长效管中的作用 2505.10113v2 |
Authors: Xinlan Yan, Di Wu, Yibin Lei, Christof Monz, Iacer Calixto
In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intensive scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to the best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.
Article 928
Title@2025-05-26 (1): DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
Title: DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset | DeepDialogue: Ein multi-Turn emotional-Rich gesprochener Dialog Datensatz | 深对话:多发情感- Rich 口语对话框数据集 2505.19978v1 |
Authors: Alkis Koudounas, Moreno La Quatra, Elena Baralis
Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, and turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., “cars,” “travel”) yield more meaningful conversations than abstract ones (e.g., “philosophy”); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
Article 929
Title@2025-05-26 (1): Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language
Title: Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language | Conversational Lexicography: Abfrage Lexicographic Data on Knowledge Graphs mit SPARQL durch natürliche Sprache | 通过自然语言查询与SPARQL通过自然语言的 SPARQL 知识图的文献资料 2505.19971v1 |
Authors: Kilian Sennrich, Sina Ahmadi
Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata’s lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
Article 930
Title@2025-05-26 (1): CP-Router: An Uncertainty-Aware Router Between LLM and LRM
Title: CP-Router: An Uncertainty-Aware Router Between LLM and LRM | CP-Router: Ein unsicherer Router zwischen LLM und LRM | CP-Router:LLM和LRM之间的不确定软件路由器 2505.19970v1 |
Authors: Jiayuan Su, Fulin Lin, Zhaopeng Feng, Han Zheng, Teng Wang, Zhenyu Xiao, Xinlong Zhao, Zuozhu Liu, Lu Cheng, Hongwei Wang
Recent advances in Large Reasoning Models (LRMs) have significantly improved long-chain reasoning capabilities over Large Language Models (LLMs). However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To overcome this, we propose CP-Router, a training-free and model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. The routing decision is guided by the prediction uncertainty estimates derived via Conformal Prediction (CP), which provides rigorous coverage guarantees. To further refine the uncertainty differentiation across inputs, we introduce Full and Binary Entropy (FBE), a novel entropy-based criterion that adaptively selects the appropriate CP threshold. Experiments across diverse MCQA benchmarks, including mathematics, logical reasoning, and Chinese chemistry, demonstrate that CP-Router efficiently reduces token usage while maintaining or even improving accuracy compared to using LRM alone. We also extend CP-Router to diverse model pairings and open-ended QA, where it continues to demonstrate strong performance, validating its generality and robustness.
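A minimal sketch of how conformal prediction could drive such a routing decision. It assumes split conformal prediction with the nonconformity score 1 - p(correct option) and a route-by-set-size rule; the function names, the rule, and the toy numbers are illustrative assumptions rather than the paper's exact procedure, and the FBE threshold selection is left out.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold from calibration data.

    cal_probs: per-question lists of option probabilities from the LLM.
    cal_labels: index of the correct option for each calibration question.
    alpha: target miscoverage rate (e.g. 0.1 for 90% coverage).
    """
    # Nonconformity score: one minus the probability of the correct option.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return 1.0  # too few calibration points: every option is kept
    return scores[k - 1]

def prediction_set(probs, qhat):
    """Options whose nonconformity score does not exceed the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]

def route(probs, qhat):
    """Assumed rule: a singleton set means the LLM is confident enough."""
    return "LLM" if len(prediction_set(probs, qhat)) == 1 else "LRM"

# Toy calibration set (hypothetical numbers).
qhat = conformal_threshold(
    [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]], [0, 0, 0], alpha=0.25
)
print(route([0.90, 0.05, 0.05], qhat))  # one plausible option
print(route([0.45, 0.40, 0.15], qhat))  # two plausible options
```

Under this assumed rule, a larger prediction set signals higher uncertainty, so only those queries pay the token cost of the reasoning model.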
Article 931
Title@2025-05-26 (1): The Limits of Preference Data for Post-Training
Title: The Limits of Preference Data for Post-Training | Die Grenzen der Präferenzdaten für das Post-Training | 培训后优先数据限值 2505.19964v1 |
Authors: Eric Zhao, Jessica Dai, Pranjal Awasthi
Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF’s ability to elicit robust strategies – a class that encompasses most reasoning behaviors.
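The voting-theory analogy can be illustrated with a classic Condorcet cycle: purely ordinal rankings may yield pairwise majorities with no consistent winner. This is a generic textbook illustration, not the paper's construction; the annotators, the outcomes A/B/C, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Return the set of (winner, loser) pairs decided by strict majority.

    rankings: list of rankings, each a list of outcomes from best to worst.
    """
    options = rankings[0]
    wins = set()
    for a, b in combinations(options, 2):
        a_votes = sum(r.index(a) < r.index(b) for r in rankings)
        if 2 * a_votes > len(rankings):
            wins.add((a, b))
        elif 2 * a_votes < len(rankings):
            wins.add((b, a))
    return wins

def condorcet_winner(rankings):
    """Outcome that beats every other outcome pairwise, or None."""
    wins = pairwise_majority(rankings)
    options = rankings[0]
    for c in options:
        if all((c, o) in wins for o in options if o != c):
            return c
    return None

# Three hypothetical annotators with cyclic preferences over outcomes.
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(sorted(pairwise_majority(cyclic)))
print(condorcet_winner(cyclic))
```

In the cyclic profile, A beats B, B beats C, and C beats A, so pairwise preference data alone cannot identify a best outcome, whereas grounded numeric scores could.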
Article 932
Title@2025-05-26 (1): Explanatory Summarization with Discourse-Driven Planning
Title: Explanatory Summarization with Discourse-Driven Planning | Erklärende Zusammenfassung mit diskursgetriebener Planung | 与 “ 分流规划 “ 结合的解释性总结 2504.19339v3 |
Authors: Dongqi Liu, Xi Yu, Vera Demberg, Mirella Lapata
Lay summaries for scientific documents typically include explanations to help readers grasp sophisticated concepts or arguments. However, current automatic summarization methods do not explicitly model explanations, which makes it difficult to align the proportion of explanatory content with human-written summaries. In this paper, we present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences by prompting responses to the plan. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix, respectively. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality, and that it improves model robustness and controllability while mitigating hallucination.
Article 933
Title@2025-05-26 (1): MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models | MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen | MiniLongBunench:大语言模式低成本长方背景理解基准 2505.19959v1 |
Authors: Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often incur prohibitively high evaluation costs, such as testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. In an empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
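A rank correlation near 1 means the compressed benchmark orders models almost exactly as the full one does. The sketch below computes Spearman's rank correlation, which is an assumption since the abstract does not name the coefficient; the scores are made-up numbers, and the closed-form formula used here is only valid when there are no ties.

```python
def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores of five models on the full and the compressed benchmark.
full_scores = [50, 40, 30, 20, 10]
mini_scores = [48, 41, 25, 27, 9]  # models 3 and 4 swap places
print(spearman(full_scores, mini_scores))
```

Only one adjacent pair of models changes order in this toy example, so the coefficient stays high even though the absolute scores differ.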
Article 934
Title@2025-05-26 (1): DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Title: DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph | DCG-SQL: Verbesserung des In-Context-Lernens für Text-zu-SQL mit Deep Contextual Schema Link Graph | DCG-SQL:加强内文学习,以便用深背景图示链接图进行文字到SQL的内文学习 2505.19956v1 |
Authors: Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee
Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and suffer significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationships between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. Our code will be released.
Article 935
Title@2025-05-26 (1): MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Title: MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research | MLR-Bench: Bewertung von KI-Agenten auf Open-Ended Machine Learning Research | MLR-Bench:评估AI公司在开放式机械学习研究方面的代理机构 2505.19955v1 |
Authors: Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results, posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
Article 936
Title@2025-05-26 (1): An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning
Title: An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning | Ein erklärbares Diagnose-Framework für neurodegenerative Dementias durch Verstärkungsoptimierte LLM-Reasoning | 通过强化-优化LLM解释性理疗理由的神经医学性痴呆症可解释的诊断框架 2505.19954v1 |
Authors: Andrew Zamai, Nathanael Fijalkow, Boris Mansencal, Laurent Simon, Eloi Navet, Pierrick Coupe
The differential diagnosis of neurodegenerative dementias is a challenging clinical task, mainly because of the overlap in symptom presentation and the similarity of patterns observed in structural neuroimaging. To improve diagnostic efficiency and accuracy, deep learning-based methods such as Convolutional Neural Networks and Vision Transformers have been proposed for the automatic classification of brain MRIs. However, despite their strong predictive performance, these models find limited clinical utility due to their opaque decision making. In this work, we propose a framework that integrates two core components to enhance diagnostic transparency. First, we introduce a modular pipeline for converting 3D T1-weighted brain MRIs into textual radiology reports. Second, we explore the potential of modern Large Language Models (LLMs) to assist clinicians in the differential diagnosis between Frontotemporal dementia subtypes, Alzheimer’s disease, and normal aging based on the generated reports. To bridge the gap between predictive accuracy and explainability, we employ reinforcement learning to incentivize diagnostic reasoning in LLMs. Without requiring supervised reasoning traces or distillation from larger models, our approach enables the emergence of structured diagnostic rationales grounded in neuroimaging findings. Unlike post-hoc explainability methods that retrospectively justify model decisions, our framework generates diagnostic rationales as part of the inference process, producing causally grounded explanations that inform and guide the model’s decision-making process. In doing so, our framework matches the diagnostic performance of existing deep learning methods while offering rationales that support its diagnostic conclusions.
Article 937
Title@2025-05-26 (1): Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation
Title: Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation | Weniger für mehr: Verbesserte feedbackorientierte gemischte LLMs für die Erzeugung von Molekülen und eine feinkörnige NLI-Bewertung | 减少更多:加强用于分子制导和精细国家低排放指数评价的反馈-调整混合混合LLM(MMLM) 2405.13984v3 |
Authors: Dimitris Gkoumas, Maria Liakata
Scientific language models drive research innovation but require extensive fine-tuning on large datasets. This work enhances such models by improving their inference and evaluation capabilities with minimal or no additional training. Focusing on molecule caption generation, we explore post-training synergies between alignment fine-tuning and model merging in a cross-modal setup. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. Moreover, we propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain. Our experiments demonstrate that our evaluation operates at the right level of granularity, effectively handling multiple content units and subsentence reasoning, while widely adopted NLI methods consistently misalign with assessment criteria.
Article 938
Title@2025-05-26 (1): Can Visual Encoder Learn to See Arrows?
Title: Can Visual Encoder Learn to See Arrows? | Kann Visual Encoder lernen, Pfeile zu sehen? | 视觉编码器能学会看到箭头吗 ? 2505.19944v1 |
Authors: Naoyuki Terashita, Yusuke Tozaki, Hideaki Omote, Congkha Nguyen, Ryosuke Nakamoto, Yuta Koreeda, Hiroaki Ozaki
A diagram is a visual representation of relationships illustrated with edges (lines or arrows), and is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, preventing VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representation through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram–caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
Article 939
Title@2025-05-26 (1): Constructing a BPE Tokenization DFA
Title: Constructing a BPE Tokenization DFA | Aufbau einer BPE Tokenization DFA | 正在构建 BPE 磁盘化 DFA 2405.07671v2 |
Authors: Martin Berglund, Willeke Martens, Brink van der Merwe
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata (DFA) designed to operate directly on tokenizations produced by the popular byte pair encoding (BPE) technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways. The construction preserves some key properties of the automaton, and we use this to establish asymptotic bounds on the state complexity of the automata that result. Finally, we demonstrate how to construct an input-deterministic (subsequential) string-to-string transducer which precisely describes the relationship between strings and their correct tokenizations.
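To make the notion of a "correct tokenization" concrete, here is a small sketch of BPE tokenization driven by an ordered merge list, in the style commonly used by BPE implementations. It is not the paper's DFA construction; the merge list and function name are hypothetical.

```python
def bpe_tokenize(word, merges):
    """Tokenize `word` by repeatedly applying the highest-priority merge.

    merges: ordered list of (left, right) token pairs; earlier entries have
    higher priority. Starts from single characters.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Collect adjacent pairs that have a merge rule, with their priority.
        candidates = [(rank[p], p) for p in zip(tokens, tokens[1:]) if p in rank]
        if not candidates:
            return tokens
        _, best = min(candidates)
        # Merge every non-overlapping occurrence of the best pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

merges = [("a", "b"), ("ab", "c"), ("b", "c")]  # hypothetical merge list
print(bpe_tokenize("abc", merges))
print(bpe_tokenize("bcab", merges))
```

With this merge list, "abc" is never tokenized as "a" followed by "bc", even though both are vocabulary items. Only some token sequences are valid BPE outputs, which is the kind of constraint an automaton over tokenizations has to capture.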
Article 940
Title@2025-05-26 (1): ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen | ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐 2505.19937v1 |
Authors: Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan
Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between the text and audio modalities, which is a telltale sign of whether an LLM is able to associate semantic meaning with audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers for two different tasks (Spoken Question Answering and Emotion Recognition). We show that our metric behaves as expected across different layers and different tasks.
Article 941
Title@2025-05-26 (1): MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning
Title: MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning | MELoRA: Mini-Ensemble Low-Rank-Adapter für ein parametereffizientes Feintuning | MELORA: 用于准计有效微调的小型组合式低射速适应器 2402.17263v3 |
Authors: Pengjie Ren, Chengshun Shi, Shiguang Wu, Mengqi Zhang, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Jiahuan Pei
Parameter-efficient fine-tuning (PEFT) is a popular method for tailoring pre-trained large language models (LLMs), especially as the models’ scale and the diversity of tasks increase. Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional, i.e., significant model changes can be represented with relatively few parameters. However, decreasing the rank leads to larger generalization errors on specific tasks when compared to full-parameter fine-tuning. We present MELoRA, a mini-ensemble of low-rank adapters that uses fewer trainable parameters while maintaining a higher rank, thereby offering improved performance potential. The core idea is to freeze the original pretrained weights and train a group of mini LoRAs, each with only a small number of parameters. This can capture a significant degree of diversity among mini LoRAs, thus promoting better generalization ability. We conduct a theoretical analysis and empirical studies on various NLP tasks. Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction-following tasks, which demonstrates the effectiveness of MELoRA.
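The trade-off between parameter count and rank can be checked with simple arithmetic. The sketch below assumes the mini LoRAs are arranged block-diagonally, each acting on its own equal slice of the input and output dimensions. That arrangement, the layer size, and the function names are assumptions for illustration and are not stated in the abstract.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters of one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def melora_params(d_in, d_out, n, r_mini):
    """n mini LoRAs, each on a (d_in / n) x (d_out / n) slice with rank r_mini."""
    assert d_in % n == 0 and d_out % n == 0, "dimensions must split evenly"
    return n * r_mini * (d_in // n + d_out // n)

def melora_rank_bound(n, r_mini):
    """Upper bound on the rank of a block-diagonal update built from n blocks."""
    return n * r_mini

d = 1024  # hypothetical square layer
print(lora_params(d, d, r=8))               # plain LoRA, rank bound 8
print(melora_params(d, d, n=8, r_mini=1))   # 8 mini LoRAs of rank 1
print(melora_rank_bound(n=8, r_mini=1))     # same rank bound of 8
```

In this toy configuration the block-diagonal variant reaches the same rank bound with one eighth of the trainable parameters, because each mini adapter only spans a slice of the layer.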
Article 942
Title@2025-05-26 (1): GeoEdit: Geometric Knowledge Editing for Large Language Models
Title: GeoEdit: Geometric Knowledge Editing for Large Language Models | GeoEdit: Geometrische Wissensbearbeitung für große Sprachmodelle | GeoEdit:大语言模型的几何知识编辑 2502.19953v2 |
Authors: Yujie Feng, Liming Zhan, Zexin Lu, Yongxin Xu, Xu Chu, Yasha Wang, Jiannong Cao, Philip S. Yu, Xiao-Ming Wu
Regular updates are essential for maintaining up-to-date knowledge in large language models (LLMs). Consequently, various model editing methods have been developed to update specific knowledge within LLMs. However, training-based approaches often struggle to effectively incorporate new knowledge while preserving unrelated general knowledge. To address this challenge, we propose a novel framework called Geometric Knowledge Editing (GeoEdit). GeoEdit utilizes the geometric relationships of parameter updates from fine-tuning to differentiate between neurons associated with new knowledge updates and those related to general knowledge perturbations. By employing a direction-aware knowledge identification method, we avoid updating neurons with directions approximately orthogonal to existing knowledge, thus preserving the model’s generalization ability. For the remaining neurons, we integrate both old and new knowledge for aligned directions and apply a “forget-then-learn” editing strategy for opposite directions. Additionally, we introduce an importance-guided task vector fusion technique that filters out redundant information and provides adaptive neuron-level weighting, further enhancing model editing performance. Extensive experiments on two publicly available datasets demonstrate the superiority of GeoEdit over existing state-of-the-art methods.
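The direction-aware decision can be sketched as a three-way test on the angle between two update vectors for a neuron. The cosine criterion, the tolerance `ortho_tol`, the vector arguments, and the strategy labels are all illustrative assumptions; the abstract does not specify how directions are compared.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def edit_strategy(existing_dir, new_dir, ortho_tol=0.1):
    """Pick an editing strategy for one neuron (assumed decision rule).

    existing_dir: hypothetical direction representing existing knowledge.
    new_dir: hypothetical direction of the fine-tuning update.
    """
    c = cosine(existing_dir, new_dir)
    if abs(c) <= ortho_tol:
        return "skip"  # roughly orthogonal: leave the neuron untouched
    if c > 0:
        return "merge"  # aligned: integrate old and new knowledge
    return "forget_then_learn"  # opposed: remove the old, then learn the new

print(edit_strategy([1.0, 0.0], [0.0, 1.0]))
print(edit_strategy([1.0, 0.0], [2.0, 0.1]))
print(edit_strategy([1.0, 0.0], [-1.0, 0.05]))
```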
Article 943
Title@2025-05-26 (1): A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Title: A Cognitive Writing Perspective for Constrained Long-Form Text Generation | Eine Kognitive Schreibperspektive für die eingeschränkte Langform-Textgenerierung | 受约束的长期形式制长式制式文本生成的认知式写作视角 2502.12568v3 |
Authors: Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, Xiuying Chen
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements. CogWriter: https://github.com/KaiyangWan/CogWriter
Article 944
Title@2025-05-26 (1): JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Title: JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs | JailbreakRadar: Umfassende Bewertung von Jailbreak Attacken gegen LLMs | Jailbreb Radar:全面评估对LLMs的越狱袭击 2402.05668v3 |
Authors: Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang
Jailbreak attacks aim to bypass the safeguards of LLMs. While researchers have studied individual jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. Then we conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses, which limits their practicality. Our study offers valuable insights for future research on jailbreak attacks and defenses. We hope our work can help the community avoid incremental work and serve as an effective benchmark tool for practitioners.
Article 945
Title@2025-05-26 (1): Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles | Enigmata: Scaling Logical Reasoning in großen Sprachmodellen mit synthetischen überprüfbaren Puzzles | 英格玛塔:在使用合成可核实拼图的大型语言模型中扩大逻辑理由 2505.19914v1 |
Authors: Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
Large Language Models (LLMs), such as OpenAI’s o1 and DeepSeek’s R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
Article 946
Title@2025-05-26 (1): APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization
Title: APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization | APE: Ein datenzentrischer Benchmark für effiziente LLM-Anpassung in der Textzusammenfassung | APE: 文本摘要中高效LLM适应数据中心基准 2505.19912v1 |
Authors: Javier Marín
We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE’s effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory’s “adjacent possible”, APE’s core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.
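The "fine-tune on a small batch, keep only improvements" loop can be sketched as a simple accept-or-reject search. The callables `fine_tune` and `evaluate` are hypothetical stand-ins, and the toy model below is a single number rather than an LLM, so this shows the control flow only.

```python
def ape_loop(model, batches, fine_tune, evaluate):
    """Greedy accept-if-better loop over small data batches.

    fine_tune(model, batch) -> candidate model (hypothetical callable)
    evaluate(model) -> score where higher is better (hypothetical callable)
    """
    best_score = evaluate(model)
    for batch in batches:
        candidate = fine_tune(model, batch)
        score = evaluate(candidate)
        if score > best_score:  # retain only improvements
            model, best_score = candidate, score
    return model, best_score

# Toy stand-ins: the "model" is a number and the task optimum is 10.
def toy_fine_tune(model, batch):
    return model + batch

def toy_evaluate(model):
    return -abs(model - 10)

final_model, final_score = ape_loop(0, [3, -2, 4, 5, 1], toy_fine_tune, toy_evaluate)
print(final_model, final_score)
```

In a real setting each step would run a short fine-tuning job and score the result on a held-out metric such as BLEU; rejected candidates are simply discarded, so the current model never regresses on that metric.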
Article 947
Title@2025-05-26 (1): Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models
Title: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Lineare Kontrolle des Testbewusstseins zeigt unterschiedliche Compliance in vernünftigen Modellen | 对试验认知值的线性控制 2505.14617v2 |
Authors: Sahar Abdelnabi, Ahmed Salem
Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impacts safety alignment, and that its effect differs across models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.
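A minimal sketch of linear identification and steering on toy activation vectors. It assumes a difference-of-means direction and additive steering, which are common choices but not necessarily the paper's probe; all names and numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def probe_direction(aware, unaware):
    """Unit vector from the mean 'unaware' to the mean 'aware' activation."""
    diff = [a - u for a, u in zip(mean_vector(aware), mean_vector(unaware))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def projection(hidden, direction):
    """Scalar readout: how far the activation lies along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))

def steer(hidden, direction, alpha):
    """Shift an activation along the direction; negative alpha steers away."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-D activations (hypothetical): awareness shows up on the first axis.
aware = [[2.0, 0.0], [4.0, 0.0]]
unaware = [[0.0, 0.0], [0.0, 0.0]]
direction = probe_direction(aware, unaware)

hidden = [1.0, 1.0]
steered = steer(hidden, direction, alpha=-2.0)
print(projection(hidden, direction), projection(steered, direction))
```

Steering with a negative `alpha` lowers the projection onto the awareness direction while leaving the orthogonal component unchanged, which is what lets downstream behavior be compared with and without the signal.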
Article 948
Title@2025-05-26 (1): ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows | ScienceBoard: Bewertung multimodaler autonomer Agenzien in realistischen wissenschaftlichen Workflows | 科学理事会:评估现实科学工作流程中的多式联运自治机构 2505.19897v1 |
Authors: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers’ workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights into addressing current agent limitations and devising more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
nan
Article 949
Title@2025-05-26 (1): Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program
Title: Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program | Große Sprachmodelle als autonome Raumfahrzeugbetreiber im Kerbal-Raumprogramm | 作为Kerbal空间方案自主航天器运营商的大型语言模型 2505.19896v1 |
Authors: Alejandro Carrasco, Victor Rodriguez-Fernandez, Richard Linares
Large Language Models (LLMs) are increasingly used as autonomous agents that take actions based on the content of user text prompts. We intend to apply these concepts to the field of control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).
nan
Article 950
Title@2025-05-26 (1): MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Title: MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System | MoC: Mischungen von Text Chunking Learners für retrieval-Augmented Generation System | MoC: 用于检索增强型生成系统的 文本冲击学习者混合体 2503.09600v2 |
Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
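The regex-based extraction step mentioned in the abstract can be illustrated with a minimal sketch. The pattern format here (one lazy start-to-end pattern per chunk, applied left to right) and the name `extract_chunks` are assumptions made for illustration; the abstract does not specify how MoC structures its regular expressions.

```python
import re


def extract_chunks(text, patterns):
    """Apply chunking regexes in order and return the matched chunks.

    Assumption: each pattern describes one chunk from its opening words
    to its closing words, and chunks appear in the same order as patterns.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        # DOTALL lets ".*?" span line breaks inside a chunk.
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            continue  # skip a pattern that does not match the remaining text
        chunks.append(match.group(0).strip())
        pos = match.end()  # continue searching after the extracted chunk
    return chunks


text = (
    "Cats are mammals. They purr when content. "
    "Rust is a systems language. It has no garbage collector."
)
patterns = [r"Cats are.*?content\.", r"Rust is.*?collector\."]
chunks = extract_chunks(text, patterns)
```

In this toy input the two patterns split the text at the topic change between the second and third sentences.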
nan
Article 951
Title@2025-05-26 (1): ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining
Title: ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining | ESLM: Risiko-Averse Selective Language Modeling für effizientes Vortraining | ESLM: 有效培训前风险-反风险选择语言建模 2505.19893v1 |
Authors: Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes and pretraining corpora, and integrates naturally with knowledge distillation.
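A minimal sketch of value-at-risk thresholding over per-token losses may help make the selection rule concrete. It assumes the nearest-rank empirical quantile as the VaR estimate and a plain Python list of losses; the function names and the `alpha` parameter are illustrative and are not taken from the paper.

```python
import math


def var_threshold(losses, alpha):
    """Empirical value-at-risk: the alpha-quantile of the per-token losses.

    Assumption: nearest-rank quantile, clamped to a valid index.
    """
    ordered = sorted(losses)
    k = min(len(ordered) - 1, max(0, math.ceil(alpha * len(ordered)) - 1))
    return ordered[k]


def select_tokens(losses, alpha=0.5):
    """Return a keep-mask for tokens whose loss reaches the VaR threshold."""
    threshold = var_threshold(losses, alpha)
    mask = [loss >= threshold for loss in losses]
    return mask, threshold


def tail_mean(losses, mask):
    """Mean loss over retained tokens, an empirical tail average."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)


token_losses = [0.1, 2.3, 0.4, 1.7, 0.05, 3.0, 0.9, 0.2]
mask, threshold = select_tokens(token_losses, alpha=0.75)
training_loss = tail_mean(token_losses, mask)
```

With `alpha=0.75`, only the three highest-loss tokens in this batch are retained, and averaging their losses gives a tail mean in the spirit of the conditional value-at-risk objective the abstract mentions.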
nan
Article 952
Title@2025-05-26 (1): Phare: A Safety Probe for Large Language Models
Title: Phare: A Safety Probe for Large Language Models | Phare: Eine Sicherheitssonde für große Sprachmodelle | 法尔:大语言模型的安全检测 2505.11365v4 |
Authors: Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.
nan
Article 953
Title@2025-05-26 (1): APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Title: APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs | APB: Beschleunigen des verteilten Long-Context-Schlussfolgerungens durch Übergeben von komprimierten Kontextblöcken über GPUs | APP: 通过通过横跨 GPU 传递压缩的上下文区块加速分布式长文字推文 2502.12085v2 |
Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling faster inference while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB at https://github.com/thunlp/APB.
nan
Article 954
Title@2025-05-26 (1): Explaining the role of Intrinsic Dimensionality in Adversarial Training
Title: Explaining the role of Intrinsic Dimensionality in Adversarial Training | Erklärung der Rolle der Intrinsischen Dimensionalität im Adversarial Training | 解释内在多面性在相互培训中的作用 2405.17130v2 |
Authors: Enes Altinisik, Safa Messaoud, Husrev Taha Sencar, Hassan Sajjad, Sanjay Chawla
Adversarial Training (AT) impacts different architectures in distinct ways: vision models gain robustness but face reduced generalization, encoder-based models exhibit limited robustness improvements with minimal generalization loss, and recent work in latent-space adversarial training (LAT) demonstrates that decoder-based models achieve improved robustness by applying AT across multiple layers. We provide the first explanation for these trends by leveraging the manifold conjecture: off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We show that vision and decoder-based models exhibit low intrinsic dimensionality in earlier layers (favoring off-manifold AEs), whereas encoder-based models do so in later layers (favoring on-manifold AEs). Exploiting this property, we introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. This reduces the projected gradient descent (PGD) chain length required for AE generation, cutting GPU time by 25-33% while significantly boosting robustness. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval augmented generation setups, demonstrating superior robustness with comparable generalization to standard training.
nan
Article 955
Title@2025-05-26 (1): HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation
Title: HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation | HS-STAR: Hierarchische Probenahme für selbstlernende Vernunfter über Schwierigkeitsschätzung und Budget-Umverteilung | HS-STAR:通过难以估计和预算重新定位为自学理性者进行等级抽样 2505.19866v1 |
Authors: Feng Xiong, Hongling Xu, Yifei Wang, Runxi Cheng, Yong Wang, Xiangxiang Chu
Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM’s reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
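The two-phase budget logic described above can be sketched as follows. This is an illustration under stated assumptions: difficulty is estimated as the mean pre-sample reward, "boundary-level" means a mean reward inside a hypothetical `[low, high]` band, and the remaining budget is split evenly. None of these specifics come from the paper.

```python
def reallocate_budget(presample_rewards, total_budget, low=0.2, high=0.8):
    """Estimate difficulty from pre-samples, then reallocate the rest.

    presample_rewards: dict mapping problem id -> list of rewards in [0, 1]
    obtained during the lightweight pre-sampling phase.
    """
    used = sum(len(rewards) for rewards in presample_rewards.values())
    remaining = total_budget - used
    if remaining < 0:
        raise ValueError("pre-sampling already exceeds the total budget")

    # Difficulty estimate: mean reward of the pre-samples for each problem.
    estimates = {
        pid: sum(rewards) / len(rewards)
        for pid, rewards in presample_rewards.items()
    }
    # Boundary-level problems are neither reliably solved nor reliably failed.
    boundary = [pid for pid, score in estimates.items() if low <= score <= high]

    allocation = {pid: 0 for pid in presample_rewards}
    if boundary:
        share, extra = divmod(remaining, len(boundary))
        for i, pid in enumerate(boundary):
            allocation[pid] = share + (1 if i < extra else 0)
    return estimates, allocation


rewards = {
    "easy": [1, 1, 1, 1],
    "boundary_a": [1, 0, 1, 0],
    "boundary_b": [0, 0, 1, 0],
    "hard": [0, 0, 0, 0],
}
estimates, allocation = reallocate_budget(rewards, total_budget=32)
```

Here 16 of the 32 samples are spent on pre-sampling, and the other 16 go only to the two problems with intermediate estimated difficulty.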
nan
Article 956
Title@2025-05-26 (1): REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models
Title: REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models | REA-RL: Reflection-Aware Online-Verstärkungs-Lernen für effiziente große Vernunftmodelle | REA-RL:为高效大型理由模型进行反思-软件在线强化学习 2505.19862v1 |
Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but this tends to erode reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. In addition, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Code is available at https://github.com/hexuandeng/REA-RL.
nan
Article 957
Title@2025-05-26 (1): Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages
Title: Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages | Über die Spezialisierung hinaus: Benchmarking LLMs für die Transliteration indischer Sprachen | 超越专业:为印度语言转写确定基准的LLMs 2505.19851v1 |
Authors: Gulfarogh Azam, Mohd Sadique, Saif Ali, Mohammad Nadeem, Erik Cambria, Shahab Saquib Sohail, Mohammad Sultan Alam
Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large, against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments used standard benchmarks, including the Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that GPT family models generally outperform other LLMs and IndicXlit in most instances. Additionally, fine-tuning GPT-4o notably improves performance on specific languages. An extensive error analysis and robustness testing under noisy conditions further elucidate the strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.
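The two metrics named above can be sketched with standard definitions: Top-1 Accuracy as the fraction of exact matches, and Character Error Rate as total Levenshtein edit distance divided by total reference length. This is a generic sketch, not the paper's evaluation script, and the romanized example strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def top1_accuracy(refs, hyps):
    """Fraction of top-ranked hypotheses that exactly match the reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = ["namaste", "dilli"]
hyps = ["namaste", "delhi"]
cer = character_error_rate(refs, hyps)
acc = top1_accuracy(refs, hyps)
```

Note that Python's `len` counts Unicode code points, so for Indic scripts a vowel sign or other combining mark counts as its own character; whether to score by code point or by grapheme cluster is an evaluation choice this sketch does not settle.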
nan
Article 958
Title@2025-05-26 (1): Improving Multilingual Math Reasoning for African Languages
Title: Improving Multilingual Math Reasoning for African Languages | Mehrsprachige mathematische Grundlagen für afrikanische Sprachen verbessern | 改进非洲语文多语种计算法 2505.19848v1 |
Authors: Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Esther Adenuga, David Ifeoluwa Adelani, Jimmy Lin
Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Because most large language models (LLMs) are predominantly trained on high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in today's LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
nan
Article 959
Title@2025-05-26 (1): FoodTaxo: Generating Food Taxonomies with Large Language Models
Title: FoodTaxo: Generating Food Taxonomies with Large Language Models | FoodTaxo: Generierung von Lebensmittel-Taxonomien mit großen Sprachmodellen | FoodTaxo: 产生具有大语言模式的食品分类学 2505.19838v1 |
Authors: Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster
We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.
nan
Article 960
Title@2025-05-26 (1): FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
Title: FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | FullFront: Benchmarking von MLLMs über den Full Front-End Engineering Workflow | FullFront:在全前端工程工作流程中确定MLLMs基准 2505.17399v2 |
Authors: Haoyu Sun, Huichen Will Wang, Jiawei Gu, Linjie Li, Yu Cheng
Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available at https://github.com/Mikivishy/FullFront.
nan
Article 961
Title@2025-05-26 (1): DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer’s Disease
Title: DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer’s Disease | DECT: LLM-unterstütztes feinkörniges Sprachwissen und etikettierte und etikettierte Datengenerierung zur Diagnose der Alzheimer-Krankheit | DECT:利用LLM协助的LLM协助的精精细语言知识以及用于诊断阿尔茨海默氏病的标签和标签保密数据生成 2502.04394v2 |
Authors: Tingyu Mo, Jacqueline C. K. Lam, Victor O. K. Li, Lawrence Y. L. Cheung
Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Low-cost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline and can be used to discriminate AD patients from normal control individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched and label-preserved data generation. Our study presents four novelties: (i) we harness the summarizing capabilities of LLMs to identify and distill key Cognitive-Linguistic information from noisy speech transcripts, effectively filtering irrelevant information; (ii) we leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts; (iii) we exploit the compositional ability of LLMs to generate AD speech transcripts consisting of diverse linguistic patterns to overcome the speech data scarcity challenge and enhance the robustness of AD detection models; and (iv) we use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. Results show that DECT outperforms the baselines, with an 11% improvement in AD detection accuracy on the datasets from DementiaBank.
nan
Article 962
Title@2025-05-26 (1): Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
Title: Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents | Hierarchische Retrieval mit Evidenz-Kuration für Open-Domain-Finanzfrage-Antworten auf standardisierte Dokumente | 标准化文件开放域财务问题证据说明的梯级检索 2505.20368v1 |
Authors: Jaeyoung Choe, Jihoon Kim, Woohwan Jung
Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process then removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
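The document-then-passage ordering can be sketched in a few lines. This is a toy illustration, not HiREC itself: the word-overlap scorer stands in for whatever retriever and reranker the framework uses, the score filter is only a crude stand-in for evidence curation, and the corpus contents are invented.

```python
def overlap_score(query, text):
    """Fraction of query words that also appear in the text (toy scorer)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """Rank documents first, then rank passages only within the kept documents.

    corpus: dict mapping doc id -> {"summary": str, "passages": [str, ...]}
    """
    ranked_docs = sorted(
        corpus,
        key=lambda d: overlap_score(query, corpus[d]["summary"]),
        reverse=True,
    )[:top_docs]
    candidates = [
        (overlap_score(query, passage), doc_id, passage)
        for doc_id in ranked_docs
        for passage in corpus[doc_id]["passages"]
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    # Drop passages with no overlap at all (crude stand-in for curation).
    return [(d, p) for score, d, p in candidates[:top_passages] if score > 0]


corpus = {
    "acme_10k_2023": {
        "summary": "Acme Corp annual report 2023",
        "passages": [
            "Acme total revenue in 2023 was 5 million dollars",
            "Risk factors include competition and regulation",
        ],
    },
    "globex_10k_2023": {
        "summary": "Globex Inc annual report 2023",
        "passages": [
            "Globex total revenue in 2023 was 9 million dollars",
            "Risk factors include competition and regulation",
        ],
    },
}
evidence = hierarchical_retrieve("Acme total revenue 2023", corpus)
```

Because the document-level step runs first, the near-duplicate revenue passage from the other filing is never scored against the query.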
nan
Article 963
Title@2025-05-26 (1): Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Title: Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric | Modell Utility Law: Bewertung von LLMs jenseits der Leistung durch Mechanism Interpretable Metric | 示范效用法:通过解释计量机制评价业绩以外的有限利妇女 2504.07440v3 |
Authors: Yixin Cao, Jiahao Ying, Yaoning Wang, Xipeng Qiu, Xuanjing Huang, Yugang Jiang
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. One core challenge of evaluation in the LLM era is the generalization issue: how to infer a model’s near-unbounded abilities from inevitably bounded benchmarks. We address this challenge by proposing the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores. MUI quantifies the effort a model expends on a task, defined as the proportion of activated neurons or features during inference. Intuitively, a truly capable model should achieve higher performance with lower effort. Extensive experiments across popular LLMs reveal a consistent inverse logarithmic relationship between MUI and performance, which we formulate as the Utility Law. From this law we derive four practical corollaries that (i) guide training diagnostics, (ii) expose data contamination issues, (iii) enable fairer model comparisons, and (iv) guide the design of model-specific dataset diversity. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.
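The definition of MUI as a proportion of activated neurons can be illustrated with a small sketch. The activation criterion used here (a neuron counts as activated if its value exceeds a threshold on at least one sample) and the nested-list input format are assumptions for illustration; the paper's actual attribution method may differ.

```python
def model_utilization_index(activations, threshold=0.0):
    """Share of neurons activated on at least one sample of a task.

    activations: list of samples; each sample is a list of layers;
    each layer is a list of per-neuron activation values.
    """
    total = sum(len(layer) for layer in activations[0])
    activated = set()
    for sample in activations:
        for layer_idx, layer in enumerate(sample):
            for neuron_idx, value in enumerate(layer):
                if value > threshold:
                    activated.add((layer_idx, neuron_idx))
    return len(activated) / total


# Two samples, two layers of three neurons each (toy values).
task_activations = [
    [[0.5, 0.0, -0.2], [0.0, 0.0, 1.2]],
    [[0.7, 0.0, 0.0], [0.3, 0.0, 0.0]],
]
mui = model_utilization_index(task_activations)
```

In this toy case three of the six neurons fire across the task, so the index is 0.5; under the intuition stated above, a model reaching the same accuracy with a lower index would be considered more capable.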
nan
Article 964
Title@2025-05-26 (1): Generalizable Prompt Learning of CLIP: A Brief Overview
Title: Generalizable Prompt Learning of CLIP: A Brief Overview | Generalisierbares Prompt Lernen von CLIP: Ein kurzer Überblick | CLIP:简要概述 2503.01263v5 |
Authors: Fangming Cui, Yonggang Zhang, Xuan Wang, Xule Wang, Liang Xiao
Existing vision-language models (VLMs) such as CLIP have showcased an impressive capability to generalize well across various downstream tasks. These models leverage the synergy between visual and textual information, enabling them to understand and reason about the content present in images and text in a unified manner. This article provides a brief overview of CLIP based on few-shot prompt learning, including experimental data and technical characteristics of some methods. The purpose of this review is to provide a reference for researchers who have just started their research in generalizable prompting of CLIP through few-shot training for classification across 15 datasets and also to facilitate the integration of this field by researchers in other downstream tasks.
nan
Article 965
Title@2025-05-26 (1): Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Title: Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation | Registrierung von Quellen-Token zu Zielspracheräumen in mehrsprachiger neuraler maschineller Übersetzung | 多种语言神经机翻译中目标语言空间 2501.02979v3 |
Authors: Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Multilingual neural machine translation (MNMT) aims to support arbitrary translation across multiple languages. Although MNMT-specific models trained on parallel data offer low costs in training and deployment, their performance consistently lags behind that of large language models (LLMs). In this work, we introduce registering, a novel method that enables a small MNMT-specific model to compete with LLMs. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, target token generation attends only to the activations of the registers, which represent the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method advances the state of the art in MNMT. We further pre-train two models, named MITRE (multilingual translation with registers), on 9.3 billion sentence pairs across 24 languages collected from public corpora. One of them, MITRE-913M, outperforms NLLB-3.3B, achieves performance comparable to commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.
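One way to picture the modified attention mask is the sketch below. It assumes a decoder-only layout of `[source | registers | target]` with causal attention, where target positions are additionally blocked from seeing source positions; whether target tokens also attend to earlier target tokens, and the exact handling inside the source block, are assumptions not stated in the abstract.

```python
def build_register_mask(n_src, n_reg, n_tgt):
    """Boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed sequence layout: source tokens, then registers, then target tokens.
    """
    n = n_src + n_reg + n_tgt
    reg_start = n_src
    tgt_start = n_src + n_reg
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            if q >= tgt_start and k < reg_start:
                continue  # target queries are blocked from source keys
            mask[q][k] = True
    return mask


mask = build_register_mask(n_src=2, n_reg=1, n_tgt=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions the registers are the only path by which source information reaches the target positions, which matches the abstract's description of registers representing the source tokens in the target language space.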
nan
Article 966
Title@2025-05-26 (1): Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Title: Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective | Entschlüsselung bahngestützter LLM-Reasoning: Eine Optimierungsperspektive | 解码轨迹辅助LLM 理由说明:优化视角 2505.19815v1 |
Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen
We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM’s parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
nan
Article 967
Title@2025-05-26 (1): Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks
Title: Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks | Erforschung des Bewusstseins in LLMs: Eine systematische Untersuchung von Theorien, Implementierungen und Grenzrisiken | 探索LLMM中的觉悟:对理论、实施和前沿风险的系统调查 2505.19806v1 |
Authors: Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at https://github.com/OpenCausaLab/Awesome-LLM-Consciousness.
Article 968
Title@2025-05-26 (1): Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation
Title: Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation | Compliance-to-Code: Verbesserung der finanziellen Compliance-Prüfung durch Codegenerierung | 遵守到守则:通过代码生成加强金融合规检查 2505.19804v1 |
Authors: Siyuan Li, Jian Chen, Rui Yao, Xuming Hu, Peilin Zhou, Weihua Qiu, Simin Zhang, Chucheng Dong, Zhiyao Li, Qipeng Xie, Zixuan Yuan
Regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. Financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which make them labor-intensive to apply and difficult to comprehend. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal, particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop domain-specific, code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or too coarse-grained for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. It covers 1,159 annotated clauses from 361 regulations across ten categories; each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information) along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.
Article 969
Title@2025-05-26 (1): QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
Title: QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | QueryAttack: Jailbreaking Aligned Large Language Models Verwendung strukturierter, nicht-natürlicher Abfragesprache | 查询:使用结构化非自然查询语言的监狱破碎的大型语言统一模式 2502.09723v3 |
Authors: Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment have been developed for defense, prior research reveals that such defenses can be bypassed through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into a structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack can not only achieve high attack success rates (ASRs) but also circumvent various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to $64\%$ on GPT-4-1106. Our code is available at https://github.com/horizonsinzqs/QueryAttack.
Article 970
Title@2025-05-26 (1): MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
Title: MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs | MOLE: Metadatenextraktion und -validierung in wissenschaftlichen Papieren mit LLMs | MOLE: 利用LLMs在科学文件中提取和验证元数据 2505.19800v1 |
Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets’ scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance. We release the code (https://github.com/IVUL-KAUST/MOLE) and the dataset (https://huggingface.co/datasets/IVUL-KAUST/MOLE) for the research community.
Article 971
Title@2025-05-26 (1): R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
Title: R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning | R1-T1: Volle Förderung der Übersetzungsfähigkeit in LLMs über das Reasoning Learning | R1-T1:通过解释学习充分激励LLMs翻译能力 2502.19735v3 |
Authors: Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered chains of thought (CoTs), remains underexplored. Existing methods either design a fixed CoT tailored for a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs unaligned with humans and on supervised fine-tuning (SFT) prone to overfitting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework to achieve inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation to broader MT scenarios (e.g., multilingual MT, domain MT) unseen in the training phase; (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies like context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery through RL. Both human and automatic evaluation results indicate a steady translation performance improvement across 10+ languages and 40+ translation directions on the Flores-101 test set and four domain-specific MT tasks, especially on languages unseen during training.
Article 972
Title@2025-05-26 (1): O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering | O$^2$-Sucher: Ein Such-basiertes Agentenmodell für Open-Domain Open-Ended Question Answering | O$^2$-Searcher:基于搜索的开放域开放式问题解答代理模式 2505.16582v2 |
Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by lacking a standard answer or admitting non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model’s sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt its answer generation strategy accordingly. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.
Article 973
Title@2025-05-26 (1): Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification
Title: Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification | Analyse politischer Bias in LLMs über zielorientierte Sentiment-Klassifikation | 通过定向感知分类分析LLMMs中的政治偏见 2505.19776v1 |
Authors: Akram Elbouanani, Evan Dufraisse, Adrian Popescu
Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small intermediate tasks (questionnaire answering or political content generation) and on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names into 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians, respectively, and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.
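The abstract does not spell out the entropy-based inconsistency metric, so the following is only a minimal sketch of one plausible reading: for a single sentence, collect the sentiment label predicted for each inserted name and take the Shannon entropy of that label distribution. The function names, the three-class label set, and the example predictions are all assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def prediction_entropy(labels):
    """Shannon entropy (in bits) of the labels predicted for one sentence.

    `labels` holds one predicted sentiment label per inserted name.
    0.0 means the prediction never changes with the name.
    """
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def normalized_inconsistency(labels, num_classes=3):
    """Entropy scaled to [0, 1] by the maximum for `num_classes` labels (assumed: 3)."""
    return prediction_entropy(labels) / math.log2(num_classes)


# Hypothetical predictions for one sentence with four different names inserted.
consistent = ["negative", "negative", "negative", "negative"]
mixed = ["positive", "negative", "neutral", "negative"]

print(prediction_entropy(consistent))       # same label for every name
print(prediction_entropy(mixed))            # label depends on the name
print(normalized_inconsistency(mixed))
```

Under this reading, a sentence whose predicted sentiment is identical for every name scores zero, and the score grows as predictions spread across labels; averaging it over sentences, models, or languages would give the aggregate views the abstract mentions.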
Article 974
Title@2025-05-26 (1): What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Title: What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs | Was spielt bei vielen scharfen Angriffen wirklich eine Rolle? Eine empirische Studie über langanhaltende Schwachstellen in LLMs | 许多热攻击的真正问题是什么? 2505.19773v1 |
Authors: Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee
We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments use context lengths of up to 128K tokens. Through comprehensive analysis of many-shot attack settings that vary instruction style, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in the long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in the context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.
Article 975
Title@2025-05-26 (1): Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Title: Query Performance Prediction using Relevance Judgments Generated by Large Language Models | Abfrage der Leistungsvorhersage anhand von Relevanzurteilen, die von großen Sprachmodellen erzeugt werden | 使用大语言模型产生的相关性判断的查询性绩效预测 2404.01012v3 |
Authors: Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, Maarten de Rijke
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item’s relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019 to 2022 deep learning tracks and CAsT-19 and 20 datasets show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
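To make the idea of using generated relevance judgments as pseudo-labels concrete, here is a small sketch that computes two standard IR measures from a list of per-item judgments. It is not the authors' implementation: the judgment list is hypothetical, and the nDCG variant takes its ideal ranking from the judged items only, which is an assumed simplification standing in for the recall-aware approximation the abstract mentions.

```python
import math


def precision_at_k(judgments, k):
    """Fraction of the top-k ranked items judged relevant (label > 0)."""
    return sum(1 for r in judgments[:k] if r > 0) / k


def dcg_at_k(judgments, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(judgments[:k]))


def ndcg_at_k(judgments, k):
    """nDCG where the ideal ranking is built from the judged items only.

    Assumption: unjudged items in the rest of the corpus are ignored,
    so this approximates the true, recall-aware nDCG.
    """
    ideal = dcg_at_k(sorted(judgments, reverse=True), k)
    return dcg_at_k(judgments, k) / ideal if ideal > 0 else 0.0


# Hypothetical binary judgments an LLM produced for a ranked list, in rank order.
predicted_judgments = [1, 0, 1, 1, 0]

print(precision_at_k(predicted_judgments, 5))
print(ndcg_at_k(predicted_judgments, 5))
```

Because the per-item judgments are kept, the same list can feed any other measure, and an individual wrong judgment can be inspected and corrected, which is the interpretability benefit the abstract describes.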
Article 976
Title@2025-05-26 (1): Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Title: Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO | Verständnis der Leistungslücke im Preference Learning: Eine Dichotomie von RLHF und DPO | 了解优先学习方面的绩效差距:RLHF和DPO的二分切开术 2505.19770v1 |
Authors: Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model – highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
Article 977
Title@2025-05-26 (1): T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search
Title: T^2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search | T^2Agent Ein Tool-augmented Multimodale Fehlinformation Detection Agent mit Monte Carlo Baumsuche | T2 A A 工具增强的多式错误信息检测代理 蒙特卡洛树搜索工具 2505.19768v1 |
Authors: Xing Cui, Yueying Zou, Zekun Li, Peipei Li, Xinyuan Xu, Xuannan Liu, Huaibo Huang, Ran He
Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a Bayesian optimization-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free approach for enhancing detection accuracy. The code will be released.
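The abstract does not say which selection rule the tree search uses, so the sketch below only illustrates the generic idea of treating a selected tool subset as the action space, using the textbook UCT (UCB1) rule. The tool names come from the abstract; the statistics, the exploration constant, and the use of a single scalar reward in place of the paper's dual reward are assumptions for illustration.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Textbook UCT value: mean reward plus an exploration bonus.

    Unvisited actions get +inf so each tool is tried at least once.
    """
    if visits == 0:
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_tool(stats, c=1.4):
    """Pick the tool with the highest UCT score.

    `stats` maps a tool name to (total_reward, visits) at the current node.
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(
        stats,
        key=lambda name: uct_score(stats[name][0], stats[name][1], parent_visits, c),
    )


# Hypothetical node statistics over a task-relevant tool subset.
node_stats = {
    "web_search": (2.4, 4),
    "forgery_detection": (1.5, 2),
    "consistency_analysis": (0.0, 0),
}

print(select_tool(node_stats))
```

The exploration term is what lets a search spread across different forgery sources before committing to the tool that has produced the most reliable evidence so far.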
Article 978
Title@2025-05-26 (1): SGM: A Framework for Building Specification-Guided Moderation Filters
Title: SGM: A Framework for Building Specification-Guided Moderation Filters | SGM: Ein Rahmen für gebäudespezifikationsgeführte Moderationsfilter | SGM: 构建规格引导调控过滤器的框架 2505.19766v1 |
Authors: Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
Article 979
Title@2025-05-26 (1): In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement
Title: In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement | In-Context-Demonstrationsfragen: Zur Prompt-Optimierung für Pseudo-Supervision-Verfeinerung | 内文示范事项:关于Pseudo-监督改进的迅速优化 2410.03124v2 |
Authors: Zhen-Yu Zhang, Jiandong Zhang, Huaxiu Yao, Gang Niu, Masashi Sugiyama
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. Most existing methods rely on human supervision or parameter retraining, both of which are costly in terms of data collection and computational resources. To handle these challenges, a direct solution is to generate “high-confidence” data from unsupervised downstream tasks and use them for in-context prompting or prompt optimization to refine the pseudo-supervision. However, relying solely on such data may lead to overfitting. In this paper, we leverage the in-context learning (ICL) abilities of LLMs and propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision. The proposed learning objective ensures that the optimized prompt guides the LLM to generate consistent responses for a given input when pseudo-supervised data from the downstream task are used as demonstrations, enabling refinement over the entire pseudo-supervision. The prompt is optimized by translating gradient signals into textual critiques, which serve as feedback to iteratively refine the prompt and model responses. Theoretical analysis in a simplified classification setting shows that the refined pseudo-supervision exhibits a geometric clustering structure, helping to mitigate overfitting. Experiments on question answering, natural language inference benchmarks, and a real-world molecule optimization task show the effectiveness of the proposed algorithm.
Article 980
Title@2025-05-26 (1): CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement
Title: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement | CIDRe: Ein referenzfreies Multi-Aspekt-Kriterium für die Qualitätsmessung von Code Comment | CIDRe: 守则评论质量衡量的无参考性、无参考性、多特征的多标准标准 2505.19757v1 |
Authors: Maria Dziuba, Valentin Malykh
Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) suffer from limited code-comment analysis. We propose CIDRe, a language-agnostic reference-free quality criterion combining four synergistic aspects: (1) relevance (code-comment semantic alignment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate our criterion on a manually annotated dataset. Experiments demonstrate CIDRe’s superiority over existing metrics, achieving improvement in cross-entropy evaluation. When applied to filter comments, the models finetuned on CIDRe-filtered data show statistically significant quality gains in GPT-4o-mini assessments.
Article 981
Title@2025-05-26 (1): Efficient Reasoning via Chain of Unconscious Thought
Title: Efficient Reasoning via Chain of Unconscious Thought | Effiziente Vernunft durch Kette des unbewussten Denkens | 通过无意识思维链进行高效率的思考 2505.19756v1 |
Authors: Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang
Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: https://github.com/Rohan-GRH/CoUT
Article 982
Title@2025-05-26 (1): NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering | NeuSym-RAG: Hybrides neurales Symbolisches Retrieval mit Multiview-Strukturierung für PDF-Fragebeantwortung | NeuSym-RAG: PDF 问题解答混合神经符号回收与多视图结构结构 2505.19754v1 |
Authors: Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu
The increasing number of academic papers makes it difficult for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both a relational database and a vector store, enabling LLM agents to iteratively gather context until it is sufficient to generate answers. Experiments on three full PDF-based QA datasets, including the self-annotated AIRQA-REAL, show that NeuSym-RAG consistently outperforms both vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.
Article 983
Title@2025-05-26 (1): Discrete Markov Bridge
Title: Discrete Markov Bridge | Diskretierte Markov-Brücke | 分立马尔科夫桥 2505.19752v1 |
Authors: Hengli Li, Yuxuan Wang, Song-Chun Zhu, Ying Nian Wu, Zilong Zheng
Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.
Article 984
Title@2025-05-26 (1): Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents
Title: Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents | Mobile-Bench-v2: Ein realistischerer und umfassenderer Benchmark für VLM-basierte mobile Agenten | 移动-Bench-v2:基于VLM的移动剂更加现实和全面的基准 2505.11891v2 |
Authors: Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An
VLM-based mobile agents are increasingly popular because they can interact with smartphone GUIs and XML-structured texts and complete daily tasks. However, existing online benchmarks struggle to obtain stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate agents through single-path trajectories, which conflicts with the inherently multi-solution nature of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions, because the evaluation process lacks noisy apps and uses overly complete instructions. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent’s ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ad apps, and a contaminated split named AITZ-Noise, to form a realistic noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent’s proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.
Article 985
Title@2025-05-26 (1): Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models
Title: Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models | Erforschung der Auswirkungen von Corpus Diversity auf vorschulische Sprachmodelle | 探讨公司多样性对财务方面缺乏培训语言模式的影响 2310.13312v2 |
Authors: Jaeyoung Choe, Keonwoong Noh, Nayeon Kim, Seyun Ahn, Woohwan Jung
Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to subpar generalization performance, with the result that general-purpose PLMs, including BERT, often outperform financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpora and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general-domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.
Article 986
Title@2025-05-26 (1): Stuffed Mamba: Oversized States Lead to the Inability to Forget
Title: Stuffed Mamba: Oversized States Lead to the Inability to Forget | Gefüllte Mamba: Übergroße Staaten führen zu der Unfähigkeit zu vergessen | 马姆巴:国家规模过大,导致无法忘却 2410.07145v2 |
Authors: Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where information from different tokens conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to “forget” earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
Article 987
Title@2025-05-26 (1): Distilling Closed-Source LLM’s Knowledge for Locally Stable and Economic Biomedical Entity Linking
Title: Distilling Closed-Source LLM’s Knowledge for Locally Stable and Economic Biomedical Entity Linking | Brennen von geschlossener Quelle LLMs Wissen für lokal stabile und wirtschaftliche biomedizinische Entitätsverknüpfung | 保留秘密来源LLM的当地稳定和经济生物医学实体联系知识 2505.19722v1 |
Authors: Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou
Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this limitation but bring stability issues and high economic costs: access to these models is controlled by commercial companies, and processing large amounts of data with them is expensive. To address this, we propose “RPDR”, a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge into an open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, one real-world dataset and one publicly available dataset, covering two languages: Chinese and English. When training data is scarce, RPDR improves Acc@1 by 0.019 on the Aier dataset and by 0.036 on the Ask A Patient dataset. The results demonstrate the superiority and generalizability of the proposed framework.
Article 988
Title@2025-05-26 (1): Graceful Forgetting in Generative Language Models
Title: Graceful Forgetting in Generative Language Models | Anmutiges Vergessen in generativen Sprachmodellen | 在创用语言模型中优雅地忘却 2505.19715v1 |
Authors: Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue, Yike Guo
Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model promotes both the effectiveness and the efficiency of downstream fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With the Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
Article 989
Title@2025-05-26 (1): MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning
Title: MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning | MT$^{3}$: Skalierung von MLLM-basierten Textbildmaschinenübersetzungen über Multi-Task-Verstärkungslernen | MT$^{3}$:通过多任务强化学习,扩大基于MLLM的文本图像机翻译 2505.19714v1 |
Authors: Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT’s intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
Article 990
Title@2025-05-26 (1): FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Title: FamilyTool: A Multi-hop Personalized Tool Use Benchmark | FamilyTool: Ein Multi-Hop Personalisiertes Tool Benchmark | FamilyTool:多希望个性化工具使用基准 2504.06766v2 |
Authors: Yuxin Wang, Yiran Guo, Yining Zheng, Zhangyue Yin, Shuo Chen, Jie Yang, Jiajun Chen, Yuan Li, Xuanjing Huang, Xipeng Qiu
The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool, including base and extended datasets, challenges LLMs with queries spanning from 1 to 4 relational hops (e.g., inferring familial connections and preferences) and 2 to 6 hops respectively, and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs’ tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents’ reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at https://github.com/yxzwang/FamilyTool.
Article 991
Title@2025-05-26 (1): Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision
Title: Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Fehler-Typierung für intelligentere Belohnungen: Verbesserung der Prozess-Reward-Modelle mit Fehler-Aware Hierarchische Überwachung | 为智能奖赏打字出错: 改进有错误- 软件等级监督的流程评分模型 2505.19706v1 |
Authors: Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
Article 992
Title@2025-05-26 (1): Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Title: Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models | Nutzung von wichtigen Stichproben zur Abgleichung von Alignment-Modulen aus großen Sprachmodellen | 从大语言模型中利用重要性取样到分离对齐模块 2505.19700v1 |
Authors: Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
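The importance-sampling view in this abstract can be illustrated with generic sampling-importance-resampling. The sketch below is not the RAM algorithm itself: `importance_resample` and `weight_fn` are hypothetical names, the candidates are assumed to have been drawn already from the proposal (the unaligned model), and `weight_fn` stands in for whatever importance-weight estimator the alignment module provides.

```python
import random


def importance_resample(candidates, weight_fn, k=1, rng=None):
    """Resample k items from proposal samples, proportionally to their weights.

    candidates: samples already drawn from the proposal distribution.
    weight_fn:  hypothetical estimator returning a non-negative importance weight.
    """
    rng = rng or random.Random()
    weights = [weight_fn(c) for c in candidates]
    if sum(weights) <= 0:
        raise ValueError("all importance weights are zero")
    # random.choices draws with replacement, proportionally to `weights`.
    return rng.choices(candidates, weights=weights, k=k)


if __name__ == "__main__":
    drafts = ["draft A", "draft B", "draft C"]
    # Toy weight function: strongly prefer "draft B".
    toy_weight = lambda text: 10.0 if text.endswith("B") else 1.0
    print(importance_resample(drafts, toy_weight, k=3, rng=random.Random(0)))
```

Resampling needs only proposal samples and their weights, which is why the weighting module can be kept separate from the model that generates the candidates.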
Article 993
Title@2025-05-26 (1): Large Language Models for Planning: A Comprehensive and Systematic Survey
Title: Large Language Models for Planning: A Comprehensive and Systematic Survey | Große Sprachmodelle für die Planung: Eine umfassende und systematische Erhebung | 规划大语言模式:全面和系统调查 2505.19683v1 |
Authors: Pengfei Cao, Tianyi Men, Wencan Liu, Jingwen Zhang, Xuzhao Li, Xixun Lin, Dianbo Sui, Yanan Cao, Kang Liu, Jun Zhao
Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.
Article 994
Title@2025-05-26 (1): Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
Title: Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | Aufwärmen, bevor Sie trainieren: Entsperren der allgemeinen Vernunft in ressourcenbeschränkten Einstellungen | 在您之前暖暖的列车 : 在受资源限制的设置中解锁一般理由 2505.13718v2 |
Authors: Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross
Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we “warm up” the model by distilling Long CoTs from a toy domain, namely, Knights & Knaves (K&K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) When both the base model and the warmed-up model are RLVR trained on the same small dataset ($\leq 100$ examples), the warmed-up model consistently outperforms the base model; (iii) Warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) Introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
Article 995
Title@2025-05-26 (1): Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors
Title: Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors | Ihr Sprachmodell kann geheim wie Menschen schreiben: Kontrastive Paraphrasenangriffe auf LLM-generierte Textdetektoren | 您的语言模式可以像人类一样秘密写作:对LLM-Generated 文本探测器的矛盾性插词攻击 2505.15337v2 |
Authors: Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged that purposely rewrite these texts to evade detection. Despite their success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy drops sharply when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
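The subtraction step described above resembles generic contrastive decoding. The following is a minimal sketch under stated assumptions, not the paper's exact formulation: the scoring rule `log p_human - alpha * log p_machine`, the weight `alpha`, and the function names are all illustrative, and the two logit vectors are assumed to come from the same LLM under a human-style prompt and a machine-style prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(human_logits, machine_logits, alpha=0.5):
    """Score each token as log p_human - alpha * log p_machine (assumed rule)."""
    lp_human = log_softmax(human_logits)
    lp_machine = log_softmax(machine_logits)
    return [h - alpha * m for h, m in zip(lp_human, lp_machine)]


def pick_token(human_logits, machine_logits, alpha=0.5):
    """Greedy choice: index of the token with the highest contrastive score."""
    scores = contrastive_scores(human_logits, machine_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    human = [2.0, 1.9, 0.0]    # toy next-token logits under the human-style prompt
    machine = [3.0, 0.0, 0.0]  # toy logits under the machine-style prompt
    print(pick_token(human, machine, alpha=0.0))  # plain greedy decoding
    print(pick_token(human, machine, alpha=0.5))  # penalizes machine-favored tokens
```

With `alpha=0` this reduces to ordinary greedy decoding; larger values push the choice away from tokens that the machine-style distribution strongly favors.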
Article 996
Title@2025-05-26 (1): Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
Title: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis | LLM-generierter koreanischer Text durch Linguistik-Feature-Analyse erkennen | 通过语言特征分析探测LLM-发光韩文文本 2503.00032v3 |
Authors: Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han
The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.
Article 997
Title@2025-05-26 (1): UniICL: An Efficient Unified Framework Unifying Compression, Selection, and Generation
Title: UniICL: An Efficient Unified Framework Unifying Compression, Selection, and Generation | UniICL: Ein effizientes einheitliches Framework, das Komprimierung, Auswahl und Generierung vereint | UNIICL: 统一压缩、甄选和生成的有效统一框架 2405.17062v3 |
Authors: Jun Gao, Qi Lv, Zili Wang, Tianxiang Wu, Ziqiang Cao, Wenjie Li
In-context learning (ICL) enhances the reasoning abilities of Large Language Models (LLMs) by prepending a few demonstrations. This motivates researchers to introduce more examples to provide additional contextual information for generation. However, existing methods are significantly limited by excessive growth in context length, which imposes a large hardware burden. In addition, shallow-relevant examples selected by off-the-shelf tools hinder LLMs from capturing useful contextual information for generation. In this paper, we propose UniICL, a novel Unified ICL framework that unifies demonstration compression, demonstration selection, and final response generation. Furthermore, to boost inference efficiency, we design a tailored compression strategy that allows UniICL to cache compression results in a Demonstration Bank (DB), which avoids repeated compression of the same demonstration. Extensive out-of-domain evaluations demonstrate the advantages of UniICL in both effectiveness and efficiency.
Article 998
Title@2025-05-26 (1): KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
Title: KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Low-Resource Speech Translation Systems des KIT für IWSLT2025: Systemverbesserung mit synthetischen Daten und Modellregularisierung | KIT的IWSLT2025低资源语音翻译系统:利用合成数据和模型规范化加强系统 2505.19679v1 |
Authors: Zhaolin Li, Yining Liu, Danni Liu, Tuan Nam Nguyen, Enes Yavuz Ugan, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues
This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
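The final system-combination step relies on Minimum Bayes Risk (MBR) decoding. As a rough sketch of the general technique, the code below pools candidate translations and picks the one with the highest total utility against the others. The unigram F1 utility is an assumed stand-in (the abstract reports BLEU but does not state the MBR utility), and `unigram_f1` and `mbr_select` are hypothetical names.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace-token overlap (stand-in for BLEU or similar)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, utility=unigram_f1):
    """Return the hypothesis with the highest summed utility against all others."""
    def expected_utility(i):
        return sum(
            utility(hypotheses[i], other)
            for j, other in enumerate(hypotheses)
            if j != i
        )
    best = max(range(len(hypotheses)), key=expected_utility)
    return hypotheses[best]


if __name__ == "__main__":
    # Candidates pooled from, say, a cascaded and an end-to-end system.
    pool = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
    print(mbr_select(pool))
```

Because the selection needs only the candidate strings, the pool can mix outputs from different systems without access to their internal scores.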
Article 999
Title@2025-05-26 (1): Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Title: Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs | Erdungssprache mit Vision: Eine bedingte Gegenseitige Informationskalibrierte Dekodierungsstrategie zur Reduktion von Halluzinationen in LVLMs | 具有远见的地面语言:减少低地低地飘移中幻觉的有条件相互信息校准标记战略 2505.19678v1 |
Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs’ over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
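For orientation, the textbook definition of conditional pointwise mutual information is given below. The notation is assumed rather than taken from the paper: $y$ is the generated text, $v$ the image tokens, and $x$ the textual prompt. The paper's C-PMI objective may be parameterized differently.

```latex
\mathrm{PMI}(y; v \mid x)
  = \log \frac{p(y, v \mid x)}{p(y \mid x)\, p(v \mid x)}
  = \log \frac{p(y \mid v, x)}{p(y \mid x)}
```

A larger value means the image makes the generated text more likely than the language prior alone would, which is the dependency the calibrated decoding strategy aims to strengthen.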
Article 1000
Title@2025-05-26 (1): Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement
Title: Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Kalibrierung vortrainierter Sprachklassifikatoren auf LLM-generierten Noisy-Labels über iterative Veredelung | 通过迭代精炼校准LLM产生的噪音标签上的训练前语言分类校准 2505.19675v1 |
Authors: Liqin Ye, Agam Shah, Chao Zhang, Sudheer Chava
The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue for generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model’s generalization is likely to be harmed because it is prone to overfitting to the label noise. While previous studies on learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP (Simplex Label Diffusion with Dynamic Prior) to calibrate the classifier’s prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on GitHub.
Article 1001
Title@2025-05-26 (1): Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
Title: Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Trennen Sie das Weizen vom Chaff: Ein Post-Hoc-Ansatz für die Wiederausrichtung der Sicherheit für feingetönte Sprachmodelle | 将小麦与Chaff区分开来:对精美语言模式的安全调整后方法 2412.11041v3 |
Authors: Di Wu, Xin Lu, Yanyan Zhao, Bing Qin
Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. We evaluate the effectiveness of IRR across various datasets, including both full fine-tuning and LoRA methods. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: https://anonymous.4open.science/r/IRR-BD4F.
Article 1002
Title@2025-05-26 (1): A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit
Title: A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit | Ein voll generativer Motivationsgespräch Berater Chatbot für den Umzug Raucher auf dem Weg zu der Entscheidung zu beenden | 全面创造动机的访谈参赞Chatbot 移动吸烟者争取决定退出 2505.17362v2 |
Authors: Zafarullah Mahmood, Soliman Ali, Jiading Zhu, Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Jodi Wolff, Osnat Melamed, Nadia Minian, Marta Maslej, Carolynne Cooper, Matt Ratto, Peter Selby, Jonathan Rose
The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot’s adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants’ confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants’ language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.
Article 1003
Title@2025-05-26 (1): Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
Title: Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | Reformieren von Repräsentationsräumen, um Sicherheit und Überrejektion in großen Audio-Sprachenmodellen auszugleichen | 重塑代表空间以平衡大型音频语言模型中的安全和过度拒绝 2505.19670v1 |
Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety without causing over-rejection, which significantly compromises helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model’s representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
Article 1004
Title@2025-05-26 (1): GTR: Graph-Table-RAG for Cross-Table Question Answering
Title: GTR: Graph-Table-RAG for Cross-Table Question Answering | GTR: Graph-Table-RAG für Cross-Table-Frageantworten | GTR:用于跨表问题解答的图表表-RAG 2504.01346v3 |
Authors: Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He
Beyond pure text, a substantial amount of knowledge is stored in tables. In real-world scenarios, user questions often require retrieving answers that are distributed across multiple tables. GraphRAG has recently attracted much attention for enhancing LLMs’ reasoning capabilities by organizing external knowledge to address ad-hoc and complex questions, exemplifying a promising direction for cross-table question answering. In this paper, to address the current gap in available data, we first introduce a multi-table benchmark, MultiTableQA, comprising 60k tables and 25k user queries collected from real-world sources. Then, we propose the first Graph-Table-RAG framework, namely GTR, which reorganizes table corpora into a heterogeneous graph, employs a hierarchical coarse-to-fine retrieval process to extract the most relevant tables, and integrates graph-aware prompting for downstream LLMs’ tabular reasoning. Extensive experiments show that GTR exhibits superior cross-table question-answering performance while maintaining high deployment efficiency, demonstrating its real-world practical applicability.
Article 1005
Title@2025-05-26 (1): LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
Title: LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | LeCoDe: Ein Benchmark-Datensatz für interaktive Rechtsberatungs-Dialog-Evaluierung | LeCode:交互式法律协商对话评价的基准数据集 2505.19667v1 |
Authors: Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu
Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs’ legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs’ consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs’ legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
Article 1006
Title@2025-05-26 (1): Conditioning LLMs to Generate Code-Switched Text
Title: Conditioning LLMs to Generate Code-Switched Text | LLMs konditionieren, um codegeschalteten Text zu erzeugen | 将LLM 限定为生成代码开关文本 2502.12924v2 |
Authors: Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models’ performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.
Article 1007
Title@2025-05-26 (1): Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning
Title: Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning | Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen | 通过差异学习发现多语种轻视认知缺陷的单形多语种描述 2505.17067v2 |
Authors: Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang
Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.
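For readers unfamiliar with the headline metric, the sketch below shows how Unweighted Average Recall is conventionally computed: the recall of each class is calculated separately and the per-class recalls are averaged with equal weight, so a small class counts as much as a large one. The function name and the toy labels are hypothetical illustrations, not the paper's code or data.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical toy labels (not from the paper): 1 = MCI, 0 = control.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Recall for class 1 is 3/4, recall for class 0 is 1/2, so UAR = 0.625.
uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

Plain accuracy on the same toy labels would be 4/6, which is higher than the UAR because the larger class is predicted better; this is why UAR is often preferred when classes are imbalanced.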
Article 1008
Title@2025-05-26 (1): GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models
Title: GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models | GenKI: Verbesserung der Open-Domain-Fragebeantwortung mit Wissensintegration und kontrollierbarer Generierung in großen Sprachmodellen | GenKI:加强以大语言模式在知识整合和可控生成方面答案的开放性问题 2505.19660v1 |
Authors: Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM and an ensemble based on text consistency that accounts for coherence, fluency, and answer-format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, demonstrate the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies reveal a linear relationship between the frequency of retrieved knowledge and the model’s ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
Article 1009
Title@2025-05-26 (1): A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?
Title: A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language? | Eine Geschichte von zwei Strukturen: Erfassen LLMs die Fraktalkomplexität der Sprache? | 两种结构的故事:LLMs是否捕捉语言的分形复杂性? 2502.14924v2 |
Authors: Ibrahim Alabdulmohsin, Andreas Steiner
Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions, such as temperature setting and prompting method, under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs’ output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Article 1010
Title@2025-05-26 (1): Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation
Title: Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation | Auswählen, Lesen und Schreiben: Ein multi-agenter Rahmen volltextbasierter verwandter Arbeit Generation | 选择、读取和写入:全文本相关工作生成的多机构代理框架 2505.19647v1 |
Authors: Xiaochuan Liu, Ruihua Song, Xiting Wang, Xu Chen
Automatic related work generation (RWG) can save people’s time and effort when writing a draft of a related work section (RWS) for further revision. However, existing RWG methods suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to capture the relationships among references. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, which optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.
Article 1011
Title@2025-05-26 (1): Interleaved Reasoning for Large Language Models via Reinforcement Learning
Title: Interleaved Reasoning for Large Language Models via Reinforcement Learning | Interleaved Reasoning für große Sprachmodelle durch Verstärkungslernen | 通过强化学习促进大语言模式 2505.19640v1 |
Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.
Article 1012
Title@2025-05-26 (1): Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Title: Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models | Segment First or Comprehend First? Erforschen Sie die Grenzen der unüberwachten Wortsegmentierung mit großen Sprachmodellen | 首段或首段理解 ? 探索以大语言模式进行不受监督的单词分割的限制 。 2505.19631v1 |
Authors: Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of “comprehend first, segment later”, we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs’ “comprehension”. Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA ($\textbf{L}$arge $\textbf{L}$anguage Model-Inspired $\textbf{A}$ho-$\textbf{C}$orasick $\textbf{A}$utomaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic $n$-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
Article 1013
Title@2025-05-26 (1): DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue | DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog | DocrAgentor-RL:多轮临床对话多机构合作强化学习系统 2505.19630v1 |
Authors: Yichun Feng, Jiawei Wang, Lu Zhou, Yixue Li
Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Existing systems rely on a one-way information transmission mode where patients must fully describe their symptoms in a single round, leading to nonspecific diagnostic recommendations when complaints are vague. Traditional multi-turn dialogue methods based on supervised learning are constrained by static data-driven paradigms, lacking generalizability and struggling to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, demonstrating practical value in assisting clinical consultations. https://github.com/JarvisUSTC/DoctorAgent-RL
Article 1014
Title@2025-05-26 (1): Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models
Title: Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models | Denken Sie noch einmal! Die Wirkung von Test-Time Compute auf Präferenzen, Meinungen und Überzeugungen von großen Sprachmodellen | 再想想!测试时间计算对大语言模式的优惠、意见和信仰的影响 2505.19621v1 |
Authors: George Kour, Itay Nakash, Ateret Anaby-Tavor, Michal Shmueli-Scheuer
As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it’s crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs’ subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS
Article 1015
Title@2025-05-26 (1): Lens: Rethinking Multilingual Enhancement for Large Language Models
Title: Lens: Rethinking Multilingual Enhancement for Large Language Models | Objektiv: Mehrsprachige Erweiterung für große Sprachmodelle neu denken | 镜头:重新思考为大语言模式重新思考多语种增强大语言模式 2410.04407v2 |
Authors: Weixiang Zhao, Yulin Hu, Jiahe Guo, Xingyu Sui, Tongtong Wu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Ting Liu
As global demand for multilingual large language models (LLMs) grows, most LLMs remain overly focused on English, leading to limited access to advanced AI for non-English speakers. Current methods to enhance multilingual capabilities largely rely on data-driven post-training techniques, such as multilingual instruction tuning or continual pre-training. However, these approaches exhibit significant limitations, including high resource cost, exacerbation of the off-target issue, and catastrophic forgetting of central language abilities. To this end, we propose Lens, a novel approach that enhances multilingual capabilities by leveraging LLMs’ internal language representation spaces. Lens operates on two subspaces: the language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and the language-specific subspace, where it separates target and central languages to preserve linguistic specificity. Experiments on three English-centric LLMs show that Lens significantly improves multilingual performance while maintaining the model’s English proficiency, achieving better results with less computational cost compared to existing post-training approaches.
Article 1016
Title@2025-05-26 (1): Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization
Title: Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization | Erforschung der Verallgemeinerbarkeit von Factual Halluzination Mitigation durch die Verbesserung präziser Wissensnutzung | 探索通过增强利用精确的知识来减轻事实幻觉的普及性 2502.19127v2 |
Authors: Siyuan Zhang, Yichi Zhang, Yinpeng Dong, Hang Su
Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations, which can be difficult to detect and can mislead users who lack the relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs between different capabilities. In this paper, we propose to address it by directly augmenting an LLM’s fundamental ability to precisely leverage its knowledge, and we introduce PKUE, which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data items spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves overall LLM performance, with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in a different language.
Article 1017
Title@2025-05-26 (1): Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs
Title: Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs | Sind die versteckten Staaten etwas verbergen? Testen Sie die Grenzen der Faktizität-Encoding Fähigkeiten in LLMs | 隐秘国是否隐藏着什么?测试LLMM中实际质量-编码能力限度。 2505.16520v2 |
Authors: Giovanni Servedio, Alessandro De Bellis, Dario Di Palma, Vito Walter Anelli, Tommaso Di Noia
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.
Article 1018
Title@2025-05-26 (1): Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically
Title: Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically | Sprachen in mehrsprachigen Sprachstiftungsmodellen richten sowohl phonetisch als auch semantisch | 多语种语言语言基金会 2505.19606v1 |
Authors: Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.
Article 1019
Title@2025-05-26 (1): Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis
Title: Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis | Machine Translation Models für Englisch-Hindi Sprachpaare bewerten: Eine vergleichende Analyse | 英文-中文语文配对评价机器翻译模型:比较分析 2505.19604v1 |
Authors: Ahan Prasannakumar Shetty
Machine translation has become a critical tool in bridging linguistic gaps, especially between languages as diverse as English and Hindi. This paper comprehensively evaluates various machine translation models for translating between English and Hindi. We assess the performance of these models using a diverse set of automatic evaluation metrics, both lexical and machine learning-based. Our evaluation leverages an English-Hindi parallel corpus of over 18,000 entries and a custom FAQ dataset comprising questions from government websites. The study aims to provide insights into the effectiveness of different machine translation approaches in handling both general and specialized language domains. Results indicate varying performance levels across different metrics, highlighting strengths and areas for improvement in current translation systems.
Article 1020
Title@2025-05-26 (1): Preference Optimization by Estimating the Ratio of the Data Distribution
Title: Preference Optimization by Estimating the Ratio of the Data Distribution | Präferenzoptimierung durch Schätzung des Verhältnisses der Datenverteilung | 通过估计数据分配比率实现最佳优化 2505.19601v1 |
Authors: Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon
Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu’s power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.
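As background for the special case mentioned above, here is a minimal sketch of the standard per-pair DPO loss, which the abstract says BPO subsumes. It is not the BPO objective or the SBA scaling from the paper, whose exact forms are not given here. The function name, argument names, and the example log-probabilities are hypothetical; the inputs are assumed to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a sequence log-probability (hypothetical inputs);
    beta scales the implicit reward margin.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    # Simple form for illustration; not numerically hardened for large negative margins.
    return math.log1p(math.exp(-margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen log-probability relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
print(baseline, improved)
```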
Article 1021
Title@2025-05-26 (1): Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Title: Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar | Inkonsistente Tokenisierungen führen dazu, dass Sprachmodelle von japanischer Grammatik verblüfft werden. | 前后不一致的招数导致语言模式被日语语法所混淆 2505.19599v1 |
Authors: Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the “first person psych predicate restriction” grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab’s uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3’s perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
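To make the measurement concrete, the sketch below computes perplexity in its usual form: the exponential of the negative mean log-probability per token. The token probabilities are invented for illustration and do not come from the paper or from any of the models named in it; in practice they would be read from a language model's output for each tokenized sentence.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities for two three-token sentences.
# The "ungrammatical" sentence contains one token the model finds very unlikely.
grammatical = [math.log(0.5), math.log(0.25), math.log(0.5)]
ungrammatical = [math.log(0.5), math.log(0.01), math.log(0.5)]

ppl_ok = perplexity(grammatical)
ppl_bad = perplexity(ungrammatical)
print(ppl_ok, ppl_bad)
```

Because the average is taken per token, the same sentence can receive very different perplexities under different tokenizations, which is the kind of effect the abstract attributes to inconsistent tokenization.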
Article 1022
Title@2025-05-26 (1): Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study | Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie | 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v1 |
Authors: Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang
Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, their vulnerabilities and resilience were quantitatively assessed. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALMs deployment.
nan
Article 1023
Title@2025-05-26 (1): Multi-Agent Collaboration via Evolving Orchestration
Title: Multi-Agent Collaboration via Evolving Orchestration | Multi-Agenten-Zusammenarbeit über Evolving Orchestration | 通过不断演变的管弦化多机构协作 2505.19591v1 |
Authors: Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator (“puppeteer”) dynamically directs agents (“puppets”) in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator’s evolution.
nan
Article 1024
Title@2025-05-26 (1): SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
Title: SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation | SepALM: Audio Sprachmodelle sind Fehlerkorrekturen für robuste Sprachtrennung | SepALM: 音频语言模型是强力语音分离错误纠正器 2505.03273v2 |
Authors: Zhaoxi Mu, Xinyu Yang, Gang Wang
While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
nan
Article 1025
Title@2025-05-26 (1): Learning to Reason without External Rewards
Title: Learning to Reason without External Rewards | Vernunft lernen ohne externe Belohnungen | 学习没有外部奖励的理性 2505.19590v1 |
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model’s own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO’s performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
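The substitution described above can be pictured with the group-relative normalization commonly used in GRPO, applied to per-response scores. This is a hedged sketch: the scores stand in for self-certainty values, whose exact definition the abstract does not give. The function name, the `eps` term, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
import statistics


def group_relative_advantages(scores, eps=1e-8):
    """Normalize the scores of responses sampled for the same prompt.

    `scores` holds one scalar per sampled response. In the RLIF setting these
    would be intrinsic confidence scores rather than external rewards.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std; an illustrative choice
    # Responses scoring above the group mean get positive advantages.
    return [(s - mean) / (std + eps) for s in scores]
```

Only the source of the scores changes relative to reward-based GRPO; the normalization within each group of sampled responses is the same.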
nan
Article 1026
Title@2025-05-26 (1): Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
Title: Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing | Beschleunigung der Vorfüllung für Langkontext-LLMs über Sparse Pattern Sharing | 通过 Sparse 模式共享加速预填长文本 LLMs 2505.19578v1 |
Authors: Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
nan
Article 1027
Title@2025-05-26 (1): Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Title: Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch | Cheems: Eine praktische Anleitung für das Bauen und Evaluieren chinesischer Belohnungsmodelle von Scratch | Cheems:从Scratch建立和评估中国奖励模型实用指南 2502.17173v3 |
Authors: Xueru Wen, Jie Lou, Zichao Li, Yaojie Lu, Xing Yu, Yuqiu Ji, Guohai Xu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Debing Zhang
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
nan
Article 1028
Title@2025-05-26 (1): DocMEdit: Towards Document-Level Model Editing
Title: DocMEdit: Towards Document-Level Model Editing | DocMEdit: Auf dem Weg zur Dokumenten-Level-Modellbearbeitung | DocMEdit:走向文件级别示范编辑 2505.19572v1 |
Authors: Li Zeng, Zeming Liu, Chong Feng, Heyan Huang, Yuhang Guo
Model editing aims to correct errors and outdated knowledge in large language models (LLMs) at minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. To address this limitation and promote the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.
nan
Article 1029
Title@2025-05-26 (1): Rethinking Text-based Protein Understanding: Retrieval or LLM?
Title: Rethinking Text-based Protein Understanding: Retrieval or LLM? | Rethinking Text-basierte Protein-Verständnis: Retrieval oder LLM? | 重新思考基于文本的蛋白质理解:检索还是LLM? 2505.20354v1 |
Authors: Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model’s performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.
nan
Article 1030
Title@2025-05-26 (1): Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights
Title: Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | Automatisierter Text-zu-Tisch für reasoning-intensive Tabelle QA: Pipeline-Design und Benchmarking-Insights | QA:管道设计和基准透视 2505.19563v1 |
Authors: Shi-Yu Tian, Zhi Zhou, Wei Dong, Ming Yang, Kun-Yang Yu, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) Reliance on costly manually annotated real-world data, which struggles to cover complex reasoning scenarios; 2) The heterogeneity of table structures hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline AutoT2T that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.
nan
Article 1031
Title@2025-05-26 (1): On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation
Title: On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation | On-Policy-Selbstjustierung mit feinkörnigem Wissen Feedback zur Halluzination Mitigation | 政策上与精精精细知识的自我协调以缓解幻觉的反馈 2406.12221v6 |
Authors: Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun
Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained feedback at the statement level is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH’s effectiveness in hallucination mitigation.
nan
Article 1032
Title@2025-05-26 (1): Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
Title: Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Benchmarking multimodaler Retrieval Augmented Generation mit dynamischem VQA-Datensatz und selbstadaptivem Planungs-Agent | 具有动态VQA数据集和自适应规划剂的多式回收增强型 2411.02937v5 |
Authors: Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S. Yu, Fei Huang, Jingren Zhou
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the “hallucination” issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of “dynamic” questions, which require complex knowledge retrieval strategies that vary in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human question-solving behavior by dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments demonstrate the effectiveness of OmniSearch and also provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.
nan
Article 1033
Title@2025-05-26 (1): Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents
Title: Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents | Auf dem Weg zu Multi-Granularität Memory Association und Auswahl für langfristige Conversational Agents | 走向多群体记忆协会和选择长期对话代理人 2505.19549v1 |
Authors: Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu
Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limitations, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings.
nan
Article 1034
Title@2025-05-26 (1): How Syntax Specialization Emerges in Language Models
Title: How Syntax Specialization Emerges in Language Models | Wie Syntax Spezialisierung in Sprachmodelle auftaucht | 语言模式中的语法专门化如何出现 2505.19548v1 |
Authors: Xufeng Duan, Zhaoqian Yao, Yunhao Zhang, Shaonan Wang, Zhenguang G. Cai
Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remain largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a ‘critical period’ of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.
nan
Article 1035
Title@2025-05-26 (1): Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
Title: Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements | Betrug-R1 : Multi-Round Benchmark für die Bewertung der Robustheit von LLM gegen Augmented Betrug und Phishing Inducings | 欺诈R1:评估防止增加欺诈和钓鱼诱骗行为LLM的有力程度的多基准 2502.12904v2 |
Authors: Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
We introduce Fraud-R1, a benchmark designed to evaluate LLMs’ ability to defend against internet fraud and phishing in dynamic, real-world scenarios. Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job postings, social media, and news, categorized into 5 major fraud types. Unlike previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to assess LLMs’ resistance to fraud at different stages, including credibility building, urgency creation, and emotional manipulation. Furthermore, we evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM provides general decision-making assistance, and 2. Role-play, where the model assumes a specific persona, widely used in real-world agent-based interactions. Our evaluation reveals the significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings. Additionally, we observe a substantial performance gap between Chinese and English, underscoring the need for improved multilingual fraud detection capabilities.
nan
Article 1036
Title@2025-05-26 (1): R3: Robust Rubric-Agnostic Reward Models
Title: R3: Robust Rubric-Agnostic Reward Models | R3: Robuste Rubric-Agnostische Belohnungsmodelle | R3:坚固的Rubric-不可知奖赏模型 2505.13388v2 |
Authors: David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, Genta Indra Winata
Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3
nan
Article 1037
Title@2025-05-26 (1): Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
Title: Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs | Amulett: Neuausrichtung während der Testzeit für Personalisierte Präferenzanpassung von LLMs | 缩略图:在试验期间重新对准,以适应LLMM的个性化偏好 2502.19148v2 |
Authors: Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang
How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users’ personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
nan
Article 1038
Title@2025-05-26 (1): DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients
Title: DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients | DoctorRAG: Medizinische RAG Durch Textabstufungen Wissen mit Patient Analogie fusionieren | 医生RAG:通过文字梯度将医学RAG知识与病人分析知识与病人分析相融合 2505.19538v1 |
Authors: Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Sin Yee Goi, Jinzhuo Wang
Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases – a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism from both relevant knowledge and patient. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.
nan
Article 1039
Title@2025-05-26 (1): Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation
Title: Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation | Können große Sprachmodelle ein guter emotionaler Unterstützer sein? Preference Bias auf Emotional Support Conversation abmildern | 大语言模式能否成为情感支持的良好支持者? 2402.13211v3 |
Authors: Dongjin Kang, Sunghwan Kim, Taeyoon Kwon, Seungjun Moon, Hyunsouk Cho, Youngjae Yu, Dongha Lee, Jinyoung Yeo
Emotional Support Conversation (ESC) is a task aimed at alleviating individuals’ emotional distress through daily conversation. Given its inherent complexity and non-intuitive nature, the ESConv dataset incorporates support strategies to facilitate the generation of appropriate responses. Recently, despite the remarkable conversational ability of large language models (LLMs), previous studies have suggested that they often struggle with providing useful emotional support. Hence, this work initially analyzes the results of LLMs on ESConv, revealing challenges in selecting the correct strategy and a notable preference for a specific strategy. Motivated by these findings, we explore the impact of the inherent preference in LLMs on providing emotional support, and we observe that exhibiting a high preference for specific strategies hinders effective emotional support, undermining its robustness in predicting the appropriate strategy. Moreover, we conduct a methodological study to offer insights into the necessary approaches for LLMs to serve as proficient emotional supporters. Our findings emphasize that (1) low preference for specific strategies hinders the progress of emotional support, (2) external assistance helps reduce preference bias, and (3) existing LLMs alone cannot become good emotional supporters. These insights suggest promising avenues for future research to enhance the emotional intelligence of LLMs.
nan
Article 1040
Title@2025-05-26 (1): FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Title: FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models | FlowCut: Redundanz über Informationsfluss für effiziente Vision-Sprachenmodelle neu denken | 流程:通过信息流动重新思考通过信息流动实现高效愿景-语言模型的冗余 2505.19536v1 |
Authors: Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model’s inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
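The single-layer criterion that the abstract questions can be sketched as a simple top-k selection over attention scores. This illustrates only the baseline style of pruning described above, not FlowCut's information-flow-aware method. The function name, the `keep_ratio` parameter, and the rounding rule are assumptions made for this example.

```python
def prune_by_attention(scores, keep_ratio):
    """Return indices of visual tokens to keep, in their original order.

    `scores` is one attention score per visual token from a single layer
    (for example, attention received from a summary token such as CLS).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because the ranking uses scores from one layer only, two layers that disagree about a token's importance produce different keep sets, which is the contradiction the abstract points to.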
nan
Article 1041
Title@2025-05-26 (1): SLOT: Sample-specific Language Model Optimization at Test-time
Title: SLOT: Sample-specific Language Model Optimization at Test-time | Steckplatz: Beispielspezifische Sprachmodelloptimierung zur Testzeit | SPLOT: 测试时特定抽样语文示范模式优化 2505.12392v2 |
Authors: Yang Hu, Xingyu Zhang, Xueji Fang, Zhiyang Chen, Xiao Wang, Huatian Zhang, Guojun Qi
We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model’s ability to respond accurately to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, performing poorly on those not well represented among general samples. To address this, SLOT conducts a few optimization steps at test time to update a lightweight sample-specific parameter vector. The vector is added to the final hidden layer before the output head, and adaptation is efficient because the last-layer features are cached during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT gains about 8.6 percentage points of accuracy on GSM8K, from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.
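The idea of optimizing a per-sample vector on top of cached final-layer features can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the output head `W`, the cached features, the learning rate, and the use of plain gradient descent are all made up for the example.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(h, delta, W):
    """Output-head logits for one cached hidden state shifted by delta."""
    shifted = [hi + di for hi, di in zip(h, delta)]
    return [sum(w * x for w, x in zip(row, shifted)) for row in W]


def prompt_loss(hidden, targets, W, delta):
    """Mean cross-entropy of the prompt's next tokens under the shifted features."""
    nll = 0.0
    for h, t in zip(hidden, targets):
        nll -= math.log(softmax(logits_for(h, delta, W))[t])
    return nll / len(hidden)


def slot_step(hidden, targets, W, delta, lr):
    """One gradient-descent step on delta; the head W and cached features stay fixed."""
    grad = [0.0] * len(delta)
    for h, t in zip(hidden, targets):
        probs = softmax(logits_for(h, delta, W))
        for v, row in enumerate(W):
            # d(loss)/d(logit_v) = p_v - 1[v == t]; d(logit_v)/d(delta_j) = W[v][j]
            g = probs[v] - (1.0 if v == t else 0.0)
            for j, w in enumerate(row):
                grad[j] += g * w / len(hidden)
    return [d - lr * g for d, g in zip(delta, grad)]


# Toy setup (assumed): 3-token vocabulary, 2-dimensional features, 2 prompt positions.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [[0.5, -0.2], [0.1, 0.3]]  # cached last-layer features (made up)
targets = [0, 1]                    # next-token ids observed in the prompt

delta = [0.0, 0.0]
loss_before = prompt_loss(hidden, targets, W, delta)
for _ in range(3):
    delta = slot_step(hidden, targets, W, delta, lr=0.1)
loss_after = prompt_loss(hidden, targets, W, delta)
print(loss_before, loss_after)
```

Only `delta` changes between steps, so the expensive forward pass that produced `hidden` is needed once per sample; each step touches only the output head.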
Article 1042
Title@2025-05-26 (1): SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback
Title: SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback | SIPDO: Closed-Loop Prompt Optimierung über Synthetic Data Feedback | SIPDO:通过合成数据反馈,通过闭闭电话快速优化 2505.19514v1 |
Authors: Yaoning Yu, Ye Yu, Kai Wei, Haojing Luo, Haohan Wang
Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
Article 1043
Title@2025-05-26 (1): Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models
Title: Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models | Kausaldestillation: Übertragen strukturierter Erklärungen von großen zu kompakten Sprachmodellen | 因果蒸馏:将结构化解释从大语言模式转移到集约语言模式 2505.19511v1 |
Authors: Aggrey Muhebwa, Khalid K. Osman
Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a novel framework for distilling causal explanations that transfers causal reasoning skills from a powerful teacher model to a compact open-source model. The key idea is to train the smaller model to develop causal reasoning abilities by generating structured cause-and-effect explanations consistent with those of the teacher model. To evaluate the quality of the student-generated explanations, we introduce a new metric called Causal Explanation Coherence (CEC) to assess the structural and logical consistency of causal reasoning. This metric uses sentence-level semantic alignment to measure how well each part of the generated explanation corresponds to the teacher’s reference, capturing both faithfulness and coverage of the underlying causal chain. Our framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.
Article 1044
Title@2025-05-26 (1): StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Title: StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | StepSearch: LLMs entzünden Suche Fähigkeit über Schritt-Wise Proximal Policy Optimization | 切换搜索:通过 “ 一步步Wise “ 方案最佳政策优化化,将LLMs搜索能力化 2505.15107v2 |
Authors: Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because their rewards come only from a sparse global signal. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It uses richer, more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We construct a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines while using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released at https://github.com/Zillwang/StepSearch.
Article 1045
Title@2025-05-26 (1): DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Title: DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation | DOGe: Defensive Output Generation für LLM-Schutz vor Wissensdestillation | DOGe: 防知识蒸馏保护LLM的防御性产出产生 2505.19504v1 |
Authors: Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher’s internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while the original performance of the teacher model is preserved or even improved, student models distilled from the defensively generated teacher outputs exhibit catastrophically reduced performance, demonstrating our method’s effectiveness as a practical safeguard against KD-based model imitation.
Article 1046
Title@2025-05-26 (1): QAEncoder: Towards Aligned Representation Learning in Question Answering System
Title: QAEncoder: Towards Aligned Representation Learning in Question Answering System | QAEncoder: Auf dem Weg zu einem abgestimmten Repräsentationslernen im Fragebeantwortungssystem | QAEncolder:在问题解答系统中实现代表性统一学习 2409.20434v2 |
Authors: Zhengren Wang, Qinhan Yu, Shida Wei, Zhiyu Li, Feiyu Xiong, Xiaoxing Wang, Simin Niu, Hao Liang, Wentao Zhang
Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. We introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments across diverse datasets, languages, and embedding models confirmed QAEncoder’s alignment capability, which offers a simple-yet-effective solution with zero additional index storage, retrieval latency, training costs, or catastrophic forgetting and hallucination issues. The repository is publicly available at https://github.com/IAAR-Shanghai/QAEncoder.
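The core estimate, using the expectation of potential query embeddings as a surrogate for the document embedding, can be approximated by a sample mean. The sketch below shows only that step, with made-up two-dimensional embeddings; the document-fingerprint component is omitted, and the helper names are hypothetical rather than taken from the QAEncoder repository.

```python
import math


def mean_embedding(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of queries that the document could answer.
query_embeddings = [[1.0, 0.0], [0.0, 1.0]]
surrogate = mean_embedding(query_embeddings)  # stored in place of the document embedding

user_query = [1.0, 1.0]
similarity = cosine(surrogate, user_query)
print(surrogate, similarity)
```

Since the surrogate has the same dimensionality as the original embedding and replaces it in the index, a scheme of this shape needs no additional index storage, which matches the claim in the abstract.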
Article 1047
Title@2025-05-26 (1): Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
Title: Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents | Anveshana: Ein neuer Benchmark-Datensatz für Cross-Lingual Information Retrieval über englische Abfragen und Sanskrit-Dokumente | Anveshana:英语问答和梵文文件跨语言信息检索新基准数据集 2505.19494v1 |
Authors: Manoj Balaji Jagadeeshan, Prince Raj, Pawan Goyal
The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit’s linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana
Article 1048
Title@2025-05-26 (1): NExtLong: Toward Effective Long-Context Training without Long Documents
Title: NExtLong: Toward Effective Long-Context Training without Long Documents | NExtLong: Auf dem Weg zu effektiver Langtext-Schulung ohne lange Dokumente | NExtLong:争取在无长文件的情况下进行有效长文培训 2501.12766v2 |
Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and to leading models trained on non-synthetic long documents. These findings highlight NExtLong’s ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
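The decomposition and interleaving step can be sketched as follows. This is a minimal illustration under assumptions: the chunk size, the placement of distractors directly after each meta-chunk, and the function names are chosen for the example, and retrieval of hard negatives from a corpus is replaced by hand-written lists.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive meta-chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def interleave_with_negatives(chunks, negatives_per_chunk):
    """Follow each meta-chunk with its distractors, preserving chunk order.

    negatives_per_chunk[i] is the list of distractor passages for chunks[i].
    """
    if len(chunks) != len(negatives_per_chunk):
        raise ValueError("need one list of negatives per chunk")
    extended = []
    for chunk, negatives in zip(chunks, negatives_per_chunk):
        extended.extend(chunk)
        for negative in negatives:
            extended.extend(negative)
    return extended


document = ["a", "b", "c", "d"]
chunks = split_into_chunks(document, chunk_size=2)
# Stand-ins for hard negatives retrieved from a pretraining corpus.
negatives = [[["x1"], ["x2"]], [["y1"]]]
long_sample = interleave_with_negatives(chunks, negatives)
print(long_sample)
```

The original chunks keep their order but end up far apart, so predicting a later chunk requires attending past the distractors to the earlier ones.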
Article 1049
Title@2025-05-26 (1): When can isotropy help adapt LLMs’ next word prediction to numerical domains?
Title: When can isotropy help adapt LLMs’ next word prediction to numerical domains? | Wann kann Isotropie helfen, die nächste Wortvorhersage von LLMs an numerische Domänen anzupassen? | 何时才能帮助LLMS的下一个字词预测适应数字域? 2505.17135v2 |
Authors: Rashed Shelim, Shengzhe Xu, Walid Saad, Naren Ramakrishnan
Recent studies have shown that vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black-box and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. By formulating a gradient structure of self-attention in pre-trained models, we show how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numeric data and model architecture could have different impacts on isotropy.
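The shift-invariance referred to here is a standard property of the softmax function and can be written out directly. For a logit vector $z \in \mathbb{R}^{V}$ and any scalar $c$ (symbols introduced here for illustration, not taken from the paper):

```latex
\operatorname{softmax}(z + c\mathbf{1})_i
  = \frac{e^{z_i + c}}{\sum_{j=1}^{V} e^{z_j + c}}
  = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_{j=1}^{V} e^{z_j}}
  = \operatorname{softmax}(z)_i .
```

The output distribution therefore determines the logits only up to an additive constant, which is the ambiguity that the structure of the hidden representations has to account for.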
Article 1050
Title@2025-05-26 (1): PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims
Title: PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims | PASS-FC: Progressives und adaptives Suchschema für die Prüfung umfassender Ansprüche | PASS-FC: 全面索赔事实核实渐进和适应性搜索计划 2504.09866v2 |
Authors: Ziyu Zhuang
Automated fact-checking (AFC) still falters on claims that are time-sensitive, entity-ambiguous, or buried beneath noisy search-engine results. We present PASS-FC, a Progressive and Adaptive Search Scheme for Fact Checking. Each atomic claim is first grounded with a precise time span and disambiguated entity descriptors. An adaptive search loop then issues structured queries, filters domains through credible-source selection, and expands queries cross-lingually; when necessary, a lightweight reflection routine restarts the loop. Experiments on six benchmarks (covering general knowledge, scientific literature, real-world events, and ten languages) show that PASS-FC consistently outperforms prior systems, even those powered by larger backbone LLMs. On the multilingual X-FACT set, performance across languages partially correlates with typological closeness to English, and forcing the model to reason in low-resource languages degrades accuracy. Ablations highlight the importance of temporal grounding and the adaptive search scheme, while detailed analysis shows that cross-lingual retrieval contributes genuinely new evidence. Code and full results will be released to facilitate further research.
Article 1051
Title@2025-05-26 (1): HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Title: HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning | HellaSwag-Pro: Ein großformatiger zweisprachiger Benchmark zur Bewertung der Robustheit von LLMs in Commonsense Reasoning | HellaSwag-Pro:用于评价常识理由解释中LLMs是否强劲的大型双语双语基准 2502.11393v2 |
Authors: Xiaoyuan Li, Moxin Li, Rui Men, Yichang Zhang, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu, Junyang Lin
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.
Article 1052
Title@2025-05-26 (1): MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
Title: MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation | MTR-Bench: Umfassender Benchmark für die Bewertung von Multi-Turn-Reasoning | 中期审查-后期:多重理由评价综合基准 2505.17123v2 |
Authors: Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating LLMs’ Multi-Turn Reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and requires multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results offers valuable insights for future research in interactive AI systems.
Article 1053
Title@2025-05-26 (1): Parrot: Multilingual Visual Instruction Tuning
Title: Parrot: Multilingual Visual Instruction Tuning | Papagei: Mehrsprachige visuelle Anleitung | Parrot: 多语言视觉教学图示 2406.02539v3 |
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot
Article 1054
Title@2025-05-26 (1): ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Title: ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search | ARise: Auf dem Weg zu einer wissensbasierten Vernunft durch Risiko-Adaptive Search | ARise:通过风险-减轻风险的搜索寻求知识推理 2504.10893v2 |
Authors: Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%. Our project page is at https://opencausalab.github.io/ARise.
Article 1055
Title@2025-05-26 (1): Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin
Title: Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin | Auf dem Weg zum Ende der Ausbildung zur automatischen Spracherkennung für nigerianische Pidgin | 走向尼日利亚皮吉纳自动语音识别的端至端培训 2010.11123v2 |
Authors: Amina Mardiyyah Rufai, Afolabi Abeeb, Esther Oduntan, Tayo Arulogun, Oluwabukola Adegboro, Daniel Ajisafe
The prevalence of automatic speech recognition (ASR) systems in spoken language applications has increased significantly in recent years. Notably, many African languages lack sufficient linguistic resources to support the robustness of these systems. This paper focuses on the development of an end-to-end speech recognition system customized for Nigerian Pidgin English. We investigated and evaluated different pretrained state-of-the-art architectures on a new dataset. Our empirical results demonstrate notable performance of the Wav2Vec2 XLSR-53 variant on our dataset, achieving a word error rate (WER) of 29.6% on the test set and surpassing other architectures such as NEMO QUARTZNET and Wav2Vec2.0 BASE-100H in quantitative assessments. Additionally, we demonstrate that pretrained state-of-the-art architectures do not work well out of the box. We performed zero-shot evaluation using XLSR-English as the baseline, chosen for its similarity to Nigerian Pidgin. This yielded a higher WER of 73.7%. By adapting this architecture to nuances represented in our dataset, we reduce the error by 59.84% relative to that baseline. Our dataset comprises 4,288 recorded utterances from 10 native speakers, partitioned into training, validation, and test sets. This study underscores the potential for improving ASR systems for under-resourced languages like Nigerian Pidgin English, contributing to greater inclusion in speech technology applications. We publicly release our unique parallel dataset (speech-to-text) on Nigerian Pidgin, as well as the model weights on Hugging Face. Our code will be made available to foster future research from the community.
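The reported 59.84% reduction is consistent with a relative reduction computed from the two WER figures in the abstract. The short check below makes that arithmetic explicit; the function name is hypothetical.

```python
def relative_reduction(baseline, adapted):
    """Relative reduction of an error rate, in percent of the baseline."""
    return (baseline - adapted) / baseline * 100.0


zero_shot_wer = 73.7   # XLSR-English, zero-shot baseline (from the abstract)
fine_tuned_wer = 29.6  # after adaptation to the Nigerian Pidgin dataset
reduction = relative_reduction(zero_shot_wer, fine_tuned_wer)
print(round(reduction, 2))  # 59.84
```

In absolute terms the drop is 44.1 WER points; dividing by the 73.7% baseline gives the relative figure.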
Article 1056
Title@2025-05-26 (1): FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models
Title: FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models | FastCurL: Curriculum-Verstärkungs-Lernen mit Stage-Wise-Kontext-Skalierung für effizientes Training R1-ähnliche Reasoning-Modelle | FastCuRL: 课程强化学习,分阶段为高效率培训提供R1类理由模型 2503.17287v4 |
Authors: Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, Feng Zhang
Improving training efficiency continues to be one of the primary challenges in large-scale Reinforcement Learning (RL). In this paper, we investigate how context length and the complexity of training data influence the RL scaling training process of R1-distilled small reasoning models, e.g., DeepSeek-R1-Distill-Qwen-1.5B. Our experimental results reveal that: (1) simply controlling the context length and curating the training data based on the input prompt length can effectively improve the training efficiency of scaling RL, achieving better performance with more concise CoT; (2) properly scaling the context length helps mitigate entropy collapse; and (3) choosing an optimal context length can improve the efficiency of model training and incentivize the model’s chain-of-thought reasoning capabilities. Inspired by these insights, we propose FastCuRL, a curriculum RL framework with stage-wise context scaling to achieve efficient training and concise CoT reasoning. Experimental results demonstrate that FastCuRL-1.5B-V3 significantly outperforms state-of-the-art reasoning models on five competition-level benchmarks and achieves 49.6% accuracy on AIME 2024. Furthermore, FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five benchmarks while using only a single node with 8 GPUs and 50% of the training steps.
Article 1057
Title@2025-05-26 (1): BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Title: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs | BizFinBench: Ein geschäftsgetriebener Real-World-Finanz-Benchmark für die Bewertung von LLMs | BizFin BinBenench:商业驱动的现实世界评价长效信贷额度的金融基准 2505.19457v1 |
Authors: Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
Article 1058
Title@2025-05-26 (1): HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation
Title: HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation | HopRAG: Multi-Hop-Gründung für die Logic-Aware Retrieval-Augmented Generation | HOPRAG: 逻辑-软件检索多功能原因 2502.12442v2 |
Authors: Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, Wentao Zhang
Retrieval-Augmented Generation (RAG) systems often struggle with imperfect retrieval, as traditional retrievers focus on lexical or semantic similarity rather than logical relevance. To address this, we propose HopRAG, a novel RAG framework that augments retrieval with logical reasoning through graph-structured knowledge exploration. During indexing, HopRAG constructs a passage graph, with text chunks as vertices and logical connections established via LLM-generated pseudo-queries as edges. During retrieval, it employs a retrieve-reason-prune mechanism: starting with lexically or semantically similar passages, the system explores multi-hop neighbors guided by pseudo-queries and LLM reasoning to identify truly relevant ones. Experiments on multiple multi-hop benchmarks demonstrate that HopRAG’s retrieve-reason-prune mechanism can expand the retrieval scope based on logical connections and improve final answer quality.
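The retrieve-reason-prune loop can be pictured as a bounded graph traversal. The sketch below is a simplification under assumptions: the passage graph is a plain adjacency dictionary, the LLM reasoning step is replaced by a hypothetical `is_relevant` callable, and pruning ranks passages by hop distance, which is a stand-in rather than the scoring used in the paper.

```python
from collections import deque


def retrieve_reason_prune(graph, seeds, is_relevant, max_hops, top_k):
    """Breadth-first exploration of a passage graph.

    graph: dict mapping a passage id to a list of neighbour passage ids.
    seeds: passages returned by an initial lexical or semantic retriever.
    is_relevant: callable(passage_id) -> bool, a stand-in for LLM reasoning.
    Returns up to top_k passage ids, ordered by hop distance and then by id.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for neighbour in graph.get(node, []):
            # Reason: only follow edges to passages judged relevant.
            if neighbour not in hops and is_relevant(neighbour):
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    # Prune: keep the top_k closest passages.
    ranked = sorted(hops, key=lambda p: (hops[p], p))
    return ranked[:top_k]


passage_graph = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": [], "p4": ["p5"]}
relevant = {"p2", "p4", "p5"}
result = retrieve_reason_prune(
    passage_graph,
    seeds=["p1"],
    is_relevant=lambda p: p in relevant,
    max_hops=2,
    top_k=3,
)
print(result)
```

Here `p4` is reached only through `p2`, so a retriever limited to direct similarity with the query would miss it, while `p5` stays out of scope because of the hop limit.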
Article 1059
Title@2025-05-26 (1): Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Title: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | Pixel Reasoner: Anreize für Pixel-Space-Reasoning mit kuriositätsgetriebenem Verstärkungslernen | 像素理由:激励像素空间与好奇-驱动强化学习相结合的像素空间理由 2505.15966v2 |
Authors: Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model’s initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
Article 1060
Title@2025-05-26 (1): Discovering Forbidden Topics in Language Models
Title: Discovering Forbidden Topics in Language Models | Verbotene Themen in Sprachmodellen entdecken | 发现语言模型中的禁止专题 2505.17441v2 |
Authors: Can Rager, Chris Wendler, Rohit Gandikota, David Bau
Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: the model exhibits “thought suppression” behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusal answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
Article 1061
Title@2025-05-26 (1): Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
Title: Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering | Ausrichtung großer Sprachmodelle, um Anweisungen zu folgen und weniger Halluzinate über effektive Datenfilterung | 通过有效的数据过滤使大语言模型与遵循指令和低致幻模型相匹配 2502.07340v3 |
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Kaikai An, Kangyang Luo, Chen Qian, Fanchao Qi, Baobao Chang, Maosong Sun
Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM’s understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less.
Article 1062
Title@2025-05-26 (1): Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
Title: Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI | Vibe Coding vs. Agentic Coding: Grundlagen und praktische Implikationen von Agentic AI | Vibe 编码与 Agentic 编码:Agent AI 的基本要素和实际影响 2505.19443v1 |
Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.
Article 1063
Title@2025-05-26 (1): The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models
Title: The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models | Die Geburt des Wissens: Emergente Funktionen über Zeit, Raum und Maßstab in großen Sprachmodellen | 知识的诞生:跨越时间、空间和大语言模型规模的新兴特征 2505.19440v1 |
Authors: Shashata Sawmya, Micah Adler, Nir Shavit
This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.
Article 1064
Title@2025-05-26 (1): Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Title: Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | Surrogate Signale aus Format und Länge: Verstärkungslernen zur Lösung mathematischer Probleme ohne Grundwahrheitsantworten | 格式和长度的代用信号:为解决没有事实答案的数学问题进行强化学习 2505.19439v1 |
Authors: Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang
Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems and reducing the dependence on extensive ground truth data collection, but also reveals the essence of why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding results in exams, in other words, to unlock the capabilities it already possesses.
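As a rough illustration of what a format-plus-length surrogate reward might look like, the sketch below scores a response without consulting any ground truth answer. The abstract does not specify the actual reward, so the `\boxed{...}` format rule, the linear length shape, the `target_len` value, and the `length_weight` parameter are all assumptions made for illustration.

```python
import re


def format_reward(response: str) -> float:
    """Return 1.0 if the response has a non-empty \\boxed{...} answer, else 0.0.

    The \\boxed{} convention is an assumed format rule, not taken from the paper.
    """
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", response) else 0.0


def length_reward(response: str, target_len: int = 200) -> float:
    """Peak at 1.0 when the word count equals target_len, decaying linearly to 0.0.

    The shape and the default target are hypothetical.
    """
    n_words = len(response.split())
    return max(0.0, 1.0 - abs(n_words - target_len) / target_len)


def surrogate_reward(response: str, length_weight: float = 0.5,
                     target_len: int = 200) -> float:
    """Combine both signals; the format check gates the length bonus."""
    fmt = format_reward(response)
    return fmt * (1.0 + length_weight * length_reward(response, target_len))
```

Gating the length term on the format check is one possible design choice: it keeps a long but malformed response from earning any reward.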
Article 1065
Title@2025-05-26 (1): Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents
Title: Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents | Task Memory Engine: Raumspeicher für robuste, mehrstufige LLM-Agenten | 任务记忆引擎:强力多级LLM代理器的空间内存 2505.19436v1 |
Authors: Ye Ye
Large Language Models (LLMs) falter in multi-step interactions – often hallucinating, repeating actions, or misinterpreting user corrections – due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph – either a tree or directed acyclic graph (DAG) – to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios – trip planning, cooking, meeting scheduling, and shopping cart editing – TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME’s modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME’s codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME’s scalable architecture addresses a critical gap in agent performance across complex, interactive settings.
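To make the idea of dependency-tracked revisions on a task graph concrete, here is a minimal sketch of a DAG in which revising one subtask flags every downstream subtask for re-checking. This is not TME's released code; the `TaskGraph` class, its method names, and the trip-planning values in the usage note are hypothetical.

```python
from collections import defaultdict


class TaskGraph:
    """Toy task DAG: nodes hold subtask values, edges point to dependents."""

    def __init__(self):
        self.value = {}                    # subtask name -> current value
        self.children = defaultdict(set)   # subtask name -> dependent subtasks
        self.stale = set()                 # subtasks invalidated by a revision

    def add(self, task, value, depends_on=()):
        """Register a subtask and link it under each of its prerequisites."""
        self.value[task] = value
        for parent in depends_on:
            self.children[parent].add(task)

    def revise(self, task, new_value):
        """Update a subtask and return all transitive dependents, now marked stale."""
        self.value[task] = new_value
        self.stale.discard(task)
        affected = set()
        stack = list(self.children[task])
        while stack:
            node = stack.pop()
            if node not in affected:
                affected.add(node)
                stack.extend(self.children[node])
        self.stale |= affected
        return affected
```

For example, after adding `destination`, a `flight` and a `hotel` that depend on it, and an independent `budget`, calling `revise("destination", "Rome")` marks only `flight` and `hotel` as stale, whereas a flat context would offer no record of which earlier turns the correction touches.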
Article 1066
Title@2025-05-26 (1): Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection
Title: Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection | Weg zur Vernunft: Adaptives Routing für die LLM und die Strategieauswahl | 原因路线:LLM和理由选择战略的适应性分流 2505.19435v1 |
Authors: Zhihong Pan, Kai Zhang, Yuze Zhao, Yupeng Han
The inherent capabilities of a language model (LM) and the reasoning strategies it employs jointly determine its performance in reasoning tasks. While test-time scaling is regarded as an effective approach to tackling complex reasoning tasks, it incurs substantial computational costs and often leads to “overthinking”, where models become trapped in “thought pitfalls”. To address this challenge, we propose Route-To-Reason (RTR), a novel unified routing framework that dynamically allocates both LMs and reasoning strategies according to task difficulty under budget constraints. RTR learns compressed representations of both expert models and reasoning strategies, enabling their joint and adaptive selection at inference time. This method is low-cost, highly flexible, and can be seamlessly extended to arbitrary black-box or white-box models and strategies, achieving true plug-and-play functionality. Extensive experiments across seven open source models and four reasoning strategies demonstrate that RTR achieves an optimal trade-off between accuracy and computational efficiency among all baselines, achieving higher accuracy than the best single model while reducing token usage by over 60%.
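The joint selection of a model and a reasoning strategy under a budget can be sketched as a small lookup over predicted outcomes. The snippet assumes that some learned predictor, not shown here, has already produced an accuracy and token-cost estimate for each pair; the `route` function, its fallback rule, and the example numbers are illustrative assumptions rather than the RTR implementation.

```python
def route(predictions, budget):
    """Pick the (model, strategy) pair with the best predicted accuracy within budget.

    predictions maps (model, strategy) -> (predicted_accuracy, predicted_tokens).
    If nothing fits the budget, fall back to the cheapest pair (an assumed policy).
    """
    feasible = {pair: est for pair, est in predictions.items() if est[1] <= budget}
    if not feasible:
        return min(predictions, key=lambda pair: predictions[pair][1])
    # Prefer higher accuracy; break ties in favor of fewer tokens.
    return max(feasible, key=lambda pair: (feasible[pair][0], -feasible[pair][1]))


# Hypothetical estimates for one query.
example = {
    ("small", "direct"): (0.60, 100),
    ("small", "cot"): (0.72, 400),
    ("large", "cot"): (0.85, 1500),
}
choice = route(example, budget=500)
```

With a budget of 500 tokens the large model is excluded, so the router settles on the small model with chain-of-thought.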
Article 1067
Title@2025-05-26 (1): One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs
Title: One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs | One-Shot reicht: Konsolidierung von Multi-Turn-Angriffen in effiziente Single-Turn-Prompts für LLMs | 将多发攻击合并为LLMs的高效单发提示 2503.04856v2 |
Authors: Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim
We introduce a novel framework for consolidating multi-turn adversarial “jailbreak” prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates, they demand considerable human effort and time. Our multi-turn-to-single-turn (M2S) methods – Hyphenize, Numberize, and Pythonize – systematically reformat multi-turn dialogues into structured single-turn prompts. Despite removing iterative back-and-forth interactions, these prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods achieve attack success rates from 70.6 percent to 95.9 percent across several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points while cutting token usage by more than half on average. Further analysis shows that embedding malicious requests in enumerated or code-like structures exploits “contextual blindness”, bypassing both native guardrails and external input-output filters. By converting multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale red teaming and reveals critical weaknesses in contemporary LLM defenses.
Article 1068
Title@2025-05-26 (1): Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation | Strategische Markteinblicke mit großen Sprachmodellen ableiten: Ein Benchmark für die vorausschauende kontrafaktische Generation | 具有大语言模式的战略市场展望:前瞻性反实际生成基准 2505.19430v1 |
Authors: Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
Article 1069
Title@2025-05-26 (1): Rhapsody: A Dataset for Highlight Detection in Podcasts
Title: Rhapsody: A Dataset for Highlight Detection in Podcasts | Rhapsody: Ein Datensatz für Highlight-Erkennung in Podcasts | Rhapsody: 用于播客中高亮度探测的数据集 2505.19429v1 |
Authors: Younghan Park, Anuj Diwan, David Harwath, Eunsol Choi
Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube’s ‘most replayed’ feature. We frame the podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight finetuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot performance. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.
Article 1070
Title@2025-05-26 (1): Frictional Agent Alignment Framework: Slow Down and Don’t Break Things
Title: Frictional Agent Alignment Framework: Slow Down and Don’t Break Things | Frictional Agent Alignment Framework: Langsam nach unten und nicht brechen Dinge | 波动剂对齐框架:慢下来,不要打破 2505.19428v1 |
Authors: Abhijnan Nath, Carine Graff, Andrei Bachinin, Nikhil Krishnaswamy
AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware “friction” that prompts for deliberation and re-examination of existing evidence. FAAF’s two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive “thought partners” – not passive responders – FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.
Article 1071
Title@2025-05-26 (1): MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Title: MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision | MAS-ZERO: Konzipieren von Multi-Agenten-Systemen mit Zero Supervision | MAS-ZERO: 设计无监督的多机构系统 2505.14996v2 |
Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs’ strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-evolved design for creating effective and adaptive MAS.
Article 1072
Title@2025-05-26 (1): The Role of Diversity in In-Context Learning for Large Language Models
Title: The Role of Diversity in In-Context Learning for Large Language Models | Die Rolle der Vielfalt im In-Context-Lernen für große Sprachmodelle | 多样性在为大语言模式进行内文学习方面的作用 2505.19426v1 |
Authors: Wenyang Xiao, Haoyu Zhao, Lingxiao Huang
In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
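One common way to trade similarity against diversity when picking in-context examples is greedy maximal marginal relevance (MMR). The sketch below uses MMR as a stand-in: the abstract does not say which diversity-aware methods were evaluated, and the embedding vectors, the `lam` trade-off parameter, and the function names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, candidates, k, lam=0.5):
    """Greedy MMR: return indices of k candidates balancing relevance and diversity.

    lam=1.0 reduces to pure similarity ranking; lower values penalize
    candidates that resemble examples already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a query embedding of `[1.0, 0.0]` and candidates `[[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]`, pure similarity (`lam=1.0`) picks the two near-duplicates, while `lam=0.3` swaps the second pick for the dissimilar third candidate.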
Article 1073
Title@2025-05-26 (1): Each Graph is a New Language: Graph Learning with LLMs
Title: Each Graph is a New Language: Graph Learning with LLMs | Jeder Graph ist eine neue Sprache: Graph Learning mit LLMs | 每图都是一种新语言:用LLMM学习图表 2501.11478v3 |
Authors: Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.
Article 1074
Title@2025-05-26 (1): Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers
Title: Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers | Three Minds, One Legend: Jailbreak Large Reasoning Model mit adaptiven Stacked Ciphers | 三个心灵,一个传说:监狱破裂大型理性模型,有适应性堆叠加密码 2505.16241v3 |
Authors: Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, Luu Anh Tuan
Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model’s reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies – random and adaptive – that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
Article 1075
Title@2025-05-26 (1): Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering
Title: Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering | Selbstreflektierende Planung mit Wissensgraphen: Verbesserung der LLM-Begründetheit bei der Beantwortung von Fragen | 带有知识图的自反规划:加强LLM 问题解答的可靠性 2505.19410v1 |
Authors: Jiajun Zhu, Ye Liu, Meikai Bao, Kai Zhang, Yanghai Zhang, Qi Liu
Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.
Article 1076
Title@2025-05-26 (1): CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems
Title: CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems | CoTGuard: Mit Chain-of-Thought-Triggering für Urheberrechtsschutz in Multi-Agent LLM-Systemen | COTGuard: 利用探索链在多个高级LLM系统中启动版权保护 2505.19405v1 |
Authors: Yan Wen, Junfeng Guo, Heng Huang
As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.
Article 1077
Title@2025-05-26 (1): Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs
Title: Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs | Können LLMs helfen, Erkenntnisse über LLMs zu enthüllen? Eine groß angelegte, sich entwickelnde Literaturanalyse von Frontier LLMs | LLMs 帮助发现关于LLM的见识? 大型、不断发展的前沿LMS文学分析 2502.18791v3 |
Authors: Jungsoo Park, Junmo Kang, Gabriel Stanovsky, Alan Ritter
The surge of LLM studies makes synthesizing their findings challenging. Analysis of experimental results from literature can uncover important trends across studies, but the time-consuming nature of manual data extraction limits its use. Our study presents a semi-automated approach for literature analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset, LLMEvalDB. We then conduct an automated literature analysis of frontier LLMs, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate LLMEvalDB by showing that it reproduces key findings from a recent manual analysis of Chain-of-Thought (CoT) reasoning and also uncovers new insights that go beyond it, showing, for example, that in-context examples benefit coding & multimodal tasks but offer limited gains in math reasoning tasks compared to zero-shot CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through LLMEvalDB and empirical analysis, we provide insights into LLMs while facilitating ongoing literature analyses of their behavior.
Article 1078
Title@2025-05-26 (1): ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL
Title: ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL | ROUTE: Robustes Multitask Tuning und Zusammenarbeit für Text-zu-SQL | ROUTE: 文本到 SQL 的强有力的多任务调试和协作 2412.10138v3 |
Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye
Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model’s understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.
Article 1079
Title@2025-05-26 (1): What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context for Multi-Hop QA
Title: What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context for Multi-Hop QA | Welches externe Wissen wird von LLMs bevorzugt? Charakterisieren und Erforschen von Beweiskette im unvollkommenen Kontext für Multi-Hop QA | 普惠制普惠制普惠制普惠制普惠制所偏爱的外部知识是什么? 2412.12632v3 |
Authors: Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Qing Wang, Yihao Huang, Yang Liu
Incorporating external knowledge has emerged as a promising way to mitigate outdated knowledge and hallucinations in LLMs. However, external knowledge is often imperfect, encompassing substantial extraneous or even inaccurate content, which interferes with the LLM’s utilization of useful knowledge in the context. This paper seeks to characterize the features of preferred external knowledge and perform empirical studies in imperfect contexts. Inspired by the chain of evidence (CoE), we posit that the knowledge preferred by LLMs should maintain both relevance to the question and mutual support among the textual pieces. Accordingly, we propose a CoE discrimination approach and conduct a comparative analysis between CoE and Non-CoE samples across significance, deceptiveness, and robustness, revealing the LLM’s preference for external knowledge that aligns with CoE features. Furthermore, we select three representative tasks (RAG-based multi-hop QA, external knowledge poisoning, and poisoning defense), along with corresponding SOTA or prevalent baselines. By integrating CoE features, the resulting variants achieve significant improvements over the original baselines.
Article 1080
Title@2025-05-26 (1): Simple and Effective Baselines for Code Summarisation Evaluation
Title: Simple and Effective Baselines for Code Summarisation Evaluation | Einfache und effektive Grundlagen für die Code-Summarisation-Bewertung | 用于代码摘要评价的简单有效基线 2505.19392v1 |
Authors: Jade Robinson, Jonathan K. Kummerfeld
Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
Article 1081
Title@2025-05-26 (1): gec-metrics: A Unified Library for Grammatical Error Correction Evaluation
Title: gec-metrics: A Unified Library for Grammatical Error Correction Evaluation | gec-metrics: Eine einheitliche Bibliothek für die Bewertung der grammatischen Fehlerkorrektur | 几何:一个用于校正校正错误校正评价的统一图书馆 2505.19388v1 |
Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.
Article 1082
Title@2025-05-26 (1): SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Title: SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence | SelfElicit: Ihr Sprachmodell weiß geheim, wo die relevanten Beweise sind | 自 己: 您的语言模型秘密知道相关证据在哪里 2502.08767v2 |
Authors: Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, Hanghang Tong
Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
Article 1083
Title@2025-05-26 (1): GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor
Title: GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor | GSA-TTS : Auf dem Weg zur Null-Schuss-Sprachsynthese auf Basis eines graduellen Style-Adapters | GSA-TTS:在渐进式样调适器基础上实现零热话合成 2505.19384v1 |
Authors: Seokgi Lee, Jungjun Kim
We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.
Article 1084
Title@2025-05-26 (1): JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment
Title: JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment | JingFang: Ein sachverständiges Sprachmodell für die traditionelle chinesische Medizin Klinische Beratung und Syndromdifferenzierungsbasierte Behandlung | JingFang:中国传统医学临床咨询和综合症差别治疗专家级大语言模式 2502.04345v2 |
Authors: Yehan Yang, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
The effective application of traditional Chinese medicine (TCM) requires extensive knowledge of TCM and clinical experience. The emergence of Large Language Models (LLMs) provides a solution to this, but existing LLMs for TCM exhibit critical limitations: incomplete clinical consultation and diagnoses, as well as inaccurate syndrome differentiation. To address these issues, we establish JingFang (JF), a novel TCM LLM that demonstrates expert-level clinical consultation and syndrome differentiation. We propose a Multi-Agent Collaborative Chain-of-Thought Mechanism (MACCTM) for comprehensive and targeted clinical consultation, giving JF effective and accurate diagnostic ability. In addition, a Syndrome Agent and a Dual-Stage Recovery Scheme (DSRS) are developed to improve the accuracy of syndrome differentiation and the corresponding treatment. JingFang not only facilitates the application of LLMs but also promotes the effective application of TCM for healthcare.
Article 1085
Title@2025-05-26 (1): Identifying Knowledge Editing Types in Large Language Models
Title: Identifying Knowledge Editing Types in Large Language Models | Identifikation von Wissensbearbeitungstypen in großen Sprachmodellen | 确定大语言模式中的知识编辑类型 2409.19663v3 |
Authors: Xiaopeng Li, Shasha Li, Shangwen Wang, Shezheng Song, Bin Ji, Huijun Liu, Jun Ma, Jie Yu
Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs), attracting increasing attention in recent years. However, there is a lack of effective measures to prevent the malicious misuse of this technique, which could lead to harmful edits in LLMs. These malicious modifications could cause LLMs to generate toxic content, misleading users into inappropriate actions. To address this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), aimed at identifying different types of edits in LLMs, thereby providing timely alerts to users when encountering illicit edits. As part of this task, we propose KETIBench, which includes five types of harmful edits covering the most popular toxic types, as well as one benign factual edit. We develop five classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Our experimental results, across 92 trials involving four models and three knowledge editing methods, demonstrate that all eight baseline identifiers achieve decent identification performance, highlighting the feasibility of identifying malicious edits in LLMs. Additional analyses reveal that the performance of the identifiers is independent of the reliability of the knowledge editing methods and exhibits cross-domain generalization, enabling the identification of edits from unknown sources. All data and code are available at https://github.com/xpq-tech/KETI.
Article 1086
Title@2025-05-26 (1): Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality
Title: Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality | Glaube Attribution als mentale Erklärung: Die Rolle der Genauigkeit, Informatizität und Kausalität | 信仰归属作为精神解释:准确性、信息化和因果关系的作用 2505.19376v1 |
Authors: Lance Ying, Almog Hillel, Ryan Truong, Vikash K. Mansinghka, Joshua B. Tenenbaum, Tan Zhi-Xuan
A key feature of human theory-of-mind is the ability to attribute beliefs to other agents as mentalistic explanations for their behavior. But given the wide variety of beliefs that agents may hold about the world and the rich language we can use to express them, which specific beliefs are people inclined to attribute to others? In this paper, we investigate the hypothesis that people prefer to attribute beliefs that are good explanations for the behavior they observe. We develop a computational model that quantifies the explanatory strength of a (natural language) statement about an agent’s beliefs via three factors: accuracy, informativity, and causal relevance to actions, each of which can be computed from a probabilistic generative model of belief-driven behavior. Using this model, we study the role of each factor in how people selectively attribute beliefs to other agents. We investigate this via an experiment where participants watch an agent collect keys hidden in boxes in order to reach a goal, then rank a set of statements describing the agent’s beliefs about the boxes’ contents. We find that accuracy and informativity perform reasonably well at predicting these rankings when combined, but that causal relevance is the single factor that best explains participants’ responses.
Article 1087
Title@2025-05-26 (1): MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
Title: MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations | MOSAIC: Modellierung sozialer KI für die Verbreitung von Inhalten und Regulierung in Multi-Agent-Simulationen | MOSAIC:多机构模拟中内容传播和监管模拟社会AI 2504.07830v2 |
Authors: Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, Saadia Gabriel
We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents’ articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.
Article 1088
Title@2025-05-25 (7): ChartLens: Fine-grained Visual Attribution in Charts
Title: ChartLens: Fine-grained Visual Attribution in Charts | ChartLens: Feinkörnige visuelle Zuordnung in Charts | 图表边:图表中精细的可视属性 2505.19360v1 |
Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
Article 1089
Title@2025-05-25 (7): Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Title: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval | Optimierte Text-Embedding-Modelle und Benchmarks für die Amharische Passage Retrieval | 阿姆光通过通过检索的最佳文本嵌入模型和基准 2505.19356v1 |
Authors: Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
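For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how MRR@10 and Recall@10 are commonly computed, plus the arithmetic behind a relative improvement. It is not taken from the authors' codebase; the function names and the toy document IDs are hypothetical, and the abstract does not state whether the 9.86% Recall@10 gain is relative or absolute.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def recall_at_k(ranked_lists, relevant_sets, k=10):
    """Average fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits = len(set(ranked[:k]) & relevant)
        total += hits / len(relevant)
    return total / len(ranked_lists)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.176 for a 17.6% gain."""
    return (new - baseline) / baseline


# Toy data (hypothetical): two queries with ranked results and gold labels.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
gold = [{"d1"}, {"d4", "d8"}]

print(mrr_at_k(ranked, gold))
print(recall_at_k(ranked, gold))
```

In the toy data the first query finds its relevant passage at rank 2 and the second at rank 3, so the two reciprocal ranks being averaged are 1/2 and 1/3.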
Article 1090
Title@2025-05-25 (7): Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Social Media Engagement
Title: Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Social Media Engagement | Schätzung des Online-Einflusses braucht kausale Modellierung! Gegenfaktische Analyse von Social Media Engagement | 估计在线影响需求因果建模:反事实分析社会媒体参与 2505.19355v1 |
Authors: Lin Tian, Marian-Andrei Rizoiu
Understanding true influence in social media requires distinguishing correlation from causation, particularly when analyzing misinformation spread. While existing approaches focus on exposure metrics and network structures, they often fail to capture the causal mechanisms by which external temporal signals trigger engagement. We introduce a novel joint treatment-outcome framework that leverages existing sequential models to simultaneously adapt to both policy timing and engagement effects. Our approach adapts causal inference techniques from healthcare to estimate Average Treatment Effects (ATE) within the sequential nature of social media interactions, tackling challenges from external confounding signals. Through our experiments on real-world misinformation and disinformation datasets, we show that our models outperform existing benchmarks by 15–22% in predicting engagement across diverse counterfactual scenarios, including exposure adjustment, timing shifts, and varied intervention durations. Case studies on 492 social media users show our causal effect measure aligns strongly with the gold standard in influence estimation, the expert-based empirical influence.
Article 1091
Title@2025-05-25 (7): Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Title: Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning | Datumsfragmente: Ein versteckter Engpass an Tokenisierung für zeitliche Vernunft | 日期碎片: 用于时间原因的 托肯化的隐藏瓶头 2505.16088v2 |
Authors: Gagan Bhatia, Maxime Peyrard, Wei Zhao
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 $\rightarrow$ 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year $\rightarrow$ month $\rightarrow$ day). Our datasets and code are made publicly available at https://github.com/gagan3012/date-fragments.
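The abstract does not give the formula for the date fragmentation ratio, so the sketch below is only one plausible simplification, not the authors' metric: it counts the fraction of date components (year, month, day) that do not survive as a single intact token. The function name and the "exactly one token" criterion are assumptions made for illustration.

```python
def fragmentation_ratio(tokens, components):
    """Fraction of date components that are NOT preserved as one whole token.

    tokens:     the tokenizer's pieces, e.g. ["202", "503", "12"]
    components: the intended date parts, e.g. ["2025", "03", "12"]
    Both lists must spell out the same string when concatenated.
    """
    if "".join(tokens) != "".join(components):
        raise ValueError("tokens and components must cover the same text")

    # Character span (start, end) occupied by each token.
    token_spans = set()
    pos = 0
    for tok in tokens:
        token_spans.add((pos, pos + len(tok)))
        pos += len(tok)

    # A component counts as broken unless some token covers exactly its span.
    broken = 0
    pos = 0
    for comp in components:
        if (pos, pos + len(comp)) not in token_spans:
            broken += 1
        pos += len(comp)
    return broken / len(components)


# The split quoted in the abstract: 20250312 -> 202, 503, 12
print(fragmentation_ratio(["202", "503", "12"], ["2025", "03", "12"]))
# A tokenizer that keeps year, month, and day intact
print(fragmentation_ratio(["2025", "03", "12"], ["2025", "03", "12"]))
```

Under this simplified definition, the split from the abstract breaks the year and the month but keeps the day, while a component-aligned split breaks nothing.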
Article 1092
Title@2025-05-25 (7): GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
Title: GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | GC-KBVQA: Ein neues Vier-Stufen-Framework zur Verbesserung der wissensbasierten visuellen Frageantwortleistung | GC-KKBVQA:加强基于知识的视觉回答问题业绩的四步新框架 2505.19354v1 |
Authors: Mohammad Mahdi Moradi, Sudhir Mudur
Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.
Article 1093
Title@2025-05-25 (7): Optimizing Decomposition for Optimal Claim Verification
Title: Optimizing Decomposition for Optimal Claim Verification | Optimierung der Zersetzung für eine optimale Prüfung des Anspruchs | 优化最佳索赔核实的分解 2503.15354v2 |
Authors: Yining Lu, Noah Ziems, Hy Dang, Meng Jiang
Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity – a novel metric quantifying information density – leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.
Article 1094
Title@2025-05-25 (7): Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation
Title: Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation | Architekturen des Irrtums: Eine philosophische Untersuchung der KI- und menschlichen Code-Generation | 错误结构结构:对大赦国际和人类代码生成的哲学调查 2505.19353v1 |
Authors: Camilo Chacón Sartori
With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct “Architectures of Error” to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett’s mechanistic functionalism and Rescher’s methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi’s levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI’s unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.
Article 1095
Title@2025-05-25 (7): PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
Title: PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims | PatentScore: Mehrdimensionale Bewertung von LLM-generierten Patentansprüchen | 专利核心:对LLM-专利专利权主张的多维评价 2505.19345v1 |
Authors: Yongmin Yoo, Qiongkai Xu, Longbing Cao
Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of $r = 0.819$ with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.
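The agreement figure quoted above is a Pearson correlation between metric scores and expert annotations. As a reminder of what that statistic measures, here is a minimal sketch of the standard formula; the function name is hypothetical and the sample scores are invented purely for illustration, not drawn from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: automatic scores vs. expert ratings for five claims.
metric_scores = [0.62, 0.71, 0.40, 0.88, 0.55]
expert_scores = [3.0, 4.0, 2.0, 5.0, 3.0]
print(pearson_r(metric_scores, expert_scores))
```

A value near 1 means the metric ranks and spaces the claims much as the experts do; it says nothing about whether the two scales share the same units.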
Article 1096
Title@2025-05-25 (7): LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs
Title: LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs | LLM-basiertes Prompt-Ensemble für zuverlässige medizinische Entitätserkennung von EHRs | 以LLM为基础,从EHRs为可靠医疗实体识别而迅速加入 2505.08704v2 |
Authors: K M Sajjadul Islam, Ayesha Siddika Nipu, Jiawei Wu, Praveen Madiraju
Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.
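The aggregation step can be illustrated with a small majority-voting sketch. This is not the authors' pipeline: the embedding-based similarity matching mentioned in the abstract is replaced here by plain case and whitespace normalization, and the function names, the strict-majority threshold, and the sample spans are all assumptions.

```python
from collections import Counter


def normalize(span):
    """Stand-in for embedding-based matching: lowercase, collapse whitespace."""
    return " ".join(span.lower().split())


def ensemble_vote(prompt_outputs, min_votes=None):
    """Keep an entity only if one label wins at least `min_votes` prompts.

    prompt_outputs: list of dicts mapping entity span -> predicted label,
                    one dict per prompt variant.
    min_votes:      defaults to a strict majority of the prompts.
    """
    if min_votes is None:
        min_votes = len(prompt_outputs) // 2 + 1

    votes = {}
    for output in prompt_outputs:
        for span, label in output.items():
            votes.setdefault(normalize(span), Counter())[label] += 1

    result = {}
    for span, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count >= min_votes:
            result[span] = label
    return result


# Hypothetical outputs from three prompt variants on one clinical sentence.
outputs = [
    {"chest pain": "problem", "ECG": "test"},
    {"Chest  pain": "problem", "aspirin": "treatment"},
    {"chest pain": "problem", "ecg": "test", "aspirin": "problem"},
]
print(ensemble_vote(outputs))
```

Here "chest pain" and "ecg" reach a majority, while "aspirin" is dropped because its two votes disagree on the label.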
Article 1097
Title@2025-05-25 (7): Regress, Don’t Guess – A Regression-like Loss on Number Tokens for Language Models
Title: Regress, Don’t Guess – A Regression-like Loss on Number Tokens for Language Models | Regress, nicht raten – Ein Rückschritt-ähnlicher Verlust an Zahlenzeichen für Sprachmodelle | Regress, don’t guess - 语言模型数字调的回归式损失 2411.02083v2 |
Authors: Jonas Zausinger, Lars Pennig, Anamarija Kozina, Sean Sdahl, Julian Sikora, Adrian Dendorfer, Timofey Kuznetsov, Mohamad Hagog, Nina Wiedemann, Kacper Chlodny, Vincent Limbach, Anna Ketteler, Thorben Prein, Vishwa Mohan Singh, Michael Morris Danziger, Jannis Born
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we present a regression-like loss that operates purely at the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating at the token level. Finally, we scale NTL up to 3B parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives. The code is available via: https://tum-ai.github.io/number-token-loss/
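To make the contrast with Cross Entropy concrete, the sketch below shows one way the two flavors could look for a single digit position: a squared-error variant on the expected numeric value, and a Wasserstein-1 variant against a one-hot target. This is an illustrative reading of the abstract, not the released implementation; the function names, the restriction to digit tokens 0-9, and the toy logits are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Standard loss: only the probability of the exact target token matters."""
    return -math.log(softmax(logits)[target_index])


def ntl_mse(logits, token_values, target_value):
    """Squared error between the expected numeric value and the target value."""
    probs = softmax(logits)
    expected = sum(p * v for p, v in zip(probs, token_values))
    return (expected - target_value) ** 2


def ntl_wasserstein(logits, token_values, target_value):
    """Wasserstein-1 distance to a point mass at the target: E|value - target|."""
    probs = softmax(logits)
    return sum(p * abs(v - target_value) for p, v in zip(probs, token_values))


def peaked_logits(index, size=10):
    """Toy logits that put almost all probability mass on one digit."""
    logits = [-10.0] * size
    logits[index] = 5.0
    return logits


digit_values = [float(d) for d in range(10)]  # numeric value of tokens "0".."9"
target = 3                                     # the true digit
near, far = peaked_logits(4), peaked_logits(9)

# Cross Entropy treats both wrong guesses (4 and 9) as equally bad ...
print(cross_entropy(near, target), cross_entropy(far, target))
# ... while both number-aware losses penalize 9 more than 4.
print(ntl_mse(near, digit_values, target), ntl_mse(far, digit_values, target))
print(ntl_wasserstein(near, digit_values, target),
      ntl_wasserstein(far, digit_values, target))
```

In training, a term like this would be added to the usual Cross Entropy objective for number tokens only; how the two terms are weighted is not specified in the abstract.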
Article 1098
Title@2025-05-25 (7): Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
Title: Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data? | Sind Transformer durch die Verbindung getrennter Kenntnisse in Trainingsdaten in der Lage, Vernunft zu erreichen? | 将培训数据方面的单独知识连接起来的变换者是否具有理性? 2501.15857v6 |
Authors: Yutong Yin, Zhaoran Wang
Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns $B = f(A)$ from one source and $C = g(B)$ from another, they can deduce $C = g(B) = g(f(A))$ even without encountering $ABC$ together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, “FTCT” (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.
Article 1099
Title@2025-05-25 (7): Patent-CR: A Dataset for Patent Claim Revision
Title: Patent-CR: A Dataset for Patent Claim Revision | Patent-CR: Ein Datensatz für Patentanspruchsrevision | 专利专利权:专利权索赔修订数据集 2412.02549v2 |
Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz
This paper presents Patent-CR, the first dataset created for the patent claim revision task in English. It includes both initial patent applications rejected by patent examiners and the final granted versions. Unlike normal text revision tasks that predominantly focus on enhancing sentence quality, such as grammar correction and coherence improvement, patent claim revision aims at ensuring the claims meet stringent legal criteria. These criteria go beyond novelty and inventiveness and include clarity of scope, technical accuracy, language precision, and legal robustness. We assess various large language models (LLMs) through professional human evaluation, including general LLMs with different sizes and architectures, text revision models, and domain-specific models. Our results indicate that LLMs often make ineffective edits that deviate from the target revisions. In addition, domain-specific models and the method of fine-tuning show promising results. Notably, GPT-4 outperforms other tested LLMs, but further revisions are still necessary to reach the examination standard. Furthermore, we demonstrate the inconsistency between automated and human evaluation results, and find that GPT-4-based automated evaluation has the highest correlation with human judgment. This dataset, along with our preliminary empirical research, offers invaluable insights for further exploration in patent claim revision.
nan
Article 1100
Title@2025-05-25 (7): ODIN: A NL2SQL Recommender to Handle Schema Ambiguity
Title: ODIN: A NL2SQL Recommender to Handle Schema Ambiguity | ODIN: Ein NL2SQL-Empfänger zum Umgang mit Schema-Ambiguität | ODIN: 处理 Schema 模糊性的NL2SQL建议 2505.19302v1 |
Authors: Kapil Vaidya, Abishek Sankararaman, Jialin Ding, Chuan Lei, Xiao Qin, Balakrishnan Narayanaswamy, Tim Kraska
NL2SQL (natural language to SQL) systems translate natural language into SQL queries, allowing users with no technical background to interact with databases and create tools like reports or visualizations. While recent advancements in large language models (LLMs) have significantly improved NL2SQL accuracy, schema ambiguity remains a major challenge in enterprise environments with complex schemas, where multiple tables and columns with semantically similar names often co-exist. To address schema ambiguity, we introduce ODIN, a NL2SQL recommendation engine. Instead of producing a single SQL query given a natural language question, ODIN generates a set of potential SQL queries by accounting for different interpretations of ambiguous schema components. ODIN dynamically adjusts the number of suggestions based on the level of ambiguity, and ODIN learns from user feedback to personalize future SQL query recommendations. Our evaluation shows that ODIN improves the likelihood of generating the correct SQL query by 1.5-2$\times$ compared to baselines.
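The idea of returning a set of queries rather than a single one can be sketched as follows. This is an illustration under stated assumptions, not ODIN's algorithm: the template, the `orders` table, and all column names are hypothetical, and the sketch simply enumerates every combination of candidate columns for the ambiguous slots.

```python
from itertools import product


def candidate_queries(template, candidates):
    """Return one SQL string per combination of candidate columns.

    `template` uses str.format placeholders; `candidates` maps each
    placeholder to the schema columns it could plausibly refer to.
    """
    slots = sorted(candidates)
    combos = product(*(candidates[slot] for slot in slots))
    return [template.format(**dict(zip(slots, combo))) for combo in combos]


# Hypothetical schema: two columns could mean "revenue", two could mean "date".
template = "SELECT SUM({revenue}) FROM orders WHERE {date} >= '2024-01-01'"
ambiguous = {
    "revenue": ["net_revenue", "gross_revenue"],
    "date": ["order_date", "ship_date"],
}
unambiguous = {"revenue": ["net_revenue"], "date": ["order_date"]}

for query in candidate_queries(template, ambiguous):
    print(query)
```

With this enumeration the number of suggestions grows with the amount of ambiguity: the unambiguous mapping yields a single query, while two ambiguous slots with two candidates each yield four. A real system would also need to rank and prune the candidates, which this sketch does not attempt.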
nan
Article 1101
Title@2025-05-25 (7): SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking
Title: SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking | SituatedThinker: LLM-Grundlegung mit Real-World durch Situated Thinking | 地势感知者:通过地势思维将LLM定位在现实世界中 2505.19300v1 |
Authors: Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs’ access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at https://github.com/jnanliu/SituatedThinker.
nan
Article 1102
Title@2025-05-25 (7): Can Large Language Models Generate High-quality Patent Claims?
Title: Can Large Language Models Generate High-quality Patent Claims? | Können große Sprachmodelle hochwertige Patentansprüche generieren? | 大语言模型能否产生高质量的专利索赔? 2406.19465v3 |
Authors: Lekang Jiang, Caiqi Zhang, Pascal A Scherz, Stephan Goetz
Large language models (LLMs) have shown exceptional performance across various text generation tasks but remain under-explored in the patent domain, which offers highly structured and precise language. This paper constructs a dataset to investigate the performance of current LLMs in patent claim generation. Our results demonstrate that generating claims based on patent descriptions outperforms previous research relying on abstracts. Interestingly, current patent-specific LLMs perform much worse than state-of-the-art general LLMs, highlighting the necessity for future research on in-domain LLMs. We also find that LLMs can produce high-quality first independent claims, but their performances markedly decrease for subsequent dependent claims. Moreover, fine-tuning can enhance the completeness of inventions’ features, conceptual clarity, and feature linkage. Among the tested LLMs, GPT-4 demonstrates the best performance in comprehensive human evaluations by patent experts, with better feature coverage, conceptual clarity, and technical coherence. Despite these capabilities, comprehensive revision and modification are still necessary to pass rigorous patent scrutiny and ensure legal robustness.
nan
Article 1103
Title@2025-05-25 (7): Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Title: Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions | Nicht-Just-Scaling-Gesetze: Auf dem Weg zu einem besseren Verständnis der Auswirkungen von Sprachmodellgestaltungsentscheidungen | 《非公正衡量法律:更好地了解语言设计示范设计决定下游影响》 2503.03862v2 |
Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
nan
Article 1104
Title@2025-05-25 (7): A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations
Title: A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations | Ein notwendiger Schritt zur Treue: Konsistenz in Freitexterklärungen messen und verbessern | 迈向信仰的必要步骤:衡量和增进自由解释中的一致性 2505.19299v1 |
Authors: Lingjun Zhao, Hal Daumé III
Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging for language models to generate and for humans to assess. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
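For reference, the classical weight of evidence that the measure builds on can be written as below. Reading $h$ as the prediction and $e$ as the explanation is an assumption made here for illustration; the abstract does not spell out how the paper extends the concept to free text.

```latex
\[
\operatorname{woe}(h : e)
  \;=\; \log \frac{P(e \mid h)}{P(e \mid \neg h)}
  \;=\; \log \frac{P(h \mid e)}{P(\neg h \mid e)} \;-\; \log \frac{P(h)}{P(\neg h)}
\]
```

The second form follows from Bayes' rule: the weight of evidence is the change in the log-odds of $h$ once $e$ is taken into account. A positive value means $e$ supports $h$, a negative value means $e$ opposes it, and zero means $e$ is uninformative about $h$.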
nan
Article 1105
Title@2025-05-25 (7): Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data
Title: Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data | Prompting ist nicht alles, was Sie brauchen! Bewertung LLM Agent Simulation Methoden mit Real-World Online Kunden Verhalten Daten | 提示并非你所需要的全部! 使用真实世界在线客户行为数据评估LLM代理模拟方法 2503.20749v5 |
Authors: Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen Wang, Qi He, Dakuo Wang
Recent research shows that LLMs can simulate “believable” human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating LLMs’ objective “accuracy” rather than the subjective “believability” in simulating human behavior, leveraging a large-scale, real-world dataset collected from customers’ online shopping actions. We present the first comprehensive evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web shopping action generation. Our results show that out-of-the-box LLM-generated actions are often misaligned with actual human behavior, whereas fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate accurate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work evaluates state-of-the-art LLMs in behavior simulation and provides actionable insights into how real-world action data can enhance the fidelity of LLM agents.
nan
Article 1106
Title@2025-05-25 (7): Towards Reliable Large Audio Language Model
Title: Towards Reliable Large Audio Language Model | Zuverlässiges großes Audio-Sprachenmodell | 努力实现可靠的大型音频语言模式 2505.19294v1 |
Authors: Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and proactively refuse to answer questions they cannot answer. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). In addition, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliability methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a “meta ability”, which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.
nan
Article 1107
Title@2025-05-25 (7): 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Title: 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? | 100-LongBench: Sind de facto Long-Context-Benchmarks wortwörtlich die Lang-Context-Fähigkeit zu bewerten? | 100-LongBench:事实上的长文本基准是否实际评价长文本能力? 2505.19293v1 |
Authors: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks – e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model’s baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
nan
Article 1108
Title@2025-05-25 (7): Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning
Title: Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning | Ausbalancieren von Wahrhaftigkeit und Aufklärung mit unsicherer Anleitung Feintuning | 平衡真实和知情与不确定性软件指示 2502.11962v2 |
Authors: Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness. This trade-off arises because IFT steers LLMs to generate responses containing long-tail knowledge that was not well covered during pre-training. As a result, models become more informative but less accurate when generalizing to unseen tasks. In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs, and we introduce two new IFT paradigms, $UNIT_{cut}$ and $UNIT_{ref}$, to address this issue. $UNIT_{cut}$ identifies and removes unfamiliar knowledge from IFT datasets to mitigate its impact on model truthfulness, whereas $UNIT_{ref}$ trains LLMs to recognize their uncertainty and explicitly indicate it at the end of their responses. Our experiments show that $UNIT_{cut}$ substantially improves LLM truthfulness, while $UNIT_{ref}$ maintains high informativeness and reduces hallucinations by distinguishing between confident and uncertain statements.
nan
Article 1109
Title@2025-05-25 (7): Next Token Prediction Is a Dead End for Creativity
Title: Next Token Prediction Is a Dead End for Creativity | Nächster Token Prediction ist ein totes Ende für Kreativität | 下个 Tok 预测是创造性的死胡同 2505.19277v1 |
Authors: Ibukun Olatunji, Mark Sheppard
This paper argues that token prediction is fundamentally misaligned with real creativity. While next-token models have enabled impressive advances in language generation, their architecture favours surface-level coherence over spontaneity, originality, and improvisational risk. We use battle rap as a case study to expose the limitations of predictive systems, demonstrating that they cannot truly engage in adversarial or emotionally resonant exchanges. By reframing creativity as an interactive process rather than a predictive output, we offer a vision for AI systems that are more expressive, responsive, and aligned with human creative practice.
nan
Article 1110
Title@2025-05-25 (7): TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding
Title: TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding | TheoremExplainAgent: Auf dem Weg zu videobasierten multimodalen Erklärungen für LLM-Theorem-Verständnis | 理论专家:争取为LLM理论理解提供基于视频的多式解释 2502.19400v2 |
Authors: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
nan
Article 1111
Title@2025-05-25 (7): A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
Title: A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations | Ein individueller Gesprächs-Benchmark: Auf dem Weg zur Simulation personalisierter Gespräche | 个人对话基准:模拟个人对话 2505.14106v2 |
Authors: Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang, Xiangliang Zhang, Philip S. Yu, Xiyang Hu, Yue Zhao
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
nan
Article 1112
Title@2025-05-25 (7): Unveiling Dual Quality in Product Reviews: An NLP-Based Approach
Title: Unveiling Dual Quality in Product Reviews: An NLP-Based Approach | Enthüllung von Dual Quality in Produktbewertungen: Ein NLP-basierter Ansatz | 产品审查中不固定的双重质量:基于NLP的方法 2505.19254v1 |
Authors: Rafał Poświata, Marcin Michał Mirończuk, Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz
Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 highlighting dual quality issues. We then discuss experiments with various approaches like SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.
nan
Article 1113
Title@2025-05-25 (7): Do Vision-Language Models Really Understand Visual Language?
Title: Do Vision-Language Models Really Understand Visual Language? | Verstehen Vision-Language-Modelle wirklich visuelle Sprache? | 视觉语言模型真的理解视觉语言吗? 2410.00193v3 |
Authors: Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan
Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
nan
Article 1114
Title@2025-05-25 (7): Rethinking Chain-of-Thought from the Perspective of Self-Training
Title: Rethinking Chain-of-Thought from the Perspective of Self-Training | Überdenken der Gedankenkette aus der Perspektive des Selbst-Trainings | 从自我培训的角度重新思考一系列问题 2412.10827v4 |
Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that CoT reasoning and self-training share the same core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.
nan
Article 1115
Title@2025-05-25 (7): PATS: Process-Level Adaptive Thinking Mode Switching
Title: PATS: Process-Level Adaptive Thinking Mode Switching | PATS: Prozess-Level-Adaptive-Denkmodus-Umschaltung | PATS: 进程层面适应性思维模式转换 2505.19250v1 |
Authors: Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang
Current large-language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.
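A step-level switching rule of the kind described above might look like the following sketch. Everything specific in it is an assumption made for illustration: the three modes are modeled as beam widths, the thresholds are arbitrary, and `next_mode` is a hypothetical helper rather than the authors' procedure. The score is assumed to come from a process reward model that rates the step just generated on a 0-1 scale.

```python
# Hypothetical thinking modes, modeled as beam widths: simple, medium, complex.
MODES = [2, 4, 8]


def next_mode(mode, step_score, bad=0.2, low=0.4, high=0.8):
    """Pick the mode index for the next reasoning step.

    `mode` is the current index into MODES and `step_score` is the
    (assumed) process-reward score of the step just generated.
    """
    hardest = len(MODES) - 1
    if step_score < bad:
        return hardest                   # bad-step penalty: jump to the most complex mode
    if step_score < low:
        return min(mode + 1, hardest)    # progressive switch: one level more complex
    if step_score > high:
        return max(mode - 1, 0)          # easy step: relax one level to save tokens
    return mode                          # otherwise keep the current mode


# Walk through a made-up sequence of step scores, starting in the simple mode.
mode = 0
for score in [0.9, 0.35, 0.1, 0.85]:
    mode = next_mode(mode, score)
    print(score, "-> beam width", MODES[mode])
```

The design point the sketch tries to capture is that the compute budget is adjusted per step rather than per problem, so an easy problem with one hard step only pays for the wider search at that step.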
nan
Article 1116
Title@2025-05-25 (7): ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty
Title: ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty | VergleichQA: Bewertung der Faktizität Robustheit von LLMs durch Wissensfrequenzkontrolle und Unsicherheit | 比较QA:通过知识频率控制和不确定性评估LLMs的实际情况 2412.20251v2 |
Authors: Qing Zong, Zhaowei Wang, Tianshi Zheng, Xiyu Ren, Yangqiu Song
The rapid development of LLMs has sparked extensive research into their factual knowledge. Current works find that LLMs fall short on questions around low-frequency entities. However, such evidence is unreliable since the questions can differ not only in entity frequency but also in difficulty. We therefore introduce the ComparisonQA benchmark, containing 283K abstract questions, each instantiated by a pair of high-frequency and low-frequency entities. Because the two questions in a pair differ only in the entity, and hence in its frequency, the benchmark ensures a controllable comparison for studying the role of knowledge frequency in the performance of LLMs. In addition, we use both correctness and uncertainty to develop a two-round method to evaluate LLMs’ knowledge robustness. It aims to avoid possible semantic shortcuts, which are a serious problem in current QA studies. Experiments reveal that LLMs, including GPT-4o, exhibit particularly low robustness regarding low-frequency knowledge. Besides, we find that uncertainty can be used to effectively identify high-quality and shortcut-free questions while maintaining the data size. Based on this, we propose an automatic method to select such questions to form a subset called ComparisonQA-Hard, containing only hard low-frequency questions.
nan
Article 1117
Title@2025-05-25 (7): LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Title: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models | LLLMs: Eine datengestützte Untersuchung der sich entwickelnden Forschung über Grenzen großer Sprachmodelle | LLLMs:关于大语言模式限制的不断发展的研究数据驱动调查 2505.19240v1 |
Authors: Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Large language model (LLM) research has grown rapidly, along with increasing concern about LLMs’ limitations, such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification (validated against expert labels), and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.
nan
Article 1118
Title@2025-05-25 (7): Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Title: Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator | Bewertung der Textkreativität über verschiedene Domänen: Ein Datensatz und großer Sprachmodell-Evaluator | 评价跨不同域域的文本创造性:数据集和大语言模式评价员 2505.19236v1 |
Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
nan
Article 1119
Title@2025-05-25 (7): Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets
Title: Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets | Zuverlässige Ad-hoc-Wissenschaftliche Informationsextraktion: Eine Fallstudie zu zwei Materialdatensätzen | 争取实现可靠的特设科学信息提取:关于两个材料数据集的个案研究 2406.05348v3 |
Authors: Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton
We explore the ability of GPT-4 to perform ad-hoc schema-based information extraction from scientific literature. We assess specifically whether it can, with a basic prompting approach, replicate two existing materials science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis to assess where the model struggles to faithfully extract the desired information, and draw on their insights to suggest research directions to address this broadly important task.
Article 1120
Title@2025-05-25 (7): VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | VerifyBench: Benchmarking Referenzbasierte Prämiensysteme für große Sprachmodelle | 核查时间:大语言模式基准参考奖励制度基准 2505.15801v2 |
Authors: Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
Article 1121
Title@2025-05-25 (7): GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling
Title: GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling | GUARDIAN: LLM-Multiagent-Kollaborationen mit zeitlicher Graphenmodellierung sichern | GUARDIAN: 保护LLM 多机构协作与时间图建模 2505.19234v1 |
Authors: Jialong Zhou, Lichao Wang, Xiao Yang
The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder-decoder architecture, which incorporates an incremental training paradigm, learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN’s effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.
Article 1122
Title@2025-05-25 (7): Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
Title: Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More | Sprachmodelle, Graph Searching und Überwachung Ehebruch: Wenn mehr Aufsicht weniger ist und wie man mehr macht | 语言模式、图图搜索和监督通配:越少越少监督,如何做越多 2503.10542v3 |
Authors: Arvid Frydenlund
This work concerns the path-star task, a minimal example of searching over a graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$ that ends one of the arms, and is tasked with generating the arm containing $t$. The minimal nature of this task means only a single choice needs to be made: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task’s minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.
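To make the task concrete, the following is a minimal sketch of how a path-star instance could be generated, assuming integer node ids and equal-length arms. The function names (`make_path_star`, `solve_by_backtracking`) and the edge-list encoding are illustrative assumptions, not the paper's data format. The backward walk from $t$ is included only to show why the task is elementary for a conventional graph algorithm.

```python
import random


def make_path_star(num_arms, arm_len, seed=0):
    """Build one path-star instance (hypothetical encoding).

    Returns (edges, start, target, gold_path), where edges is a shuffled
    list of (parent, child) pairs and gold_path is the arm ending at target.
    """
    rng = random.Random(seed)
    start = 0
    # Random node ids, so an arm cannot be guessed from id order.
    labels = list(range(1, num_arms * arm_len + 1))
    rng.shuffle(labels)
    arms = [labels[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]

    edges = []
    for arm in arms:
        prev = start
        for node in arm:
            edges.append((prev, node))
            prev = node
    rng.shuffle(edges)  # present the graph as an unordered edge list

    target_arm = rng.randrange(num_arms)  # uniform choice: 1/D chance level
    target = arms[target_arm][-1]         # the target ends one of the arms
    return edges, start, target, [start] + arms[target_arm]


def solve_by_backtracking(edges, start, target):
    """Recover the gold arm by walking parent pointers back from the target."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


edges, s, t, gold = make_path_star(num_arms=5, arm_len=4, seed=1)
recovered = solve_by_backtracking(edges, s, t)
```

Every non-start node has exactly one parent, so the backward walk is unambiguous. A left-to-right decoder, by contrast, must commit to the correct arm at its first generated step.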
Article 1123
Title@2025-05-25 (7): The Impact of LoRA Adapters for LLMs on Clinical NLP Classification Under Data Limitations
Title: The Impact of LoRA Adapters for LLMs on Clinical NLP Classification Under Data Limitations | Die Auswirkungen von LoRA-Adaptern für LLMs auf die klinische NLP-Klassifikation unter Datenbeschränkungen | LoRA适应器对LLMLMLLM对临床NLP数据限制下分类的影响 2407.19299v2 |
Authors: Thanh-Dung Le, Ti Ti Nguyen, Vu Nguyen Ha, Symeon Chatzinotas, Philippe Jouvet, Rita Noumeir
Fine-tuning Large Language Models (LLMs) for clinical Natural Language Processing (NLP) poses significant challenges due to the domain gap and limited data availability. This study investigates the effectiveness of various adapter techniques, equivalent to Low-Rank Adaptation (LoRA), for fine-tuning LLMs in a resource-constrained hospital environment. We experimented with four structures (Adapter, Lightweight, TinyAttention, and Gated Residual Network (GRN)) as final layers for clinical notes classification. We fine-tuned biomedical pre-trained models, including CamemBERT-bio, AliBERT, and DrBERT, alongside two Transformer-based models. Our extensive experimental results indicate that i) employing adapter structures does not yield significant improvements in fine-tuning biomedical pre-trained LLMs, and ii) simpler Transformer-based models, trained from scratch, perform better under resource constraints. Among the adapter structures, GRN demonstrated superior performance with accuracy, precision, recall, and an F1 score of 0.88. Moreover, the total training time for LLMs exceeded 1000 hours, compared to under 6 hours for simpler transformer-based models, highlighting that LLMs are more suitable for environments with extensive computational resources and larger datasets. Consequently, this study demonstrates that simpler Transformer-based models can be effectively trained from scratch, providing a viable solution for clinical NLP tasks in low-resource environments with limited data availability. By identifying the GRN as the most effective adapter structure, we offer a practical approach to enhance clinical note classification without requiring extensive computational resources.
Article 1124
Title@2025-05-25 (7): The Overthinker’s DIET: Cutting Token Calories with DIfficulty-AwarE Training
Title: The Overthinker’s DIET: Cutting Token Calories with DIfficulty-AwarE Training | Das DIET des Überdenkers: Schneiden von Token Calories mit DIschwer-AwarE-Schulung | 过度思考家的DIET: 利用Difficulticry - AwarE 培训来切开托肯卡路里 2505.19217v1 |
Authors: Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun
Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these “token calories” by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose the Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.
Article 1125
Title@2025-05-25 (7): When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas | Wenn Ethik und Payoffs Diverge: LLM-Agenten in moralisch belasteten sozialen Dilemmas | 道德与报酬:道德道德与报酬:道德界的LLM代理人员充斥社会困境 2505.19212v1 |
Authors: Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs’ moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner’s dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent’s “self-interest” may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.
Article 1126
Title@2025-05-25 (7): Conformity in Large Language Models
Title: Conformity in Large Language Models | Konformität in großen Sprachmodellen | 大语言模式的合规性 2410.12428v2 |
Authors: Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, Andreas Vlachos
The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in popular LLMs. Our findings reveal that all tested models exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions, Devil’s Advocate and Question Distillation, to mitigate conformity, providing insights into building more robust language models.
Article 1127
Title@2025-05-25 (7): Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
Title: Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models | Segment-Level Diffusion: Ein Framework für kontrollierbare Langform-Generation mit Diffusions-Sprachmodellen | 局部级传播:具有传播语言模型的可控长龄一代框架 2412.11333v2 |
Authors: Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
Diffusion models have shown promise in text generation, but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion doesn’t model word-order dependencies explicitly and operates on short, fixed output windows, while passage-level diffusion struggles with learning robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into multiple latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on four datasets demonstrate that, when compared to other diffusion and autoregressive baselines, SLD achieves competitive or superior fluency, coherence, and contextual compatibility in automatic and human evaluations.
Article 1128
Title@2025-05-25 (7): MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Title: MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search | MOOSE-Chem2: Erforschung der LLM-Grenzwerte in feinkörniger wissenschaftlicher Hypothese durch hierarchische Suche | MOOSE-Chem2:通过等级搜索探索探索精密科学假设发现时的LLM限度 2505.19209v1 |
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs’ capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM’s internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
Article 1129
Title@2025-05-25 (7): SpeakStream: Streaming Text-to-Speech with Interleaved Data
Title: SpeakStream: Streaming Text-to-Speech with Interleaved Data | SpeakStream: Streaming von Text-zu-Speech mit interleaved Daten | 语音Stream:用断开数据流流流文本到语音 2505.19206v1 |
Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and inferenced on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency results in terms of first-token latency while maintaining the quality of non-streaming TTS systems.
Article 1130
Title@2025-05-25 (7): Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Title: Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Benign Proben Materie! Feinabstimmung auf Aussergewöhnliche Benign Proben stark bricht Sicherheit | 重大事件 重大事件 重大事件 安全 重大事件 重大事件 重大事件 重大事件 重大事件 2505.06843v2 |
Authors: Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.
Article 1131
Title@2025-05-25 (7): SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis | SimpleDeepSearcher: Deep Information Suche via Web-Powered Reasoning Trajektorie Synthesis | 简单深海earcher:通过网络动力理性轨迹合成寻求深度信息 2505.16834v2 |
Authors: Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, suffer from distributional mismatches in simulated environments, or incur prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
Article 1132
Title@2025-05-25 (7): iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Title: iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use | iTool: Verstärkte Feinsteuerung mit dynamischer Kalibrierung bei fortgeschrittenem Werkzeugeinsatz | i Tool:加强先进工具使用动态缺乏度校准的精细测试 2501.09766v4 |
Authors: Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Bing Qin, Ting Liu
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains decay significantly as synthetic data increases. The model struggles to benefit from more synthetic data, which cannot equip it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that this limitation usually manifests as a fragment deficiency (i.e., parameter errors) in responses. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search; and (2) iteratively pinpointing the model’s deficiency by constructing fine-grained preference pairs, and then addressing it through targeted preference optimization. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
Article 1133
Title@2025-05-25 (7): A partition cover approach to tokenization
Title: A partition cover approach to tokenization | Eine Partition Abdeckung Ansatz tokenization | 分区覆盖模式 2501.06246v2 |
Authors: Jia Peng Lim, Shawn Tan, Davin Choo, Hady W. Lauw
Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple $(1 - 1/e)$-approximation algorithm, GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GreedTok as the tokenizer, shows that GreedTok achieves lower bits per byte even when we control for either the total dataset proportion or total training tokens.
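For readers unfamiliar with the relaxation mentioned above, here is a minimal sketch of the textbook greedy algorithm for weighted maximum coverage, which carries the $(1 - 1/e)$ guarantee. It is not the authors' GreedTok or GreedWMC implementation. The function name, the input encoding (candidate sets keyed by name, a weight per element), and the toy data are assumptions made for illustration.

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Pick at most k sets, each time taking the largest marginal covered weight.

    sets:    dict mapping a candidate name to the set of elements it covers
    weights: dict mapping each element to a non-negative weight
    Returns (chosen_names, total_covered_weight).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            if name in chosen:
                continue
            # Marginal gain counts only elements that are not yet covered.
            gain = sum(weights[e] for e in elems if e not in covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining candidate adds any weight
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, sum(weights[e] for e in covered)


# Toy instance (hypothetical data): element 4 is heavy, so a set covering it
# is selected first, then the set with the best remaining marginal gain.
toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}}
toy_weights = {1: 1, 2: 1, 3: 1, 4: 5, 5: 1}
chosen, total = greedy_weighted_max_coverage(toy_sets, toy_weights, k=2)
```

In a tokenization setting, one might treat candidate tokens as the sets and weighted corpus positions as the elements. That mapping is only a loose analogy here, and the abstract does not specify it.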
Article 1134
Title@2025-05-25 (7): Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection
Title: Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection | Irreführung durch Inkonsistenz: Ein Maßstab für politische Inkonsistenzen | 以不一致性导致的错误领导:政治不一致调查基准 2505.19191v1 |
Authors: Nursulu Sagimbayeva, Ruveyda Betül Bahçeci, Ingmar Weber
Inconsistent political statements represent a form of misinformation. They erode public trust and pose challenges to accountability when left unnoticed. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the Inconsistency detection task and develop a scale of inconsistency types to prompt NLP research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators’ reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the models has reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.
Article 1135
Title@2025-05-25 (7): LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Title: LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling | LIMOPro: Verfeinerung für effizientes und effektives Skalieren von Testzeiten | LIMOP: 为高效率和高成效测试时间的缩放而改进理由 2505.19187v1 |
Authors: Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu
Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
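The pruning idea can be illustrated with a small sketch. The abstract does not give PIR's exact scoring formula, so the ratio used below (answer perplexity without a step divided by answer perplexity with the full chain) is an assumption, as are the function names, the threshold, and the toy scorer `toy_answer_nll`, which stands in for a real language model.

```python
import math


def step_importance(steps, answer_nll):
    """Score each step by how much removing it raises answer perplexity.

    answer_nll(steps) is assumed to return the mean negative log-likelihood
    of the final answer given the listed reasoning steps.
    """
    base_ppl = math.exp(answer_nll(steps))
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        scores.append(math.exp(answer_nll(ablated)) / base_ppl)
    return scores


def prune_functional_steps(steps, is_functional, answer_nll, threshold=1.05):
    """Drop only functional steps whose importance falls below the threshold.

    Progressive steps are always kept, whatever their score.
    """
    scores = step_importance(steps, answer_nll)
    return [
        step
        for step, functional, score in zip(steps, is_functional, scores)
        if not (functional and score < threshold)
    ]


def toy_answer_nll(steps):
    """Toy stand-in for a model: each 'key:' step lowers answer NLL by 0.5."""
    return 2.0 - 0.5 * sum(step.startswith("key:") for step in steps)


steps = ["key: set up the equation", "wait, let me double check", "key: solve for x"]
is_functional = [False, True, False]
kept = prune_functional_steps(steps, is_functional, toy_answer_nll)
```

In this toy run, removing the verification step leaves the answer perplexity unchanged, so it is pruned. The two progressive steps are retained regardless of their scores.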
Article 1136
Title@2025-05-25 (7): Talk to Your Slides: Language-Driven Agents for Efficient Slide Editing
Title: Talk to Your Slides: Language-Driven Agents for Efficient Slide Editing | Sprechen Sie mit Ihren Folien: Sprachgetriebene Agenten für effizientes Dia-Editing | 访问您的幻灯片: 用于高效幻灯片编辑的语言驱动器 2505.11604v3 |
Authors: Kyudan Jung, Hojun Cho, Jooyeol Yun, Soyoung Yang, Jaehyeok Jang, Jaegul Choo
Editing presentation slides remains one of the most common and time-consuming tasks faced by millions of users daily, despite significant advances in automated slide generation. Existing approaches have successfully demonstrated slide editing via graphical user interface (GUI)-based agents, offering intuitive visual control. However, such methods often suffer from high computational cost and latency. In this paper, we propose Talk-to-Your-Slides, an LLM-powered agent designed to edit slides in active PowerPoint sessions by leveraging structured information about slide objects rather than relying on image modality. The key insight of our work is designing the editing process with distinct high-level and low-level layers to facilitate interaction between user commands and slide objects. By providing direct access to application objects rather than screen pixels, our system enables 34.02% faster processing, 34.76% better instruction fidelity, and 87.42% cheaper operation than baselines. To evaluate slide editing capabilities, we introduce TSBench, a human-annotated dataset comprising 379 diverse editing instructions paired with corresponding slide variations in four categories. Our code, benchmark and demos are available at https://anonymous.4open.science/r/Talk-to-Your-Slides-0F4C.
Article 1137
Title@2025-05-25 (7): DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Title: DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation | DiTAR: Diffusion Transformer Autoregressive Modellierung für Sprachgenerierung | DITAR: 发声的传播变异器自动递减模型 2502.03930v3 |
Authors: Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
Article 1138
Title@2025-05-25 (7): Position: Enough of Scaling LLMs! Lets Focus on Downscaling
Title: Position: Enough of Scaling LLMs! Lets Focus on Downscaling | Position: Genug von Scaling LLMs! Konzentriert sich auf Downscaling | 位置: 缩放 LLM 已经足够! 让我们集中关注缩放缩放 2505.00985v3 |
Authors: Yash Goel, Ayan Sengupta, Tanmoy Chakraborty
We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.
Article 1139
Title@2025-05-25 (7): Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Title: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | Scaling Reasoning, Losing Control: Bewertung von Instruktionen in großen Reasoning-Modellen | 扩大理由、减少控制:根据大理由模型评价指示 2505.14810v2 |
Authors: Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng
Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.
Article 1140
Title@2025-05-25 (7): Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Title: Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge | Assistant-Guided Milderung von Lehrerpräferenz Bias in LLM-as-a-Richter | 助理辅导减轻在LLM-as-a法官中偏爱比阿斯的教师偏爱 2505.19176v1 |
Authors: Zhuo Liu, Moxin Li, Xun Deng, Qifan Wang, Fuli Feng
LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model’s responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and the feedback in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at https://github.com/Liuz233/AGDe-Judge.
Article 1141
Title@2025-05-25 (7): Mixture of Lookup Experts
Title: Mixture of Lookup Experts | Mischung von Lookup-Experten | 查找专家混合 2503.15798v2 |
Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang
Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert’s computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.
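The re-parameterization step can be illustrated with a toy sketch. Because each expert only sees the embedding-layer output for a token id, its output can be computed once per vocabulary entry and stored in a table. The sizes, the random weights, and the helper names (`expert_ffn`, `expert_lookup`) are hypothetical and chosen for illustration; this is not the authors' implementation, and expert routing is omitted.

```python
import random

random.seed(0)
VOCAB, DIM, HIDDEN = 8, 4, 6  # toy sizes (assumptions, not from the paper)


def rand_matrix(rows, cols):
    """Random weight matrix stored as a list of rows."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(VOCAB, DIM)  # embedding table: token id -> vector
w_in = rand_matrix(DIM, HIDDEN)      # expert FFN, first projection
w_out = rand_matrix(HIDDEN, DIM)     # expert FFN, second projection


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


def expert_ffn(token_id):
    """Training-time path: run the FFN on the token's embedding."""
    hidden = [max(0.0, h) for h in matvec(embedding[token_id], w_in)]  # ReLU
    return matvec(hidden, w_out)


# Before inference: precompute the expert output for every token id.
lut = [expert_ffn(t) for t in range(VOCAB)]


def expert_lookup(token_id):
    """Inference-time path: a table read replaces the FFN computation."""
    return lut[token_id]
```

In this sketch the table has one row per vocabulary entry, so it can live on slower storage and only the rows for the current input ids need to be fetched.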
Article 1142
Title@2025-05-25 (7): SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs
Title: SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | SpokenNativQA: Mehrsprachige Alltagsfragen für LLMs | SpokenNativQA: 每天多语种为LLM 询问 2505.19163v1 |
Authors: Firoj Alam, Md Arid Hasan, Shammur Absar Chowdhury
Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (https://huggingface.co/datasets/QCRI/SpokenNativQA) and the experimental scripts at (https://llmebench.qcri.org/) for the research community.
Article 1143
Title@2025-05-25 (7): CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
Title: CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter | CORAL: Lerne konsistente Repräsentationen über mehrstufiges Training mit leichterem spekulativen Entwurfer | CORAL: 利用轻型投机性起草者在多阶段培训中学习一致的代表性 2502.16880v3 |
Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi
Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffer in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.
Article 1144
Title@2025-05-25 (7): FISH-Tuning: Enhancing PEFT Methods with Fisher Information
Title: FISH-Tuning: Enhancing PEFT Methods with Fisher Information | FISH-Tuning: Verbesserung der PEFT-Methoden mit Fisher Information | FISH-Tuning:加强渔业信息PEFT方法 2504.04050v3 |
Authors: Kang Xue, Ming Dong, Xinhui Tu, Tingting He
The rapid growth in the parameter size of Large Language Models (LLMs) has spurred the development of Parameter-Efficient Fine-Tuning (PEFT) methods to mitigate the substantial computational costs of fine-tuning. Among these, Fisher Induced Sparse uncHanging (FISH) Mask is a selection-based PEFT technique that identifies a critical subset of pre-trained parameters using approximate Fisher information. While addition-based and reparameterization-based PEFT methods like LoRA and Adapter already fine-tune only a small number of parameters, the newly introduced parameters within these methods themselves present an opportunity for further optimization. Selectively fine-tuning only the most impactful among these new parameters could further reduce resource consumption while maintaining, or even improving, fine-tuning effectiveness. In this paper, we propose FISH-Tuning, a novel approach that incorporates FISH Mask into such PEFT methods, including LoRA, Adapter, and their variants. By leveraging Fisher information to identify and update only the most significant parameters within these added or reparameterized components, FISH-Tuning aims to achieve superior performance without increasing training time or inference latency compared to the vanilla PEFT methods. Experimental results across various datasets and pre-trained models demonstrate that FISH-Tuning consistently outperforms the vanilla PEFT methods when using the same proportion of trainable parameters. Code is available at https://anonymous.4open.science/r/FISH-Tuning-6F7C.
Article 1145
Title@2025-05-25 (7): Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
Title: Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | Sparse-to-Dense: Ein kostenloses Mittagessen für verlustfreies Beschleunigen des Videoverständnisses in LLMs | 简单到感:免费午餐,促进无损失地加速视频理解,LLMM 2505.19155v1 |
Authors: Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu
Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× wall-time speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
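The draft-then-verify loop described here can be sketched with toy stand-ins for the two models. This is a minimal greedy variant under stated assumptions: `draft_next` plays the fast (sparse) model, `target_next` the slow (dense) model, and both are hypothetical arithmetic functions rather than attention modules. The verification loop stands in for what would be a single batched forward pass.

```python
def target_next(seq):
    """Hypothetical slow model: deterministic next token from the prefix."""
    return (7 * sum(seq) + 3) % 11


def draft_next(seq):
    """Hypothetical fast model: agrees with the target except at some positions."""
    guess = target_next(seq)
    return (guess + 1) % 11 if len(seq) % 3 == 0 else guess


def greedy_decode(next_token, prompt, num_new):
    """Plain autoregressive decoding, one token per step."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(draft, target, prompt, num_new, k=4):
    """Greedy draft-then-verify; the output matches decoding with `target` alone."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # Draft phase: the fast model proposes k tokens autoregressively.
        proposed = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # Verify phase: the slow model checks every proposed position.
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if proposed[i] != expected:
                # Keep the accepted prefix and replace the first mismatch.
                proposed = proposed[:i] + [expected]
                break
        tokens.extend(proposed)
    return tokens[:goal]
```

Every token that survives verification equals what the slow model would have produced, which is the sense in which such a scheme is lossless; the speedup depends on how often the draft is accepted.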
Article 1146
Title@2025-05-25 (7): Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
Title: Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization | Speech-FT: Zusammenführen vortrainierter und fein abgestimmter Sprachdarstellungsmodelle für Cross-Task-Verallgemeinerung | 演讲-TF: 合并的预先培训和经过精练发言代表模式,供跨任务一般化使用 2502.12672v2 |
Authors: Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
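The weight-space interpolation step can be sketched as a per-parameter linear blend. This is a minimal illustration, assuming parameters are stored as flat lists keyed by name; the coefficient `alpha` and its direction (0 keeps the pre-trained weights, 1 keeps the fine-tuned ones) are assumptions, since the abstract does not state them.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned for each named parameter."""
    if pretrained.keys() != finetuned.keys():
        raise ValueError("parameter names must match")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [(1.0 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


# Toy usage with hypothetical parameter values.
pre = {"encoder.w": [0.0, 2.0]}
ft = {"encoder.w": [1.0, 4.0]}
merged = interpolate_weights(pre, ft, alpha=0.5)
```

Smaller values of `alpha` stay closer to the pre-trained model, which is the side of the trade-off that favors cross-task generalization.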
Article 1147
Title@2025-05-25 (7): Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation
Title: Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation | Divide-Then-Aggregat: Eine effiziente Tool-Learning-Methode über parallele Tool-Invokation | 分离后生成工具:通过平行工具使用使用效率高的工具学习方法 2501.12432v2 |
Authors: Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
Although current Large Language Models (LLMs) exhibit impressive capabilities, performing complex real-world tasks still requires tool learning. Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to interact with external environments, but they are limited in perceptual scope and lack adequate task-planning capability. To address these limitations, other studies introduce the Depth-First Search-based Decision Tree (DFSDT), which still suffers from high computational cost. In this paper, we introduce a novel parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama). First, we transform traditional tree-based tool search paths into a Directed Acyclic Graph (DAG) structure, generating a high-quality parallel tool invocation dataset. DTA-Llama is then trained on the dataset to learn to iteratively divide the current task into several parallel tool invocation sub-tasks and aggregate the invocation results to decide the next actions. Furthermore, we introduce an efficient inference framework inspired by the Process/Threads mechanism when applying DTA-Llama to practical tasks. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at https://corn0205.github.io/
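The divide-then-aggregate idea over a DAG can be sketched with the Python standard library. This is a minimal illustration, not the DTA-Llama inference framework: the tool names, the `deps` mapping, and the convention that each tool receives the results gathered so far are all hypothetical. Tools whose dependencies are satisfied run in parallel threads, and their results are collected before the next round.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_dag(tools, deps):
    """Run tools round by round; tools with satisfied dependencies run in parallel."""
    sorter = TopologicalSorter(deps)  # deps maps each tool to its predecessors
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())  # divide: independent invocations
            futures = {
                # Each call gets a snapshot of earlier results only.
                name: pool.submit(tools[name], dict(results))
                for name in ready
            }
            for name, future in futures.items():  # aggregate: collect outputs
                results[name] = future.result()
                sorter.done(name)
    return results


# Hypothetical tools: two independent lookups feed one planning step.
tools = {
    "weather": lambda prior: "sunny",
    "calendar": lambda prior: "free at 3pm",
    "plan": lambda prior: f"{prior['weather']}, {prior['calendar']}",
}
deps = {"weather": set(), "calendar": set(), "plan": {"weather", "calendar"}}
outputs = run_dag(tools, deps)
```

Here `weather` and `calendar` are dispatched in the same round, while `plan` waits for both, so a sequential chain of three calls becomes two rounds.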
Article 1148
Title@2025-05-25 (7): Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression | Verlagerung der KI-Effizienz von der modell-zentralen zur daten-zentralen Komprimierung | 将AI效率从示范目录转向数据中心压缩 2505.19147v1 |
Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency by reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community’s advancement.
Article 1149
Title@2025-05-25 (7): How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?
Title: How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching? | Wie wirkt sich eine Textvorverarbeitung auf die Ontologie aus? | 文本预处理管道如何影响本体学同步匹配? 2411.03962v6 |
Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
The classic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for syntactic ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on syntactic OM in 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Phase 1 text preprocessing (Tokenisation and Normalisation) is more effective than Phase 2 text preprocessing (Stop Words Removal and Stemming/Lemmatisation). We propose two novel approaches to repair unwanted false mappings caused by Phase 2 text preprocessing. One is an ad hoc logic-based repair approach that employs an ontology-specific check to find common words that cause false mappings. These words are stored in a reserved word set and applied before the text preprocessing. By leveraging the power of large language models (LLMs), we also propose a post hoc LLM-based repair approach. This approach utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings after the text preprocessing. It also overcomes the tendency towards unstable true mappings by injecting the classical text preprocessing pipeline via function calling. The experimental results show that these two approaches can improve the matching correctness and the overall matching performance.
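The two phases of the pipeline can be sketched in a few lines. This is a rough illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and `stem` is a naive suffix stripper standing in for a real stemmer or lemmatiser, not the tooling used in the paper.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "has", "is"}  # tiny illustrative list


def tokenise(label):
    """Split an ontology label on non-alphanumerics and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [t for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def normalise(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer or lemmatiser."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def phase1(label):
    """Phase 1: Tokenisation and Normalisation."""
    return normalise(tokenise(label))


def phase2(label):
    """Phase 2: Stop Words Removal and Stemming applied on top of Phase 1."""
    return stem(remove_stop_words(phase1(label)))
```

With these toy rules, `phase2("Reviews")` and `phase2("Reviewing")` collapse to the same token list even though the labels may denote different concepts, which is the kind of unwanted false mapping that Phase 2 can introduce.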
Article 1150
Title@2025-05-25 (7): Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Title: Efficient Reasoning for LLMs through Speculative Chain-of-Thought | Effiziente Begründung für LLMs durch spekulative Kette-of-Thought | 通过投机性研究链的探索,提高LLMs的效率 2504.19095v2 |
Authors: Jikai Wang, Juntao Li, Jianye Hou, Bowen Yan, Lijun Wu, Min Zhang
Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating average reasoning speed through large and small model collaboration. SCoT conducts thought-level drafting using a lightweight draft model. Then it selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting and the draft selection strategy maintains the prediction accuracy of the target model for complex tasks. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% and 21% to 49% for Deepseek-R1-Distill-Qwen-32B and Deepseek-R1-Distill-Llama-70B, respectively, while achieving near-target-model-level performance. Our code is available at https://github.com/Jikai0Wang/Speculative_CoT.
Article 1151
Title@2025-05-25 (7): SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
Title: SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data | SERL: Selbstspiel-Verstärkungs-Lernen für große Sprachmodelle mit begrenzten Daten | SeRL: 有限数据大语言模式自我强化学习 2505.20347v1 |
Authors: Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
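The majority-voting reward in the self-rewarding module can be sketched as follows. This is a minimal illustration, assuming each sampled response has already been reduced to a final-answer string; the binary reward values and the tie-breaking rule (the first-seen answer wins) are assumptions, not details from the paper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Give reward 1.0 to responses whose final answer matches the most common one."""
    if not answers:
        return []
    # On ties, most_common keeps the answer that was encountered first.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority else 0.0 for answer in answers]


# Four hypothetical samples for one generated instruction.
rewards = majority_vote_rewards(["42", "41", "42", "42"])
```

The rewards come only from agreement among the model's own samples, so no external annotation is involved.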
Article 1152
Title@2025-05-25 (7): Language Fusion for Parameter-Efficient Cross-lingual Transfer
Title: Language Fusion for Parameter-Efficient Cross-lingual Transfer | Sprachfusion für Parameter-Effizient Cross-lingual Transfer | 参数有效跨语言转让语言融合 2501.06892v2 |
Authors: Philipp Borchert, Ivan Vulić, Marie-Francine Moens, Jochen De Weerdt
Limited availability of multilingual text corpora for training language models often leads to poor performance on downstream tasks due to undertrained representation spaces for languages other than English. This ‘under-representation’ has motivated recent cross-lingual transfer methods to leverage the English representation space by e.g. mixing English and ‘non-English’ tokens at the input level or extending model parameters to accommodate new languages. However, these approaches often come at the cost of increased computational complexity. We propose Fusion for Language Representations (FLARE) in adapters, a novel method that enhances representation quality and downstream performance for languages other than English while maintaining parameter efficiency. FLARE integrates source and target language representations within low-rank (LoRA) adapters using lightweight linear transformations, maintaining parameter efficiency while improving transfer performance. A series of experiments across representative cross-lingual natural language understanding tasks, including natural language inference, question-answering and sentiment analysis, demonstrate FLARE’s effectiveness. FLARE achieves performance improvements of 4.9% for Llama 3.1 and 2.2% for Gemma 2 compared to standard LoRA fine-tuning on question-answering tasks, as measured by the exact match metric.
Article 1153
Title@2025-05-25 (7): Natural Language Generation from Visual Events: Challenges and Future Directions
Title: Natural Language Generation from Visual Events: Challenges and Future Directions | Natürliche Sprachgenerierung aus visuellen Veranstaltungen: Herausforderungen und Zukunftsrichtungen | 从视觉活动中产生自然语言:挑战和未来方向 2502.13034v2 |
Authors: Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
The ability to use natural language to talk about visual events is at the core of human intelligence and a crucial feature of any artificial intelligence system. In recent years, a substantial body of work in visually grounded NLP has focused on describing content depicted in single images. By contrast, comparatively less attention has been devoted to exhaustively modeling scenarios in which natural language is employed to interpret and talk about events presented through videos or sequences of images. In this position paper, we argue that any NLG task dealing with sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time and the features of the language used to interpret, describe, or narrate them. Therefore, solving these tasks requires models to be capable of identifying and managing such intricacies. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal problem. Consistently, we claim that these tasks pose a common set of challenges and share similarities in terms of modeling and evaluation approaches. Building on this perspective, we identify key open questions and propose several research directions for future investigation. We claim that improving language-and-vision models’ understanding of visual events is both timely and essential, given their growing applications. Additionally, this challenge offers significant scientific insight, advancing model development through principles of human cognition and language use.
Article 1154
Title@2025-05-25 (7): Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
Title: Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Latent-Space-Adversarial-Training mit post-aware Kalibrierung zur Verteidigung großer Sprachmodelle gegen Jailbreak-Angriffe | 为防御大型语言模式以防范越狱袭击而进行后天校准的后备空间对抗性培训 2501.10639v2 |
Authors: Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He
Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal feature attacks, precisely simulating agnostic jailbreak attack types requiring adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, the LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
Article 1155
Title@2025-05-25 (7): RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models
Title: RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models | Alles abrufen: Ein mehrsprachiger, benannter Entity-Erkennungs-Rahmen mit großen Sprachmodellen | 检索全部:多语种实体识别框架,带有大语言模式 2505.19128v1 |
Authors: Jin Zhang, Fan Gao, Linyu Li, Yongbin Yu, Xiangxiang Wang, Nyima Tashi, Gadeng Luosang
The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from “prompt-guided inference” to “prompt-driven learning.” Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.
Article 1156
Title@2025-05-25 (7): MMATH: A Multilingual Benchmark for Mathematical Reasoning
Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning | MMATH: Mehrsprachiger Benchmark für mathematische Vernunft | MMATH: 数学理由多语种基准 2505.19126v1 |
Authors: Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
nan
Article 1157
Title@2025-05-25 (7): Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models
Title: Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models | Mehrsprachige Ethische Bias: Der MSQAD mit statistischen Hypothesentests für große Sprachmodelle | 跳入多语言伦理比喻:高语言模型统计假设测试的MSQAD 2505.19121v1 |
Authors: Seunguk Yu, Juhwan Choi, Youngbin Kim
Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.
nan
Article 1158
Title@2025-05-25 (7): Controlling Language Confusion in Multilingual LLMs
Title: Controlling Language Confusion in Multilingual LLMs | Sprachkonfusion in mehrsprachigen LLMs kontrollieren | 多语种LLM中控制语言混杂 2505.19116v1 |
Authors: Nahyun Lee, Yeongseo Woo, Hyunwoo Ko, Guijin Son
Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, loss trajectories during the pretraining phase show that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.
nan
Article 1159
Title@2025-05-25 (7): Is Compression Really Linear with Code Intelligence?
Title: Is Compression Really Linear with Code Intelligence? | Ist Kompression wirklich linear mit Code Intelligence? | 压缩真的有代码情报线条吗? 2505.11441v2 |
Authors: Xianzhen Luo, Shijie Xuyang, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs’ code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve’s tail under specific, limited conditions. Our work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework in the code domain.
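As a rough illustration of the bits-per-character metric mentioned in this abstract, the sketch below converts per-token log-probabilities (natural log) into BPC. The function name and the toy inputs are hypothetical; the abstract does not specify the authors' tokenization or normalization, so this shows only the standard definition, not their evaluation pipeline.

```python
import math


def bits_per_character(token_log_probs, text):
    """Return total negative log-likelihood in bits divided by character count.

    token_log_probs: natural-log probabilities a model assigned to each token.
    text: the raw string those tokens encode (assumed non-empty).
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_log_probs)
    total_bits = total_nats / math.log(2)  # convert nats to bits
    return total_bits / len(text)


# Toy example (hypothetical values): two tokens covering four characters.
# The tokens cost 1 bit and 2 bits respectively, so 3 bits over 4 characters.
example_text = "abcd"
example_log_probs = [math.log(0.5), math.log(0.25)]
example_bpc = bits_per_character(example_log_probs, example_text)
```

Lower BPC means the model compresses the text better. Normalizing by characters rather than tokens keeps the figure comparable across models with different tokenizers.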
nan
Article 1160
Title@2025-05-25 (7): Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering
Title: Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering | Selbstkritische iterative Begründung für Multi-Hop-Fragebeantwortung | 多点问答问题解答自创性指导性迭代理由 2505.19112v1 |
Authors: Zheng Chu, Huiming Fan, Jingchang Chen, Qianyu Wang, Mingda Yang, Jiafeng Liang, Zhongjie Wang, Hao Li, Guo Tang, Ming Liu, Bing Qin
Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these issues, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at GitHub: https://github.com/zchuz/SiGIR-MHQA.
nan
Article 1161
Title@2025-05-25 (7): Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Title: Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling | Verwandeln von Müll in Schatz: Beschleunigen von Inferenzen von großen Sprachmodellen mit Token-Recycling | 将垃圾垃圾变成宝库:加快使用 Tok 回收利用大语言模型的推论 2408.08696v3 |
Authors: Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu
Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based training-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. It stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires less than 2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing training-free methods by 30% and even a widely recognized training method by 25%.
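The draft-tree construction described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A dictionary stands in for the adjacency matrix, mapping each token id to its stored candidate next tokens. The function names, the depth and node limits, and the toy token ids are all hypothetical, and verification through tree attention is omitted.

```python
from collections import deque


def build_draft_tree(adjacency, root_token, max_depth, max_nodes):
    """Expand a draft tree breadth-first from the last accepted token.

    adjacency: dict mapping token id -> list of candidate next token ids
               (a sparse stand-in for the adjacency matrix).
    Returns a list of (node_id, parent_id, token, depth) tuples.
    """
    nodes = [(0, -1, root_token, 0)]
    queue = deque([0])
    while queue and len(nodes) < max_nodes:
        parent_id = queue.popleft()
        _, _, token, depth = nodes[parent_id]
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for candidate in adjacency.get(token, []):
            if len(nodes) >= max_nodes:
                break
            node_id = len(nodes)
            nodes.append((node_id, parent_id, candidate, depth + 1))
            queue.append(node_id)
    return nodes


def update_adjacency(adjacency, token, new_candidates):
    """Overwrite the stored candidates for a token with fresh ones from decoding."""
    adjacency[token] = list(new_candidates)


# Toy example with made-up token ids.
toy_adjacency = {1: [2, 3], 2: [4], 3: [5, 6]}
toy_tree = build_draft_tree(toy_adjacency, root_token=1, max_depth=2, max_nodes=8)
update_adjacency(toy_adjacency, 4, [7, 8])
```

In a full system each root-to-leaf path in the tree would be checked by the target model in one forward pass, and the candidates produced during that pass would feed `update_adjacency`.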
nan
Article 1162
Title@2025-05-25 (7): MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
Title: MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset | MALAMUTE: Ein multilingualer, hochgranularer, musterloser, bildungsbasierter Probing-Datensatz | 多种语文、高语种、无模版、以教育为基础的探测数据集 2412.10105v2 |
Authors: Sagi Shaier, George Arthur Baker, Chiranthan Sridhar, Lawrence E Hunter, Katharina von der Wense
Language models (LMs) have excelled in various broad domains. However, to ensure their safe and effective integration into real-world educational settings, they must demonstrate proficiency in specific, granular areas of knowledge. Existing cloze-style benchmarks, commonly used to evaluate LMs’ knowledge, have three major limitations. They: 1) do not cover the educational domain; 2) typically focus on low-complexity, generic knowledge or broad domains, which do not adequately assess the models’ knowledge in specific subjects; and 3) often rely on templates that can bias model predictions. Here, we introduce MALAMUTE, a multilingual, template-free, and highly granular probing dataset comprising expert-written, peer-reviewed probes from 71 university-level textbooks across three languages (English, Spanish, and Polish). MALAMUTE is the first education-based cloze-style dataset. It covers eight domains, each with up to 14 subdomains, further broken down into concepts and concept-based prompts, totaling 33,361 university curriculum concepts and 116,887 prompts. MALAMUTE’s fine granularity, educational focus, and inclusion of both sentence-level and paragraph-level prompts make it an ideal tool for evaluating LMs’ course-related knowledge. Our evaluation of masked and causal LMs on MALAMUTE shows that despite overall proficiency, they have significant gaps in knowledge when examined closely on specific subjects, hindering their safe use in classrooms and underscoring the need for further development.
nan
Article 1163
Title@2025-05-25 (7): CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Title: CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models | CCHall: Ein neuartiger Benchmark für gemeinsame Cross-Lingual- und Cross-Modal Halluzinationen Detection in großen Sprachmodellen | CCHall:在大语言模型中联合跨语言和跨模式幻觉探测新基准 2505.19108v1 |
Authors: Yongheng Zhang, Xu Liu, Ruoxi Zhou, Qiguang Chen, Hao Fei, Wenpeng Lu, Libo Qin
Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
nan
Article 1164
Title@2025-05-25 (7): WHISTRESS: Enriching Transcriptions with Sentence Stress Detection
Title: WHISTRESS: Enriching Transcriptions with Sentence Stress Detection | WHISTRESS: Anreicherung von Transkriptionen mit Satz-Stress-Erkennung | WHISTRESS: 增加刑期压力感应检测的追踪 2505.19103v1 |
Authors: Iddo Yosha, Dorin Shteyman, Yossi Adi
Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training dataset for sentence stress detection, produced by a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: https://pages.cs.huji.ac.il/adiyoss-lab/whistress.
nan
Article 1165
Title@2025-05-25 (7): ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning
Title: ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning | ASPO: Adaptive Sentence-Level-Preference-Optimierung für eine feinkörnige multimodale Begründung | ASPO: 调整性判决一级优惠优化有偿多模式理由 2505.19100v1 |
Authors: Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai
Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
nan
Article 1166
Title@2025-05-25 (7): AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios
Title: AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios | Beschwerdesache: Ein Datensatz und Benchmark für zivilrechtliche Beschwerdeszenarien | 上诉案例:民事案件上诉设想情况数据集和基准 2505.16514v2 |
Authors: Yuting Huang, Meitong Guo, Yiquan Wu, Ang Li, Xiaozhong Liu, Keting Yin, Changlong Sun, Fei Wu, Kun Kuang
Recent advances in LegalAI have primarily focused on individual case judgment analysis, often overlooking the critical appellate process within the judicial system. Appeals serve as a core mechanism for error correction and ensuring fair trials, making them highly significant both in practice and in research. To address this gap, we present the AppealCase dataset, consisting of 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases. The dataset also includes detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether there is new information in the second instance. Based on these annotations, we propose five novel LegalAI tasks and conduct a comprehensive evaluation across 20 mainstream models. Experimental results reveal that all current models achieve less than 50% F1 scores on the judgment reversal prediction task, highlighting the complexity and challenge of the appeal scenario. We hope that the AppealCase dataset will spur further research in LegalAI for appellate case analysis and contribute to improving consistency in judicial decision-making.
nan
Article 1167
Title@2025-05-25 (7): ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models | ReadBench: Vermessen der Dichte an Text Visuelle Lesefähigkeit von Vision-Sprachen-Modellen | ” 阅读 “ :衡量视觉-语言模型的阅读能力 2505.19091v1 |
Authors: Benjamin Clavié, Florian Brand
Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs’ ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench.
nan
Article 1168
Title@2025-05-25 (7): Towards Harmonized Uncertainty Estimation for Large Language Models
Title: Towards Harmonized Uncertainty Estimation for Large Language Models | Hin zu einer harmonisierten Ungewissheitsschätzung für große Sprachmodelle | 争取为大语言模式统一不确定性估算 2505.19073v1 |
Authors: Rui Li, Jing Long, Muge Qi, Heming Xia, Lei Sha, Peiyi Wang, Zhifang Sui
To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis shows that these methods fail to strike a harmonized estimation across indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation), a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, with consistent improvements of up to 60% over existing methods.
nan
Article 1169
Title@2025-05-25 (7): Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors
Title: Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors | Training Turn-by-Turn Prüfer für Dialog-Tutoring-Agenten: Der seltsame Fall von LLMs als Ihre Coding Tutoren | 对话教学代理培训转弯验证员培训:LLMs作为你的编码导师的好奇案例 2502.13311v3 |
Authors: Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai
Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student’s knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.
nan
Article 1170
Title@2025-05-25 (7): UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
Title: UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models | UNCERTAINTY-LINE: Längeninvariante Schätzung der Unsicherheit für große Sprachmodelle | UNCERTAINTY-LINE: 大语言模型不确定性的长 动 变 动 估测 2505.19060v1 |
Authors: Roman Vashurin, Maiya Goloburda, Preslav Nakov, Maxim Panov
Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods across multiple metrics and models.
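The debiasing step described in this abstract (regress uncertainty on output length, keep the residuals) might look like the sketch below. It assumes a simple ordinary-least-squares line fit, because the abstract does not state the exact regression form. The function name and the sample values are hypothetical.

```python
def length_invariant_scores(uncertainties, lengths):
    """Fit uncertainty = intercept + slope * length by least squares
    and return the residuals as length-corrected scores."""
    n = len(uncertainties)
    if n != len(lengths) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_len = sum(lengths) / n
    mean_unc = sum(uncertainties) / n
    var_len = sum((x - mean_len) ** 2 for x in lengths)
    if var_len == 0:
        # All outputs have the same length: only centering is possible.
        return [u - mean_unc for u in uncertainties]
    cov = sum((x - mean_len) * (u - mean_unc)
              for x, u in zip(lengths, uncertainties))
    slope = cov / var_len
    intercept = mean_unc - slope * mean_len
    return [u - (intercept + slope * x) for x, u in zip(lengths, uncertainties)]


# Made-up scores that grow with output length, plus some noise.
sample_lengths = [10, 20, 30, 40]
sample_uncertainties = [1.0, 2.1, 2.9, 4.0]
sample_residuals = length_invariant_scores(sample_uncertainties, sample_lengths)
```

The residuals are uncorrelated with length by construction, so ranking outputs by residual no longer favors short or long generations. Because the fit only needs the scores and lengths, it can be applied post hoc to any UQ measure.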
nan
Article 1171
Title@2025-05-25 (7): An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Title: An Embarrassingly Simple Defense Against LLM Abliteration Attacks | Eine erschreckend einfache Verteidigung gegen LLM-Abliterationsangriffe | 一种令人尴尬的简单防御 对付LLM 缩写攻击 2505.19056v1 |
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping by at most 10%, whereas baseline models’ refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
nan
Article 1172
Title@2025-05-25 (7): Efficient Data Selection at Scale via Influence Distillation
Title: Efficient Data Selection at Scale via Influence Distillation | Effiziente Datenauswahl auf Scale durch Einflussdestillation | 通过影响蒸馏在规模上高效数据选择 2505.19051v1 |
Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample’s influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is precisely computed for a small subset of “landmark” samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5× faster selection.
nan
Article 1173
Title@2025-05-25 (7): SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Title: SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models | SliM-LLM: Salience-getriebene Mixed-Precision-Quantisierung für große Sprachmodelle | SliM-LLM:大语言模型的盐度驱动混合精度量 2405.14917v2 |
Authors: Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience. With its structured partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM
nan
Article 1174
Title@2025-05-25 (7): PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs
Title: PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs | PII-Scope: Eine umfassende Studie über Trainingsdaten PII-Extraktionsangriffe in LLMs | PII-范围:关于培训数据的综合研究 2410.06704v2 |
Authors: Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou
In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.
nan
Article 1175
Title@2025-05-25 (7): Domain Adaptation of Foundation LLMs for e-Commerce
Title: Domain Adaptation of Foundation LLMs for e-Commerce | Domain-Anpassung der Stiftung LLMs für e-Commerce | 用于电子商务的 “ 基础基础改造 “ 领域改造 2501.09706v3 |
Authors: Christian Herold, Michael Kozielski, Tala Bazazo, Pavel Petrushkov, Patrycja Cieplicka, Dominika Basaj, Yannick Versley, Seyyed Hadi Hashemi, Shahram Khadivi
We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted towards the e-commerce domain. These models are meant as foundation models with deep knowledge about e-commerce, that form a base for instruction- and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce specific evaluation tasks. We show that, when carefully choosing the training setup, the Llama 3.1 models can be adapted towards the new domain without sacrificing significant performance on general domain tasks. We also explore the possibility of merging the adapted model and the base model for a better control of the performance trade-off between domains.
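The merging idea mentioned at the end of this abstract can be illustrated with plain linear interpolation of parameters. This is an assumption made for illustration only: the abstract does not say which merging scheme the authors use. The parameter names and values below are made up.

```python
def merge_weights(base, adapted, alpha):
    """Linearly interpolate two parameter sets (assumed merging scheme).

    base, adapted: dicts mapping parameter name -> list of floats.
    alpha: 0.0 returns the base model, 1.0 returns the adapted model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * b + alpha * a
                        for b, a in zip(base[name], adapted[name])]
    return merged


# Toy parameters standing in for real model tensors.
toy_base = {"layer.weight": [0.0, 2.0]}
toy_adapted = {"layer.weight": [1.0, 4.0]}
toy_merged = merge_weights(toy_base, toy_adapted, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 gives one knob for trading general-domain behavior against domain-specific behavior, which is the kind of control the abstract alludes to.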
nan
Article 1176
Title@2025-05-25 (7): Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Title: Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models | Speech-IFEval: Bewertung von Instruktions-Following und Quantifying Katastrophic Forgetting in Speech-Aware Language Models | 语言-语言语言评估:评价在语言-语言软件模型中遵守教学和量化灾难性遗忘的情况 2505.19037v1 |
Authors: Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.
nan
Article 1177
Title@2025-05-25 (7): DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Title: DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models | DiffPO: Diffusion-gestylte Preference-Optimierung zur effizienten Inferenz-Zeit-Ausrichtung großer Sprachmodelle | DiffPO: 大语言模式有效推论-时间协调最佳优化 2503.04240v3 |
Authors: Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu
Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.
nan
Article 1178
Title@2025-05-25 (7): SQUiD: Synthesizing Relational Databases from Unstructured Text
Title: SQUiD: Synthesizing Relational Databases from Unstructured Text | SQUiD: Synthese von relationalen Datenbanken aus unstrukturiertem Text | SQUiD: 从无结构文本中合成关系数据库 2505.19025v1 |
Authors: Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, Amrita Roy Chowdhury
Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.
nan
Article 1179
Title@2025-05-25 (7): Fractured Chain-of-Thought Reasoning
Title: Fractured Chain-of-Thought Reasoning | Zersplitterte Kette von nachdenklichen Gründen | 断断断断断断断断断断断断的探讨链原因 2505.12992v2 |
Authors: Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
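The three axes named in the abstract can be pictured as nested loops over trajectories, truncation depths, and final answers. The sketch below is an illustration only: `fractured_sample`, `toy_answer`, and the representation of a trajectory as a list of reasoning steps are hypothetical and are not taken from the authors' repository.

```python
def fractured_sample(trajectories, depths, m, answer_fn):
    """Collect m answers for every (trajectory, truncation depth) pair.

    trajectories: list of reasoning traces, each a list of steps
    depths:       fractions in (0, 1]; 1.0 keeps the full trace
    m:            number of final answers sampled per truncated trace
    answer_fn:    callable(prefix, sample_index) -> answer (assumed interface)
    """
    samples = []
    for t_index, trace in enumerate(trajectories):
        for depth in depths:
            cut = int(len(trace) * depth)  # number of steps kept
            prefix = trace[:cut]
            for k in range(m):
                samples.append((t_index, depth, answer_fn(prefix, k)))
    return samples


def toy_answer(prefix, sample_index):
    # Stand-in for a model call that answers directly from a partial trace.
    return f"answer-after-{len(prefix)}-steps"


trajectories = [["s1", "s2", "s3", "s4"], ["t1", "t2"]]
depths = [0.5, 1.0]
samples = fractured_sample(trajectories, depths, m=3, answer_fn=toy_answer)
```

With 2 trajectories, 2 depths, and 3 answers each, the grid holds 12 samples; changing any one of the three counts moves along a single axis of the trade-off.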
nan
Article 1180
Title@2025-05-25 (7): AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale
Title: AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale | AM-Thinking-v1: Die Grenzen der Vernunft auf 32B-Skala verbessern | AM- Thinking-v1: 推进32B级的理性前沿 2505.08311v2 |
Authors: Yunjie Ji, Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li
We present AM-Thinking-v1, a 32B dense language model that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on Hugging Face at https://huggingface.co/a-m-team/AM-Thinking-v1.
nan
Article 1181
Title@2025-05-25 (7): Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
Title: Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis | Ausbildung nichtlinearer Transformer für den Schlussfolgerungsketten-of-Thought: Eine theoretische Generalisierungsanalyse | 培训非线性非线性变换器,用于研究链推论:理论一般分析 2410.02167v3 |
Authors: Hongkang Li, Songtao Lu, Pin-Yu Chen, Xiaodong Cui, Meng Wang
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query with multiple examples, each containing multiple intermediate steps. Despite its empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability, so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the training samples and iterations required to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noise and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.
nan
Article 1182
Title@2025-05-25 (7): CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language
Title: CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language | CrosGrpsABS: Cross-Attention über syntaktische und semantische Graphen zur aspektbasierten Sentimentanalyse in einer Sprache mit geringem Ressourcenbedarf | CrossGrpsABS:对用于低源语言频谱感应分析的同步和语义图的交叉注意 2505.19018v1 |
Authors: Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, Md. Rajib Hossain, Md. Saifur Rahman, A. B. M. Shawkat Ali
Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. Existing research has largely focused on resource-rich languages like English, leveraging large annotated datasets, pre-trained models, and language-specific tools. These resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. CrosGrpsABS combines transformer-based contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional cross-attention, the model effectively fuses local syntactic structure with global semantic context, resulting in improved sentiment classification performance across both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain in the SemEval 2014 Task 4 benchmark.
nan
Article 1183
Title@2025-05-25 (7): Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
Title: Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection | 共同-DWG:多模式进攻性攻击物质探测联合加速维维维-韦兹交织和专家混合 2505.19010v1 |
Authors: Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
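Dimension-wise gating, as described above, weights each feature channel separately instead of applying one scalar weight per modality. The following is a minimal sketch under stated assumptions: the abstract gives no formula, so the sigmoid gate, the convex combination of text and image features, and the names `dimension_wise_gate` and `gate_logits` are illustrative choices rather than the authors' architecture. In a real model the gate logits would come from a learned network.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dimension_wise_gate(text_feat, image_feat, gate_logits):
    """Fuse two feature vectors with one gate value per dimension.

    Each output channel is g * text + (1 - g) * image, where
    g = sigmoid(gate_logit) for that channel (assumed form).
    """
    if not (len(text_feat) == len(image_feat) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    return [
        g * t + (1.0 - g) * v
        for g, t, v in zip(gates, text_feat, image_feat)
    ]


# Hypothetical 2-dimensional features already projected to a shared space.
text_feat = [1.0, 2.0]
image_feat = [3.0, 4.0]
fused_equal = dimension_wise_gate(text_feat, image_feat, [0.0, 0.0])
fused_mixed = dimension_wise_gate(text_feat, image_feat, [50.0, -50.0])
```

With logits of zero every gate is 0.5 and the channels are averaged; a large positive logit lets the text channel dominate and a large negative one lets the image channel dominate.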
nan
Article 1184
Title@2025-05-25 (7): AAAR-1.0: Assessing AI’s Potential to Assist Research
Title: AAAR-1.0: Assessing AI’s Potential to Assist Research | AAAR-1.0: Bewertung des Potenzials von KI zur Unterstützung der Forschung | AAAR-1.0:评估大赦国际协助研究的潜力 2410.22394v4 |
Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.
nan
Article 1185
Title@2025-05-25 (7): VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | VerIPO: Pflege der langen Vernunft in Video-LLMs über die iterative Politikoptimierung von Prüfern | VERIPO:通过验证和研究的迭代政策优化在视频LLMs中培养长期理由 2505.19000v1 |
Authors: Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs’ capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier uses small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO’s expansive search and DPO’s targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
nan
Article 1186
Title@2025-05-25 (7): Visual Program Distillation with Template-Based Augmentation
Title: Visual Program Distillation with Template-Based Augmentation | Visuelle Programmdestillation mit Template-basierter Augmentation | 利用基于模板的增量进行视觉程序蒸馏 2412.08564v3 |
Authors: Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
nan
Article 1187
Title@2025-05-25 (7): FiLLM – A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)
Title: FiLLM – A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM) | FiLLM – Ein philippinisch optimiertes Large Language Model auf Basis von Southeast Asia Large Language Model (SEALLM) | FILLM – – 基于东南亚大语言模型的菲律宾最佳大语言模型(SEALM) 2505.18995v1 |
Authors: Carlos Jude G. Maminta, Isaiah Job Enriquez, Deandre Nigel Nunez, Michael B. Dela Fuente
This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that CalamanCy outperforms FiLLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.
nan
Article 1188
Title@2025-05-25 (7): Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example | Verstärktes Lernen zur Vernunft in großen Sprachmodellen mit einem Trainingsbeispiel | 采用 “ 一个培训实例 “ 采用大语言模式强化学习 2504.20571v2 |
Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the “grokking” phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
nan
Article 1189
Title@2025-05-25 (7): LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Title: LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts | LLMs kennen ihre Schwachstellen: Enthüllen Sie Sicherheitslücken durch natürliche Verteilungsverschiebungen | LLM女士知道他们的脆弱性:通过自然分布变化实现的未覆盖的安全差距 2410.10700v2 |
Authors: Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour’s actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
nan
Article 1190
Title@2025-05-25 (7): One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models
Title: One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models | One-for-All Pruning: Ein universelles Modell zur kundenspezifischen Kompression großer Sprachmodelle | ” 一为普普普 “ :大语言模式定制压缩通用模式 2505.12216v2 |
Authors: Rongguang Ye, Ming Tang
Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods have demonstrated satisfactory performance in handling a single user’s compression request, their processing time increases linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategy. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and the non-differentiable nature of the pruning process, which hinders gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines in processing 64 requests, while maintaining comparable accuracy to baselines.
nan
Article 1191
Title@2025-05-25 (7): Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers
Title: Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers | Automatisierte Vertrauenswürdigkeit Oracle Generation für Machine Learning Text Klassifikatoren | 机械学习文字分类的自动可信赖性甲骨文生成 2410.22663v4 |
Authors: Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, Aldeida Aleti
Machine learning (ML) for text classification has been widely used in various domains. These applications can significantly impact ethics, economics, and human behavior, raising serious concerns about trusting ML decisions. Studies indicate that conventional metrics are insufficient to build human trust in ML models. These models often learn spurious correlations and predict based on them. In the real world, their performance can deteriorate significantly. To avoid this, a common practice is to test whether predictions are reasonable based on valid patterns in the data. Along with this, a challenge known as the trustworthiness oracle problem has been introduced. Due to the lack of automated trustworthiness oracles, the assessment requires manual validation of the decision process disclosed by explanation methods. However, this is time-consuming, error-prone, and unscalable. We propose TOKI, the first automated trustworthiness oracle generation method for text classifiers. TOKI automatically checks whether the words contributing the most to a prediction are semantically related to the predicted class. Specifically, we leverage ML explanations to extract the decision-contributing words and measure their semantic relatedness with the class based on word embeddings. We also introduce a novel adversarial attack method that targets trustworthiness vulnerabilities identified by TOKI. To evaluate their alignment with human judgement, experiments are conducted. We compare TOKI with a naive baseline based solely on model confidence, and the TOKI-guided adversarial attack method with A2T, a SOTA adversarial attack method. Results show that relying on prediction uncertainty cannot effectively distinguish between trustworthy and untrustworthy predictions, that TOKI achieves 142% higher accuracy than the naive baseline, and that the TOKI-guided attack method is more effective with fewer perturbations than A2T.
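The core check described in the abstract, whether the most influential words are semantically related to the predicted class, can be sketched with cosine similarity over word embeddings. This is an illustration under assumptions, not TOKI itself: the function names, the mean-similarity aggregation, the 0.5 threshold, and the two-dimensional toy embeddings are all hypothetical, and the explanation method that selects the top words is taken as given.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_trustworthy(top_words, class_name, embeddings, threshold=0.5):
    """Flag a prediction as trustworthy if its most influential words
    are, on average, close to the class name in embedding space.

    The averaging rule and threshold are assumptions for this sketch.
    """
    class_vec = embeddings[class_name]
    sims = [
        cosine(embeddings[w], class_vec)
        for w in top_words
        if w in embeddings  # skip out-of-vocabulary words
    ]
    if not sims:
        return False
    return sum(sims) / len(sims) >= threshold


# Toy 2-d embeddings (hypothetical values, not from a trained model).
embeddings = {
    "sports": [1.0, 0.0],
    "football": [0.9, 0.1],
    "goal": [0.8, 0.2],
    "the": [0.0, 1.0],
    "monday": [0.1, 0.9],
}
related = is_trustworthy(["football", "goal"], "sports", embeddings)
spurious = is_trustworthy(["the", "monday"], "sports", embeddings)
```

In this toy setup a prediction driven by "football" and "goal" passes the check, while one driven by "the" and "monday" is flagged as relying on words unrelated to the class.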
nan
Article 1192
Title@2025-05-25 (7): STRICT: Stress Test of Rendering Images Containing Text
Title: STRICT: Stress Test of Rendering Images Containing Text | STRICT: Stresstest von Rendering-Bildern mit Text | STICT: 含有文字的图像的显示压力测试 2505.18985v1 |
Authors: Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
nan
Article 1193
Title@2025-05-25 (7): LLMScan: Causal Scan for LLM Misbehavior Detection
Title: LLMScan: Causal Scan for LLM Misbehavior Detection | LLMScan: Kausalscan zur Erkennung von LLM-Missverhalten | LLMScan:用于LLM Misbehavavor探测的成因扫描 2410.16638v4 |
Authors: Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang
Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM’s ‘brain’ behaves differently when misbehaving. By analyzing the causal contributions of the LLM’s input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
nan
Article 1194
Title@2025-05-25 (7): AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models
Title: AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models | AI4Math: Ein nativer spanischer Benchmark für mathematische Grundlagenforschung auf Universitätsebene in großen Sprachmodellen | AI4Matth:关于大语言模式中大学一级数学原因的土著西班牙基准 2505.18978v1 |
Authors: Miguel Angel Peñaloza Perez, Bruno Lopez Orozco, Jesus Tadeo Cruz Soto, Michelle Bruno Hernandez, Miguel Angel Alvarado Gonzalez, Sandra Malagon
Existing mathematical reasoning benchmarks are predominantly English-only or translation-based, which can introduce semantic drift and mask language-specific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university-level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step-by-step human solution. We evaluate six large language models (GPT-4o, GPT-4o mini, o3 mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B) under four configurations: zero-shot and chain-of-thought, each in Spanish and English. The top models (o3 mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT-4o even performing better on Spanish problems in the zero-shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.
nan
Article 1195
Title@2025-05-25 (7): PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Title: PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues | PersuasiveToM: Ein Benchmark für die Bewertung der Maschinentheorie des Geistes in überzeugenden Dialogen | M:在有影响的对话中评价心理机器理论的基准 2502.21017v2 |
Authors: Fangxu Yu, Lai Jiang, Shenyi Huang, Zhen Wu, Xinyu Dai
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social scenarios. Although recent studies have evaluated ToM in Large Language Models (LLMs), existing benchmarks focus on simplified settings (e.g., Sally-Anne-style tasks) and overlook the complexity of real-world social interactions. To mitigate this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework contains two core tasks: ToM Reasoning, which tests tracking of evolving desires, beliefs, and intentions; and ToM Application, which assesses the use of inferred mental states to predict and evaluate persuasion strategies. Experiments across eight leading LLMs reveal that while models excel on multiple questions, they struggle with tasks that require tracking the dynamics and shifts of mental states and comprehensively understanding mental states across the whole dialogue. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
Article 1196
Title@2025-05-25 (7): Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE
Title: Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE | Wird architektonische Komplexität überbewertet? Wettbewerbsfähige und interpretierbare Wissensgraphenvervollständigung mit RelatE | 建筑复杂程度是否被高估了? 2505.18971v1 |
Authors: Abhijit Chakraborty, Chahana Dahal, Ashutosh Balasubramaniam, Tejas Anvekar, Vivek Gupta
We revisit the efficacy of simple, real-valued embedding models for knowledge graph completion and introduce RelatE, an interpretable and modular method that efficiently integrates dual representations for entities and relations. RelatE employs a real-valued phase-modulus decomposition, leveraging sinusoidal phase alignments to encode relational patterns such as symmetry, inversion, and composition. In contrast to recent approaches based on complex-valued embeddings or deep neural architectures, RelatE preserves architectural simplicity while achieving competitive or superior performance on standard benchmarks. Empirically, RelatE outperforms prior methods across several datasets: on YAGO3-10, it achieves an MRR of 0.521 and Hit@10 of 0.680, surpassing all baselines. Additionally, RelatE offers significant efficiency gains, reducing training time by 24%, inference latency by 31%, and peak GPU memory usage by 22% compared to RotatE. Perturbation studies demonstrate improved robustness, with MRR degradation reduced by up to 61% relative to TransE and by up to 19% compared to RotatE under structural edits such as edge removals and relation swaps. Formal analysis further establishes the model’s full expressiveness and its capacity to represent essential first-order logical inference patterns. These results position RelatE as a scalable and interpretable alternative to more complex architectures for knowledge graph completion.
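The abstract reports MRR and Hit@10 without defining them. As a reading aid, the sketch below computes both under their standard definitions from a list of 1-based ranks of the correct entity. The function names and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test triples.
example_ranks = [1, 2, 4, 20]
example_mrr = mean_reciprocal_rank(example_ranks)   # (1 + 1/2 + 1/4 + 1/20) / 4
example_hits = hits_at_k(example_ranks, k=10)       # 3 of 4 ranks are <= 10
print(example_mrr, example_hits)
```

Read this way, an MRR of 0.521 means the average reciprocal rank of the correct entity is roughly one half, and a Hit@10 of 0.680 means the correct entity is in the top ten for 68% of test triples.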
Article 1197
Title@2025-05-25 (7): Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study
Title: Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study | Untersuchung der Inferenzzeitskalierung für die Kette multimodaler Gedanken: Eine Vorstudie | 多式联运思维链调查推理-时间尺度:初步研究 2502.11514v2 |
Authors: Yujie Lin, Ante Wang, Moye Chen, Jingyao Liu, Hao Liu, Jinsong Su, Xinyan Xiao
Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. In addition, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought outperforms conventional text-only thought, and that blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thoughts necessitate higher token consumption for processing richer visual inputs, which raises concerns in practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future work in the field.
Article 1198
Title@2025-05-25 (7): MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models
Title: MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models | MoLAE: Mischung aus latenten Experten für Parameter-Effiziente Sprachmodelle | MoLAE:参数有效语言模型原始专家混合 2503.23100v2 |
Authors: Zehua Liu, Han Wu, Ruifeng She, Xiaojin Fu, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into MoLAE architecture, characterizing conditions for optimal factorization, and developing a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
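To make the factorization idea concrete, here is a minimal sketch that assumes each expert matrix of shape (d_inter, d_hidden) is replaced by one shared down-projection of shape (d_latent, d_hidden) followed by an expert-specific matrix of shape (d_inter, d_latent). These shapes, the names, and the toy dimensions are illustrative assumptions. The abstract does not give the paper's exact parameterization, so this should not be read as the authors' implementation.

```python
import numpy as np


def standard_expert_params(n_experts: int, d_hidden: int, d_inter: int) -> int:
    """Parameters when every expert owns a full (d_inter x d_hidden) matrix."""
    return n_experts * d_inter * d_hidden


def latent_expert_params(n_experts: int, d_hidden: int, d_inter: int, d_latent: int) -> int:
    """Parameters with one shared (d_latent x d_hidden) projection plus
    one (d_inter x d_latent) matrix per expert (assumed layout)."""
    return d_latent * d_hidden + n_experts * d_inter * d_latent


# Toy sizes, chosen only so that d_hidden is much larger than d_inter.
n_experts, d_hidden, d_inter, d_latent = 8, 64, 16, 16
rng = np.random.default_rng(0)
shared_down = rng.standard_normal((d_latent, d_hidden))          # shared by all experts
expert_up = rng.standard_normal((n_experts, d_inter, d_latent))  # one matrix per expert


def latent_expert_forward(x: np.ndarray, expert_idx: int) -> np.ndarray:
    """Project into the shared latent space, then apply the chosen expert."""
    z = shared_down @ x
    return expert_up[expert_idx] @ z


y = latent_expert_forward(rng.standard_normal(d_hidden), expert_idx=3)
print(y.shape)
print(standard_expert_params(n_experts, d_hidden, d_inter))
print(latent_expert_params(n_experts, d_hidden, d_inter, d_latent))
```

Under these assumed shapes the saving grows with the ratio of hidden to intermediate dimension, which is consistent with the abstract's remark about models whose hidden dimensions significantly exceed the MoE intermediate dimensions.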
Article 1199
Title@2025-05-25 (7): BriLLM: Brain-inspired Large Language Model
Title: BriLLM: Brain-inspired Large Language Model | BriLLM: Gehirninspiriertes Large Language Model | BrILLM: 脑启发型大语言模式 2503.11299v4 |
Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong
This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of “least resistance” along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models when the model size is independent of the input and predicted length of the model. The model’s working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.
Article 1200
Title@2025-05-25 (7): Learning to Explain: Prototype-Based Surrogate Models for LLM Classification
Title: Learning to Explain: Prototype-Based Surrogate Models for LLM Classification | Erklären lernen: Prototypenbasierte Surrogate-Modelle für die LLM-Klassifikation | 学习解释:LLM分类原型代用模型 2505.18970v1 |
Authors: Bowen Wei, Ziwei Zhu
Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model’s reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose ProtoSurE, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.
Article 1201
Title@2025-05-25 (7): Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning
Title: Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning | Nicht alle Gedanken werden gleich erzeugt: Effizientes LLM-Reasoning durch Multi-Turn-Verstärkung-Lernen | 并非所有思想都产生平等:通过多发强化学习提高学习水平的效率LLM 2505.11827v2 |
Authors: Yansong Ning, Wei Li, Jun Fang, Naiqiang Tan, Hao Liu
Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies compress all thoughts within a long CoT equally, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon these insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long$\otimes$Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM for more effectively generating important thoughts, and a short-thought LLM for efficiently generating the remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergizing-oriented multi-turn reinforcement learning method, focusing on model self-evolution and collaboration between long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at https://github.com/usail-hkust/LongShort.
Article 1202
Title@2025-05-25 (7): SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms
Title: SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms | SynapticRAG: Verbesserung des Temporalen Gedächtnisses in großen Sprachmodellen durch synaptische Mechanismen | 辛亚蒂克拉戈:通过辛亚机制加强大语言模型中的时间内存检索 2410.13553v2 |
Authors: Yuki Hou, Haruki Tamoto, Qinghua Zhao, Homei Miyashita
Existing retrieval methods in Large Language Models show degradation in accuracy when handling temporally distributed conversations, primarily due to their reliance on simple similarity-based retrieval. Unlike existing memory retrieval methods that rely solely on semantic similarity, we propose SynapticRAG, which uniquely combines temporal association triggers with biologically inspired synaptic propagation mechanisms. Our approach uses temporal association triggers and synaptic-like stimulus propagation to identify relevant dialogue histories. A dynamic leaky integrate-and-fire mechanism then selects the most contextually appropriate memories. Experiments on four datasets in English, Chinese, and Japanese show that, compared to state-of-the-art memory retrieval methods, SynapticRAG achieves consistent improvements across multiple metrics of up to 14.66 percentage points. This work bridges the gap between cognitive science and language model development, providing a new framework for memory management in conversational systems.
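For readers unfamiliar with the term, the sketch below shows a generic discrete-time leaky integrate-and-fire unit and a toy way to use it for selecting memories. The leak factor, threshold, reset rule, and stimulus values are hypothetical. The abstract does not specify SynapticRAG's dynamic variant, so this illustrates only the textbook mechanism, not the paper's model.

```python
from typing import Dict, List, Sequence


def leaky_integrate_and_fire(stimuli: Sequence[float],
                             leak: float = 0.5,
                             threshold: float = 1.0) -> List[int]:
    """Return the time steps at which the unit fires.

    At each step the potential decays by `leak`, the new stimulus is added,
    and the unit fires (then resets to zero) once the threshold is reached.
    """
    potential = 0.0
    fired_at = []
    for step, stimulus in enumerate(stimuli):
        potential = leak * potential + stimulus
        if potential >= threshold:
            fired_at.append(step)
            potential = 0.0
    return fired_at


def select_memories(stimuli_by_memory: Dict[str, Sequence[float]],
                    leak: float = 0.5,
                    threshold: float = 1.0) -> List[str]:
    """Keep only the memories whose unit fired at least once."""
    return [memory_id for memory_id, stimuli in stimuli_by_memory.items()
            if leaky_integrate_and_fire(stimuli, leak, threshold)]


# Hypothetical stimulus traces for two stored memories.
traces = {"m1": [0.6, 0.6, 0.1, 0.9, 0.2], "m2": [0.2, 0.2, 0.2]}
print(select_memories(traces))
```

The leak is what makes selection time-sensitive: weak stimuli spread far apart decay before they can accumulate, whereas a strong or closely spaced sequence pushes the potential over the threshold.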
Article 1203
Title@2025-05-25 (7): Expansion Span: Combining Fading Memory and Retrieval in Hybrid State Space Models
Title: Expansion Span: Combining Fading Memory and Retrieval in Hybrid State Space Models | Expansion Span: Kombinieren von Fading Memory und Retrieval in Hybrid State Space Models | 扩展空间:在混合国家空间模型中将平缓内存和检索合并 2412.13328v2 |
Authors: Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, Stefano Soatto
The “state” of State Space Models (SSMs) represents their memory, which fades exponentially over an unbounded span. By contrast, Attention-based models have “eidetic” (i.e., verbatim, or photographic) memory over a finite span (context size). Hybrid architectures combine State Space layers with Attention, but still cannot recall the distant past and can access only the most recent tokens eidetically. Unlike current methods of combining SSM and Attention layers, we allow the state to be allocated based on relevancy rather than recency. In this way, for every new set of query tokens, our models can “eidetically” access tokens from beyond the Attention span of current Hybrid SSMs without requiring extra hardware resources. We introduce a method to expand the memory span of the hybrid state by “reserving” a fraction of the Attention context for tokens retrieved from arbitrarily far in the past, thus expanding the eidetic memory span of the overall state. We call this reserved fraction of tokens the “expansion span,” and the mechanism to retrieve and aggregate it “Span-Expanded Attention” (SE-Attn). To adapt Hybrid models to using SE-Attn, we propose a novel fine-tuning method that extends LoRA to Hybrid models (HyLoRA) and allows efficient adaptation on long spans of tokens. We show that SE-Attn enables us to efficiently adapt pre-trained Hybrid models on sequences of tokens up to 8 times longer than the ones used for pre-training. We show that HyLoRA with SE-Attn is cheaper and more performant than alternatives like LongLoRA when applied to Hybrid models on natural language benchmarks with long-range dependencies, such as PG-19, RULER, and other common natural language downstream tasks.
Article 1204
Title@2025-05-25 (7): GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples
Title: GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples | GraphemeAug: Ein systematischer Ansatz zur Synthese von schwer negativen Keyword-Spotting-Beispielen | GraphemeAug:以系统方法合成硬负负关键词 2505.14814v2 |
Authors: Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang
Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword’s graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.
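The insertion/deletion/substitution edits described above can be sketched as follows. This is a minimal illustration that treats each lowercase letter as one grapheme and enumerates every single-edit variant. The function name, the alphabet, and the example keyword are assumptions. How the paper selects variants and turns them into audio is not stated in the abstract and is not covered here.

```python
import string
from typing import List


def single_edit_variants(keyword: str,
                         alphabet: str = string.ascii_lowercase) -> List[str]:
    """All spellings one insertion, deletion, or substitution away from `keyword`."""
    variants = set()
    # Insertions: place any letter at any position, including both ends.
    for i in range(len(keyword) + 1):
        for ch in alphabet:
            variants.add(keyword[:i] + ch + keyword[i:])
    for i in range(len(keyword)):
        # Deletion: drop the letter at position i.
        variants.add(keyword[:i] + keyword[i + 1:])
        # Substitutions: replace the letter at position i with a different one.
        for ch in alphabet:
            if ch != keyword[i]:
                variants.add(keyword[:i] + ch + keyword[i + 1:])
    variants.discard(keyword)  # the keyword itself is not a negative example
    return sorted(variants)


# Hypothetical keyword used only for illustration.
candidates = single_edit_variants("ok")
print(len(candidates), candidates[:5])
```

Each variant is a candidate hard negative: a spelling close enough to the keyword that the resulting audio is likely to sit near the decision boundary.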
Article 1205
Title@2025-05-25 (7): Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?
Title: Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk? | KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig? | 评估大赦国际的融资:AI在评估投资风险方面是否可信? 2505.18953v1 |
Authors: Divij Chawla, Ashita Bhutada, Do Duc Anh, Abhinav Raghunathan, Vinod SP, Cathy Guo, Dar Win Liew, Prannaya Gupta, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
We evaluate the credibility of leading AI models in assessing investment risk appetite. Our analysis spans proprietary (GPT-4, Claude 3.7, Gemini 1.5) and open-weight models (LLaMA 3.1/3.3, DeepSeek-V3, Mistral-small), using 1,720 user profiles constructed with 16 risk-relevant features across 10 countries and both genders. We observe significant variance across models in score distributions and demographic sensitivity. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles, while LLaMA and DeepSeek show opposite gender tendencies in risk classification. While some models (e.g., GPT-4o, LLaMA 3.1) align closely with expected scores in low- and mid-risk ranges, none maintain consistent performance across regions and demographics. Our findings highlight the need for rigorous, standardized evaluations of AI systems in regulated financial contexts to prevent bias, opacity, and inconsistency in real-world deployment.
Article 1206
Title@2025-05-25 (7): BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | BnMMLU: Maßgebendes Multitasking-Sprachverständnis in Bengalen messen | BnMMLU:用孟加拉语衡量大规模多任务语言理解 2505.18951v1 |
Authors: Saman Sarker Joy
The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the Bengali multitask language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge, and is structured in a multiple-choice format to assess the factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
Article 1207
Title@2025-05-25 (7): The Price of Format: Diversity Collapse in LLMs
Title: The Price of Format: Diversity Collapse in LLMs | Der Preis des Formats: Diversity Collapse in LLMs | 格式价格:多样化在LLMM中崩溃 2505.18949v1 |
Authors: Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang
Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model’s output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
Article 1208
Title@2025-05-25 (7): Veracity Bias and Beyond: Uncovering LLMs’ Hidden Beliefs in Problem-Solving Reasoning
Title: Veracity Bias and Beyond: Uncovering LLMs’ Hidden Beliefs in Problem-Solving Reasoning | Veracity Bias and Beyond: LLMs versteckten Glauben an Problemlösungen enthüllen | Veracity Bias 及以后:在解决问题的理由中揭穿LLMs的隐藏的信仰 2505.16128v2 |
Authors: Yue Zhou, Barbara Di Eugenio
Despite LLMs’ explicit alignment against demographic stereotypes, they have been shown to exhibit biases under various social contexts. In this work, we find that LLMs exhibit concerning biases in how they associate solution veracity with demographics. Through experiments across five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems, we reveal two forms of such veracity biases: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where models’ assessment of identical solutions varies based on perceived demographic authorship. Our results show pervasive biases: LLMs consistently attribute fewer correct solutions and more incorrect ones to African-American groups in math and coding, while Asian authorships are least preferred in writing evaluation. In additional studies, we show LLMs automatically assign racially stereotypical colors to demographic groups in visualization code, suggesting these biases are deeply embedded in models’ reasoning processes. Our findings indicate that demographic bias extends beyond surface-level stereotypes and social context provocations, raising concerns about LLMs’ deployment in educational and evaluation settings.
Article 1209
Title@2025-05-25 (7): NovelSeek: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification
Title: NovelSeek: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification | NovelSeek: Wenn Agent zum Wissenschaftler wird – das geschlossene Loop-System von der Hypothese zur Verifikation | NovellSeek:当特工成为科学家时 – – 建立从假设到核查的闭线系统 2505.16938v2 |
Authors: NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
Article 1210
Title@2025-05-25 (7): MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems | MetaMind: Modellierung menschlicher sozialer Gedanken mit Metakognitiven Multi-Agenten-Systemen | MetMind:模拟人类社会思想与代认知多机构系统 2505.18943v1 |
Authors: Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Human social interactions depend on the ability to infer others’ unspoken intentions, emotions, and beliefs, a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework’s ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.
Article 1211
Title@2025-05-25 (7): Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages
Title: Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages | Denken Sie außerhalb der Daten: Koloniale Biasen und systemische Probleme in automatisierten Moderationspipelines für Low-Resource-Sprachen | 《在数据之外思考:低资源语言自动调控管道中的殖民二进制和系统问题》 2501.13836v2 |
Authors: Farhana Shahid, Mona Elswah, Aditya Vashistha
Most social media users come from non-English speaking countries in the Global South, where much of the harmful content appears in local languages. Yet, current AI-driven moderation systems struggle with low-resource languages spoken in these regions. This work examines the systemic challenges in building automated moderation tools for these languages. We conducted semi-structured interviews with 22 AI experts working on detecting harmful content in four low-resource languages: Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America). Our findings show that beyond the well-known data scarcity in local languages, technical issues (such as outdated machine translation systems, sentiment and toxicity models grounded in Western values, and unreliable language detection technologies) undermine moderation efforts. Even with more data, current language models and preprocessing pipelines, primarily designed for English, struggle with the morphological richness, linguistic complexity, and code-mixing of these languages. As a result, automated moderation in Tamil, Swahili, Arabic, and Quechua remains fraught with inaccuracies and blind spots. Based on our findings, we argue that these limitations are not just technical gaps but reflect deeper structural inequities that continue to reproduce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve automated moderation for low-resource languages.
Article 1212
Title@2025-05-25 (7): AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Title: AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments | AgentClinic: ein Multimodal Agent Benchmark zur Bewertung von KI in simulierten klinischen Umgebungen | AgrClinicic:在模拟临床环境中评价AI的多式联运代理商基准 2405.07960v5 |
Authors: Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs’ ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvements with the notebook tool that allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.
Article 1213
Title@2025-05-25 (7): Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding?
Title: Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding? | Fließendes, aber kulturell Fernes: Kann regionales Training kulturelles Verständnis lehren? | 流利但文化疏远:区域培训能够教授文化理解吗? 2505.21548v1 |
Authors: Dhruv Agarwal, Anya Shukla, Sunayana Sitaram, Aditya Vashistha
Large language models (LLMs) are used around the world but exhibit Western cultural tendencies. To address this cultural misalignment, many countries have begun developing “regional” LLMs tailored to local communities. Yet it remains unclear whether these models merely speak the language of their users or also reflect their cultural values and practices. Using India as a case study, we evaluate five Indic and five global LLMs along two key dimensions: values (via the Inglehart-Welzel map and GlobalOpinionQA) and practices (via CulturalBench and NormAd). Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. In fact, an average American person is a better proxy for Indian cultural values than any Indic model. Even prompting strategies fail to meaningfully improve alignment. Ablations show that regional fine-tuning does not enhance cultural competence and may in fact hurt it by impeding recall of existing knowledge. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data. Our study positions cultural evaluation as a first-class requirement alongside multilingual benchmarks and offers a reusable methodology for developers. We call for deeper investments in culturally representative data to build and evaluate truly sovereign LLMs.
nan
Article 1214
Title@2025-05-25 (7): REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Title: REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing | REACT: Darstellungsextraktion und kontrollierbares Tuning zur Überwindung der Überlastung in LLM-Wissensbearbeitung | REACT: 在LLM知识编辑中,通过代表提取和控制可控的提款以克服超额配置 2505.18933v1 |
Authors: Haitian Zhong, Yuhuan Liu, Ziyang Xu, Guofan Liu, Qiang Liu, Shu Wu, Zhe Zhao, Liang Wang, Tieniu Tan
Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it’s contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional “belief shift” vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
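The second phase described above (a shift vector scaled by a magnitude and gated by a classifier) can be pictured with a minimal sketch. The function name, the 0.5 threshold, and the plain-list representation of hidden states are all assumptions made for illustration; the abstract does not specify how REACT implements the gate.

```python
from typing import List


def apply_gated_shift(hidden: List[float], shift: List[float],
                      magnitude: float, gate_prob: float,
                      threshold: float = 0.5) -> List[float]:
    """Return hidden + magnitude * shift when the gate fires, else hidden unchanged.

    gate_prob stands in for the output of a pre-trained classifier
    (hypothetical here); threshold is an assumed decision cutoff.
    """
    if len(hidden) != len(shift):
        raise ValueError("hidden and shift must have the same length")
    if gate_prob < threshold:
        # Gate closed: the edit is judged contextually unnecessary.
        return list(hidden)
    # Gate open: move the hidden state along the shift direction.
    return [h + magnitude * s for h, s in zip(hidden, shift)]


edited = apply_gated_shift([1.0, 2.0], [0.5, -0.5], magnitude=2.0, gate_prob=0.9)
untouched = apply_gated_shift([1.0, 2.0], [0.5, -0.5], magnitude=2.0, gate_prob=0.1)
```

Keeping the gate outside the arithmetic means an unrelated context leaves the hidden state untouched, which is the property the abstract ties to reduced overfitting.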
nan
Article 1215
Title@2025-05-25 (7): Can Large Language Models Infer Causal Relationships from Real-World Text?
Title: Can Large Language Models Infer Causal Relationships from Real-World Text? | Können große Sprachmodelle Kausalbeziehungen aus Real-World Text ableiten? | 大语言模型能否从真实世界文本中推断出因果关系? 2505.18931v1 |
Authors: Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah
Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts which involve simple causal relationships explicitly mentioned in the text. This fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes diverse texts with respect to length, complexity of relationships (different levels of explicitness, number of events, and causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on state-of-the-art LLMs evaluated on our proposed benchmark demonstrate significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty with implicitly stated information, with distinguishing relevant causal factors from surrounding contextual details, and with connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.
nan
Article 1216
Title@2025-05-25 (7): Meta-aware Learning in text-to-SQL Large Language Model
Title: Meta-aware Learning in text-to-SQL Large Language Model | Meta-aware Lernen im Text-zu-SQL-Großsprache-Modell | 以文本到SQL大语言模式进行多读学习 2505.18929v1 |
Authors: Wenda Zhang
Advances in large language models (LLMs) have created opportunities for text-to-SQL tasks to overcome their main challenges: understanding complex domain information and complex database structures in business applications. In this paper, we propose a meta-aware learning framework to integrate domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve the SQL generation quality. The proposed framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. Through fine-tuning, this approach gives the LLM a comprehensive understanding of database structure and metadata information, improving its performance on SQL generation within business domains. Through two experimental studies, we demonstrate the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.
nan
Article 1217
Title@2025-05-25 (7): iAgent: LLM Agent as a Shield between User and Recommender Systems
Title: iAgent: LLM Agent as a Shield between User and Recommender Systems | iAgent: LLM Agent als Shield zwischen Anwender- und Recommender-Systemen | iAgendy:LLM代理作为用户与建议系统之间的盾牌 2502.14662v3 |
Authors: Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform’s recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are designed with commercial objectives in mind, focusing on the platform’s benefits, which may hinder their ability to protect and capture users’ true interests. Second, these models are typically optimized using data from all users, which may overlook individual users’ preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where an agent serves as the protective shield between user and recommender system that enables indirect exposure.
nan
Article 1218
Title@2025-05-25 (7): SCRum-9: Multilingual Stance Classification over Rumours on Social Media
Title: SCRum-9: Multilingual Stance Classification over Rumours on Social Media | SCRum-9: Mehrsprachige Stance-Klassifizierung über Gerüchte in sozialen Medien | SCRUM-9:社会媒体多语言流闻的多语言分级 2505.18916v1 |
Authors: Yue Li, Jake Vasilakes, Zhixue Zhao, Carolina Scarton
We introduce SCRum-9, a multilingual dataset for Rumour Stance Classification, containing 7,516 tweet-reply pairs from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages (9), linking examples to more fact-checked claims (2.1k), and including complex annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least three native speakers per language, totalling around 405 hours of annotation and 8,150 dollars in compensation. Experiments on SCRum-9 show that it is a challenging benchmark for both state-of-the-art LLMs (e.g. Deepseek) as well as fine-tuned pre-trained models, motivating future work in this area.
nan
Article 1219
Title@2025-05-25 (7): Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
Title: Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach | Multimodale LLMs unter Verteilungsverschiebungen verstehen: Ein informationstheoretischer Ansatz | 在分销变更下理解多式LLMs:信息理论方法 2502.00577v2 |
Authors: Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
nan
Article 1220
Title@2025-05-24 (6): Federated Retrieval-Augmented Generation: A Systematic Mapping Study
Title: Federated Retrieval-Augmented Generation: A Systematic Mapping Study | Federated Retrieval-Augmented Generation: Eine systematische Mapping-Studie | 联邦回收回源代:系统绘图研究 2505.18906v1 |
Authors: Abhijit Chakraborty, Chahana Dahal, Vivek Gupta
Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham’s guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.
nan
Article 1221
Title@2025-05-24 (6): Building a Functional Machine Translation Corpus for Kpelle
Title: Building a Functional Machine Translation Corpus for Kpelle | Aufbau eines funktionalen Übersetzungskorpus für Kpelle | 为Kpelle建立功能机器翻译公司 2505.18905v1 |
Authors: Kweku Andoh Yamoah, Jackson Weako, Emmanuel J. Dorley
In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta’s No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle’s potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.
nan
Article 1222
Title@2025-05-24 (6): Algorithmic Language Models with Neurally Compiled Libraries
Title: Algorithmic Language Models with Neurally Compiled Libraries | Algorithmische Sprachmodelle mit neurally compiled Bibliotheken | 具有神经编译图书馆的算法语言模型 2407.04899v2 |
Authors: Lucas Saldyt, Subbarao Kambhampati
Important tasks such as reasoning and planning are fundamentally algorithmic, meaning that solving them robustly requires acquiring true reasoning or planning algorithms, rather than shortcuts. Large Language Models lack true algorithmic ability primarily because of the limitations of neural network optimization algorithms, their optimization data and optimization objective, but also due to architectural inexpressivity. To solve this, our paper proposes augmenting LLMs with a library of fundamental operations and sophisticated differentiable programs, so that common algorithms do not need to be learned from scratch. We add memory, registers, basic operations, and adaptive recurrence to a transformer architecture built on LLaMA3. Then, we define a method for directly compiling algorithms into a differentiable starting library, which is used natively and propagates gradients for optimization. In this preliminary study, we explore the feasibility of augmenting LLaMA3 with a differentiable computer, for instance by fine-tuning small transformers on simple algorithmic tasks with variable computational depth.
nan
Article 1223
Title@2025-05-24 (6): StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos
Title: StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos | StandUp4AI: Ein neuer multilingualer Datensatz für Humorerkennung in Stand-up Comedy Videos | StandUp4AI:一套新的多语种数据集,用于在跳跳喜剧视频中探测湿度 2505.18903v1 |
Authors: Valentin Barriere, Nahuel Gomez, Leo Hemamou, Sofia Callejas, Brian Ravenet
Aiming to improve current computational models of humor detection, we propose a new multimodal dataset of stand-up comedies in seven languages: English, French, Spanish, Italian, Portuguese, Hungarian and Czech. Our dataset of more than 330 hours is, at the time of writing, the biggest available for this type of task, and the most diverse. The whole dataset is automatically annotated in laughter (from the audience), and the subpart left for model validation is manually annotated. Contrary to contemporary approaches, we do not frame the task of humor detection as a binary sequence classification, but as word-level sequence labeling, in order to take into account all the context of the sequence and to capture the continuous joke tagging mechanism typically occurring in natural conversations. Along with unimodal baseline results, we propose a method to enhance the automatic laughter detection based on Audio Speech Recognition errors. Our code and data are available online: https://tinyurl.com/EMNLPHumourStandUpPublic
nan
Article 1224
Title@2025-05-24 (6): Do LLMs have a Gender (Entropy) Bias?
Title: Do LLMs have a Gender (Entropy) Bias? | Haben LLMs ein Gender (Entropie) Bias? | LLMs是否有性别(Entropy)偏见? 2505.20343v1 |
Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias: a discrepancy in the amount of information generated by an LLM in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively by using ChatGPT-4o (as “LLM-as-judge”). Our analyses (metric-based comparisons and “LLM-as-judge” evaluation) suggest that there is no significant bias in LLM responses for men and women at a category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which often “cancel” each other out because some responses are better for males and others are better for females. This is still a concern since typical users of these tools often ask a specific question (only) as opposed to several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.
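To make the idea of an information discrepancy concrete, here is a minimal sketch that compares two responses by the Shannon entropy of their word distributions. This particular measure, the whitespace tokenization, and the function names are assumptions for illustration only; the abstract does not state which metrics the authors use.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # Sum of -p * log2(p) over distinct words.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def entropy_gap(response_a: str, response_b: str) -> float:
    """Positive when response_a has the more varied wording (assumed proxy)."""
    return word_entropy(response_a) - word_entropy(response_b)


# Hypothetical responses to the same question posed for two genders.
gap = entropy_gap("save invest budget plan", "save save save save")
```

A per-question gap like this can be large in either direction while averaging close to zero over a category, which is the cancellation pattern the abstract describes.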
nan
Article 1225
Title@2025-05-24 (6): Vague Knowledge: Evidence from Analyst Reports
Title: Vague Knowledge: Evidence from Analyst Reports | Vague Knowledge: Beweise aus Analystenberichten | 知识模糊:分析报告提供的证据 2505.12269v3 |
Authors: Kerry Xiao, Amy Zang
People in the real world often possess vague knowledge of future payoffs, for which quantification is not feasible or desirable. We argue that language, with differing ability to convey vague information, plays an important but less-known role in representing subjective expectations. Empirically, we find that in their reports, analysts include useful information in linguistic expressions but not numerical forecasts. Specifically, the textual tone of analyst reports has predictive power for forecast errors and subsequent revisions in numerical forecasts, and this relation becomes stronger when analysts’ language is vaguer, when uncertainty is higher, and when analysts are busier. Overall, our theory and evidence suggest that some useful information is vaguely known and only communicated through language.
nan
Article 1226
Title@2025-05-24 (6): Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Title: Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding | Warum Vision Language Models mit visueller Arithmetik kollidieren? Auf dem Weg zu einem verbesserten Chart und Geometrie-Verständnis | 为什么愿景语言模型与视觉自算斗争? 争取强化图表和几何理解 2502.11492v3 |
Authors: Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison, which are essential for relevant complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget’s theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
nan
Article 1227
Title@2025-05-24 (6): CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | CRMArena-Pro: Ganzheitliche Bewertung von LLM-Agenten über unterschiedliche Geschäftsszenarien und Interaktionen | CRMARENA-Pro: 不同业务情景和相互作用的LLM代理机构综合评估 2505.18878v1 |
Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and ‘configure, price, and quote’ processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
nan
Article 1228
Title@2025-05-24 (6): Evaluating Step-by-step Reasoning Traces: A Survey
Title: Evaluating Step-by-step Reasoning Traces: A Survey | Bewertung Schritt-für-Schritt-Reasoning-Traces: Eine Umfrage | 评价逐步说明理由的追踪:调查 2502.12289v2 |
Authors: Jinu Lee, Julia Hockenmaier
Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different evaluator implementations and recent findings, leading to promising directions for future research.
nan
Article 1229
Title@2025-05-24 (6): Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing
Title: Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing | Sci-LoRA: Mischung aus wissenschaftlichen LoRAs für Cross-Domain Lay Paraphrasing | Sci-LORA:将科学LORA混合起来,用于跨域地谱图谱绘制 2505.18867v1 |
Authors: Ming Cheng, Jiaying Gong, Hoda Eldardiry
Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.
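The dynamic weighting of several LoRAs described above can be sketched as a softmax gate over adapters followed by a weighted sum of their outputs. The gate scores, the function names, and the use of a softmax are assumptions chosen for illustration; the abstract does not specify how Sci-LoRA computes or applies its weights.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_adapter_outputs(gate_scores: List[float],
                        adapter_outputs: List[List[float]]) -> List[float]:
    """Weighted sum of per-domain adapter outputs.

    gate_scores are hypothetical input-dependent scores, one per adapter;
    adapter_outputs holds one output vector per adapter, all the same length.
    """
    if len(gate_scores) != len(adapter_outputs):
        raise ValueError("need one gate score per adapter")
    weights = softmax(gate_scores)
    dim = len(adapter_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, adapter_outputs))
            for i in range(dim)]


# Two hypothetical domain adapters with equal gate scores contribute equally.
mixed = mix_adapter_outputs([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the scores would come from the input text itself, no explicit domain label is needed at inference time, matching the behaviour the abstract claims.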
nan
Article 1230
Title@2025-05-24 (6): Token Sampling Uncertainty Does Not Explain Homogeneity Bias in Large Language Models
Title: Token Sampling Uncertainty Does Not Explain Homogeneity Bias in Large Language Models | Token Sampling Uncertainty erklärt Homogenität Bias nicht in großen Sprachmodellen | 在大语言模型中抽样抽样的不确定性不能解释同性比重 2501.19337v2 |
Authors: Messi H. J. Lee, Soyeon Jeon
Homogeneity bias is one form of stereotyping in AI models where certain groups are represented as more similar to each other than other groups. This bias is a major obstacle to creating equitable language technologies. We test whether the bias is driven by systematic differences in token-sampling uncertainty across six large language models. While we observe the presence of homogeneity bias using sentence similarity, we find very little difference in token sampling uncertainty across groups. This finding elucidates why temperature-based sampling adjustments fail to mitigate homogeneity bias. It suggests researchers should prioritize interventions targeting representation learning mechanisms and training corpus composition rather than inference-time output manipulations.
nan
Article 1231
Title@2025-05-24 (6): Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework
Title: Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework | Audio Jailbreak Attacks: Aufdecken von Schwachstellen in SpeechGPT in einem White-Box-Framework | 音频破室袭击:在白箱框架内揭露语音中的弱点GPPT 2505.18864v1 |
Authors: Binhao Ma, Hanqing Guo, Zhengping Jay Luo, Rui Duan
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model’s speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.
nan
Article 1232
Title@2025-05-24 (6): Writing Like the Best: Exemplar-Based Expository Text Generation
Title: Writing Like the Best: Exemplar-Based Expository Text Generation | Schreiben wie das Beste: exemplar-based expository text generation | 写作像最佳的:基于实例的展示性文本生成 2505.18859v1 |
Authors: Yuxiang Liu, Kevin Chen-Chuan Chang
We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics (imitativeness, adaptiveness, and adaptive-imitativeness) using LLMs as evaluators. Experimental results across three diverse datasets that we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.
nan
Article 1233
Title@2025-05-24 (6): Large Language Models based ASR Error Correction for Child Conversations
Title: Large Language Models based ASR Error Correction for Child Conversations | Große Sprachmodelle basierende ASR-Fehlerkorrektur für Kindergespräche | 基于大语言模型的ASR大语言模型 2505.16212v2 |
Authors: Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children’s speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children’s conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
nan
Article 1234
Title@2025-05-24 (6): USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations
Title: USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations | USDC: Ein Datensatz von $\underline{U}$ser $\underline{S}$tance und $\underline{D}$ogmatism in langen $\underline{C}$onversations | USCC: 以 $\ underline{U}$ser $\ underline{S}$tance 和 $\ underline{D}$ogmatism 的数据集, 以 Long $\ underline{C} 美元对数值 2406.16833v2 |
Authors: Mounika Marreddy, Subba Reddy Oota, Venkata Charan Chinni, Manish Gupta, Lucie Flek
Analyzing user opinion changes in long conversation threads is extremely critical for applications like enhanced personalization, market research, political campaigns, customer service, targeted advertising, and content moderation. Unfortunately, previous studies on stance and dogmatism in user conversations have focused on training models using datasets annotated at the post level, treating each post as independent and randomly sampling posts from conversation threads. Hence, first, we build a dataset for studying user opinion fluctuations in 764 long multi-user Reddit conversation threads, called USDC. USDC contains annotations for 2 tasks: i) User Stance classification, which involves labeling a user’s stance in a post within a conversation on a five-point scale; ii) User Dogmatism classification, which involves labeling a user’s overall opinion in the conversation on a four-point scale. Besides being time-consuming and costly, manual annotations for USDC are challenging because: 1) Conversation threads could be very long, increasing the chances of noisy annotations; and 2) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Hence, we leverage majority voting on zero-shot, one-shot, and few-shot annotations from Mistral Large and GPT-4 to automate the annotation process. Human annotations on 200 test conversations achieved inter-annotator agreement scores of 0.49 for stance and 0.50 for dogmatism with these LLM annotations, indicating a reasonable level of consistency between human and LLM annotations. USDC is then used to finetune and instruction-tune multiple deployable small language models like LLaMA, Falcon and Vicuna for the stance and dogmatism classification tasks. We make the code and dataset publicly available [https://github.com/mounikamarreddy/USDC].
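The majority-voting step over LLM annotations mentioned above can be sketched in a few lines. The label strings and the tie rule (returning `None` so the item can be reviewed separately) are assumptions for illustration; the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def majority_label(labels: List[str]) -> Optional[str]:
    """Return the most frequent label, or None when empty or tied for first."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        # Assumed tie rule: no clear majority, defer the decision.
        return None
    return top_label


# Six hypothetical annotations: two models, each in zero-, one-, and few-shot mode.
votes = ["in_favor", "in_favor", "against", "in_favor", "neutral", "in_favor"]
final = majority_label(votes)
```

With six votes per item (two models times three prompting modes), ties are possible, so an explicit tie rule of some kind is needed in practice.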
Article 1235
Title@2025-05-24 (6): Inference Compute-Optimal Video Vision Language Models
Title: Inference Compute-Optimal Video Vision Language Models | Schlussfolgerung Compute-Optimal Video Vision Language Models | 计算视频视觉语言模型 2505.18855v1 |
Authors: Peiqi Wang, ShengYun Peng, Xuewen Zhang, Hanchao Yu, Yibo Yang, Lifu Huang, Fujun Liu, Qifan Wang
This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
Article 1236
Title@2025-05-24 (6): Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Title: Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation | Smoothie: Glättende Diffusion auf Token-Embeddings für Textgenerierung | 滑滑: 平滑的文本生成时用 Token 嵌入嵌入嵌入器进行传播 2505.18853v1 |
Authors: Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.
Article 1237
Title@2025-05-24 (6): On the Limit of Language Models as Planning Formalizers
Title: On the Limit of Language Models as Planning Formalizers | An der Grenze von Sprachmodellen als Planungsformalisatoren | 关于作为规划正规化机构的语言模式限制 2412.09879v3 |
Authors: Cassie Huang, Li Zhang
Large Language Models have been found to create plans that are neither executable nor verifiable in grounded environments. An emerging line of work demonstrates success in using the LLM as a formalizer to generate a formal representation of the planning domain in some language, such as Planning Domain Definition Language (PDDL). This formal representation can be deterministically solved to find a plan. We systematically evaluate this methodology while bridging some major gaps. While previous work only generates a partial PDDL representation given templated, and therefore unrealistic, environment descriptions, we generate the complete representation given descriptions of various naturalness levels. Among an array of observations critical to improving LLMs’ formal planning abilities, we note that most sufficiently large models can effectively formalize descriptions as PDDL, outperforming those directly generating plans, while being robust to lexical perturbation. As the descriptions become more natural-sounding, we observe a decrease in performance and provide detailed error analysis.
Article 1238
Title@2025-05-24 (6): Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
Title: Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning | Führt Reasoning Bias ein? Eine Studie über soziale Bias Evaluation und Milderung in LLM Reasoning | 是否有理由引入偏见? 社会偏见评估和减轻LLM理由研究 2502.15361v2 |
Authors: Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, Yi Fang
Recent advances in large language models (LLMs) have enabled automatic generation of chain-of-thought (CoT) reasoning, leading to strong performance on tasks such as math and code. However, when reasoning steps reflect social stereotypes (e.g., those related to gender, race or age), they can reinforce harmful associations and lead to misleading conclusions. We present the first systematic evaluation of social bias within LLM-generated reasoning, using the BBQ dataset to analyze both prediction accuracy and bias. Our study spans a wide range of mainstream reasoning models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 (8B/32B), ChatGPT, and other open-source LLMs. We quantify how biased reasoning steps correlate with incorrect predictions and often lead to stereotype expression. To mitigate reasoning-induced bias, we propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps. ADBP outperforms a stereotype-free baseline in most cases, mitigating bias and improving the accuracy of LLM outputs. Code will be released upon paper acceptance.
Article 1239
Title@2025-05-24 (6): Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework
Title: Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework | Signal, Bild oder Symbolisch: Die beste Eingangsdarstellung für Elektrokardiogramm-Sprachenmodelle durch ein einheitliches Framework erkunden | 信号、图像或符号:通过统一框架探索电动心电图语言模型的最佳输入代表 2505.18847v1 |
Authors: William Han, Chaojing Duan, Zhepeng Cen, Yihang Yao, Xiaoyu Song, Atharva Mhaskar, Dylan Leong, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, analyzing waveform morphology, identifying contributing factors, and proposing patient-specific action plans. To realize this potential, researchers are curating instruction-tuning datasets that pair ECGs with textual dialogues and are training ELMs on these resources. Yet before scaling ELMs further, a fundamental question remains to be explored: What is the most effective ECG input representation? In recent works, three candidate representations have emerged: raw time-series signals, rendered images, and discretized symbolic sequences. We present the first comprehensive benchmark of these modalities across 6 public datasets and 5 evaluation metrics. We find symbolic representations achieve the greatest number of statistically significant wins over both signal and image inputs. We further ablate the LLM backbone, ECG duration, and token budget, and we evaluate robustness to signal perturbations. We hope that our findings offer clear guidance for selecting input representations when developing the next generation of ELMs.
Article 1240
Title@2025-05-24 (6): Multi-Party Conversational Agents: A Survey
Title: Multi-Party Conversational Agents: A Survey | Multi-Parteien-Gesprächsagenten: Eine Umfrage | 多党对话代表:调查 2505.18845v1 |
Authors: Sagar Sapkota, Mohammad Saqib Hasan, Mubarak Shah, Santu Karmaker
Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, MPCAs face additional design challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participant’s mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.
Article 1241
Title@2025-05-24 (6): Don’t Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Title: Don’t Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | Nicht nur einmal suchen: Auf dem Weg zu multimodaler interaktiver Reasonierung mit selektiver visueller Revisitation | 不要只看一次: 走向多模式互动理性, 选择性视觉再审视 2505.18842v1 |
Authors: Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model’s evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks – MathVista, MathVision, and MathVerse – demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
Article 1242
Title@2025-05-24 (6): Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization
Title: Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization | Identifikation von legalen Holdings mit LLMs: Eine systematische Studie über Leistung, Maßstab und Erinnerung | 确定拥有LLM女士的法律控股:系统研究业绩、规模和记忆 2505.02172v3 |
Authors: Chuck Arvin
As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate scaling effects - performance on this task improves with model size, with more capable models like GPT4o and AmazonNovaPro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.
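For readers unfamiliar with the reported metric, the following is a minimal sketch of how macro F1 is computed: the per-class F1 scores are averaged with equal weight. The integer labels stand in for hypothetical multiple-choice answer indices; this is an illustration of the metric, not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted answer indices for four questions.
gold = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
score = macro_f1(gold, pred)
```

Because every class contributes equally, macro F1 penalizes a model that performs well only on the most frequent answer position.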
Article 1243
Title@2025-05-24 (6): On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Title: On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization | Auf die Wirkung des negativen Gradienten in der Gruppe Relative Tiefenverstärkung Optimierung | 对群体相对深强化优化中的负梯度效应的影响 2505.18830v1 |
Authors: Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO’s widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses increases only marginally or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO’s learning dynamics, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO’s group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
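The "same strength" penalty described above can be made concrete with a minimal sketch of the group-relative advantage as GRPO is commonly described: each response's reward is normalized against the other responses sampled for the same prompt, and every token in a response then shares that single value. The use of the population standard deviation, the `eps` constant, the binary rewards, and the token counts are assumptions for illustration; this shows the baseline behavior the abstract criticizes, not NTHR itself.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Every token in a response shares that response's advantage, so all tokens of an
# incorrect response receive the same negative weight regardless of their content.
token_counts = [3, 2, 4, 3]  # hypothetical response lengths
per_token = [[adv] * n for adv, n in zip(advantages, token_counts)]
```

In this sketch the two incorrect responses get an advantage close to -1 on every token, including tokens they may share with correct responses, which is the situation in which a token-level reweighting such as the one the abstract proposes would apply.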
Article 1244
Title@2025-05-24 (6): Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance
Title: Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance | Vision trifft auf Sprache: Ein RAG-gesteigertes YOLOv8-Framework für Kaffeekrankheitsdiagnose und Farmer Assistance | 语言:一个RAG-AG-AG-AGed YOLOv8咖啡疾病诊断和农民援助框架 2505.21544v1 |
Authors: Semanto Mondal
As social beings, we have an intimate bond with the environment. Many aspects of human life, such as lifestyle, health, and food, depend on the environment and agriculture. It is our responsibility to support both. However, traditional farming practices often result in inefficient resource use and environmental challenges. To address these issues, precision agriculture has emerged as a promising approach that leverages advanced technologies to optimise agricultural processes. In this work, a hybrid approach is proposed that combines three fields of modern AI: object detection, large language models (LLMs), and Retrieval-Augmented Generation (RAG). In this novel framework, we combine vision and language models to identify potential diseases in tree leaves. This study introduces a novel AI-based precision agriculture system that uses Retrieval-Augmented Generation (RAG) to provide context-aware diagnoses, and natural language processing (NLP) and YOLOv8 for crop disease detection. The system aims to tackle major issues with large language models (LLMs), especially hallucinations, and allows for adaptive treatment plans and real-time disease detection. It provides farmers with an easy-to-use interface for detecting diseases of coffee leaves: after a farmer submits an image of the affected leaf, the model detects the disease and suggests potential remediation methods that aim to lower pesticide use, preserve livelihoods, and encourage environmentally friendly practices. With an emphasis on scalability, dependability, and user-friendliness, the project intends to improve RAG-integrated object detection systems for wider agricultural applications in the future.
Article 1245
Title@2025-05-24 (6): AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting
Title: AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting | AdaCtrl: Auf dem Weg zur adaptiven und kontrollierbaren Begründung über Schwierigkeits-Bewusst-Budgeting | AdaCtrl:通过困难意识预算编制实现适应和控制性合理理由 2505.18822v1 |
Authors: Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, Yi R. Fung
Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase to instill the ability to self-assess difficulty and adjust the reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model’s adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty. Compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements and simultaneously reduces response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.
Article 1246
Title@2025-05-24 (6): Preference Leakage: A Contamination Problem in LLM-as-a-judge
Title: Preference Leakage: A Contamination Problem in LLM-as-a-judge | Bevorzugte Leckage: Ein Kontaminierungsproblem im LLM-as-a-Richter | 优先渗漏:LLM-作为法官的LLM中的污染问题 2502.01534v2 |
Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common types of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive, real-world problem that is harder to detect than previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.
Article 1247
Title@2025-05-24 (6): MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation
Title: MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation | MAPLE: Verbesserung der Review Generation mit Multi-Aspect Prompt Learning in erklärbarer Empfehlung | MMALE: 在可解释建议中以多角度迅速和迅速的分解方式加强审查的产生 2408.09865v2 |
Authors: Ching-Wen Yang, Zhi-Quan Feng, Ying-Jia Lin, Che-Wei Chen, Kun-da Wu, Hao Xu, Jui-Feng Yao, Hung-Yu Kao
The Explainable Recommendation task takes a user-item pair as input and outputs explanations that justify why the item is recommended to the user. Many models approach review generation as a proxy for explainable recommendations. While these models can produce fluent and grammatically correct sentences, they often lack precision and fail to provide personalized, informative recommendations. To address this issue, we propose a personalized, aspect-controlled model called Multi-Aspect Prompt LEarner (MAPLE), which integrates aspect category as another input dimension to facilitate memorizing fine-grained aspect terms. Experiments conducted on two real-world review datasets in the restaurant domain demonstrate that MAPLE significantly outperforms baseline review-generation models. MAPLE excels in both text and feature diversity, ensuring that the generated content covers a wide range of aspects. Additionally, MAPLE delivers good generation quality while maintaining strong coherence and factual relevance. The code and dataset used in this paper can be found at https://github.com/Nana2929/MAPLE.git.
Article 1248
Title@2025-05-24 (6): From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
Title: From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? | Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? | 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v1 |
Authors: Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg
Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.
Article 1249
Title@2025-05-24 (6): ReviewEval: An Evaluation Framework for AI-Generated Reviews
Title: ReviewEval: An Evaluation Framework for AI-Generated Reviews | ReviewEval: Ein Bewertungsrahmen für KI-generierte Bewertungen | E. 审评:独立审评评估框架 2502.11736v3 |
Authors: Madhav Krishan Garg, Tejash Prasad, Tanmay Singhal, Chhavi Kirtani, Murari Mandal, Dhruv Kumar
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and identifies the degree of constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self-refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop using ReviewEval to improve upon the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26%, respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
Article 1250
Title@2025-05-24 (6): A generalised editor calculus (Short Paper)
Title: A generalised editor calculus (Short Paper) | Eine generalisierte Editorrechnung (Short Paper) | 通用编辑器微积分( 短纸) 2505.18778v1 |
Authors: Benjamin Bennetzen, Peter Buus Steffensen, Hans Hüttel, Nikolaj Rossander Kristensen, Andreas Tor Mortensen
In this paper, we present a generalization of a syntax-directed editor calculus, which can be used to instantiate a specialized syntax-directed editor for any language given by some abstract syntax. The editor calculus guarantees the absence of syntactical errors while allowing incomplete programs. The generalized editor calculus is then encoded into a simply typed lambda calculus, extended with pairs, booleans, pattern matching and fixed points.
Article 1251
Title@2025-05-24 (6): Disentangling Knowledge Representations for Large Language Model Editing
Title: Disentangling Knowledge Representations for Large Language Model Editing | Entwirren von Wissensdarstellungen für die Bearbeitung von großen Sprachmodellen | 分散大语言模式编辑的知识代表 2505.18774v1 |
Authors: Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge: facts that share the same subject as the edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
Article 1252
Title@2025-05-24 (6): Attacking Vision-Language Computer Agents via Pop-ups
Title: Attacking Vision-Language Computer Agents via Pop-ups | Angriff auf Vision-Sprache Computer-Agenten über Pop-ups | 通过弹出式攻击视觉语言计算机代理器 2411.02391v2 |
Authors: Yanzhe Zhang, Tao Yu, Diyi Yang
Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Although such visual inputs are becoming more integrated into agentic applications, the types of risks and attacks that exist around them remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
Article 1253
Title@2025-05-24 (6): Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
Title: Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | Auf dem Weg zu einer emotional konsistenten textbasierten Sprachredaktion: Einführung von EmoCorrector und dem ECD-TSE-Datensatz | 面向情感上一致的文本语音编辑:介绍EmoCorrictor和ECD-TSE数据集 2505.20341v1 |
Authors: Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li
Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text’s emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker’s identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we introduce a benchmark, the Emotion Correction Dataset for TSE (ECD-TSE). The prominent aspect of ECD-TSE is its inclusion of <text, speech> paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.
Article 1254
Title@2025-05-24 (6): Towards an automatic method for generating topical vocabulary test forms for specific reading passages
Title: Towards an automatic method for generating topical vocabulary test forms for specific reading passages | Auf dem Weg zu einer automatischen Methode zur Generierung aktueller Vokabular-Testformulare für bestimmte Lesepassagen | 建立一个自动方法,为特定阅读段落制作专题词汇测试表 2505.18762v1 |
Authors: Michael Flor, Zuowei Wang, Paul Deane, Tenaha O’Reilly
Background knowledge is typically needed for successful comprehension of topical and domain-specific reading passages, such as those in the STEM domain. However, there are few automated measures of student knowledge that can be readily deployed and scored in time to predict whether a given student will likely be able to understand a specific content-area text. In this paper, we present our effort in developing K-tool, an automated system for generating topical vocabulary tests that measure students’ background knowledge related to a specific text. The system automatically detects the topic of a given text and produces topical vocabulary items based on their relationship with the topic. This information is used to automatically generate background knowledge forms that contain words that are highly related to the topic and words that share similar features but are not strongly associated with the topic. Prior research indicates that performance on such tasks can help determine whether a student is likely to understand a particular text based on their knowledge state. The described system is intended for use with middle and high school students who are native speakers of English. It is designed to handle single reading passages and is not dependent on any corpus or text collection. In this paper, we describe the system architecture and present an initial evaluation of the system outputs.
nan
Article 1255
Title@2025-05-24 (6): How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
Title: How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark | Wie wird LLM-Reasoning vom irrelevanten Kontext abgelenkt? Eine Analyse mit einem kontrollierten Benchmark | LLM 为何被不相关背景所忽略? 2505.18761v1 |
Authors: Minglai Yang, Ethan Huang, Liang Zhang, Mihai Surdeanu, William Wang, Liangming Pan
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models’ (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
nan
Article 1256
Title@2025-05-24 (6): Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection
Title: Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection | Weniger scharfe Optimierung von Sensordaten mit großen Sprachmodellen: Eine Fallstudie zur Ermüdungserkennung | 利用大语言模型对传感器数据使用高语言模型的微小最优化:关于Fatigue探测的案例研究 2505.18754v1 |
Authors: Elsen Ronando, Sozo Inoue
In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.
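The two-stage selection described in this abstract (distance filter, then relevance re-ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `k_filter` and `k_final` parameters, and the `toy_scorer` that stands in for an LLM relevance score are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_examples(query, pool, scorer, k_filter=4, k_final=2):
    """Hypothetical hybrid selection: distance filter, then re-ranking.

    pool is a list of (feature_vector, label) pairs; scorer(query, example)
    returns a relevance score where higher means more relevant.
    """
    # Stage 1: keep the k_filter candidates closest to the query.
    nearest = sorted(pool, key=lambda ex: euclidean(query, ex[0]))[:k_filter]
    # Stage 2: re-rank the survivors by relevance (stable sort keeps
    # the distance order among candidates with equal scores).
    reranked = sorted(nearest, key=lambda ex: scorer(query, ex), reverse=True)
    return reranked[:k_final]

def toy_scorer(query, example):
    """Assumed stand-in for an LLM-produced contextual relevance score."""
    return 1.0 if example[1] == "fatigue" else 0.0

pool = [
    ([0.0, 0.0], "rest"),
    ([1.0, 1.0], "fatigue"),
    ([1.2, 0.9], "rest"),
    ([5.0, 5.0], "fatigue"),  # relevant label, but too far away to survive stage 1
]
chosen = select_examples([1.0, 1.0], pool, toy_scorer, k_filter=3, k_final=2)
print(chosen)
```

In a real pipeline the scorer would call a language model; keeping it as a plain callable makes the distance-only baseline easy to reproduce by passing a scorer that returns a constant.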
nan
Article 1257
Title@2025-05-24 (6): Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning
Title: Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning | Vereinheitlichen von Aufmerksamkeitsköpfen und Task-Vektoren über versteckte Zustandsgeometrie im In-Context-Lernen | 通过内文学习中隐藏状态几何几何,统一关注负责人和任务矢量 2505.18752v1 |
Authors: Haolin Yang, Hakaze Cho, Yiqiao Zhong, Naoya Inoue
The unusual properties of in-context learning (ICL) have prompted investigations into the internal mechanisms of large language models. Prior work typically focuses on either special attention heads or task vectors at specific layers, but lacks a unified framework linking these components to the evolution of hidden states across layers that ultimately produce the model’s output. In this paper, we propose such a framework for ICL in classification tasks by analyzing two geometric factors that govern performance: the separability and alignment of query hidden states. A fine-grained analysis of layer-wise dynamics reveals a striking two-stage mechanism: separability emerges in early layers, while alignment develops in later layers. Ablation studies further show that Previous Token Heads drive separability, while Induction Heads and task vectors enhance alignment. Our findings thus bridge the gap between attention heads and task vectors, offering a unified account of ICL’s underlying mechanisms.
nan
Article 1258
Title@2025-05-24 (6): An Illusion of Progress? Assessing the Current State of Web Agents
Title: An Illusion of Progress? Assessing the Current State of Web Agents | Eine Illusion des Fortschritts? Bewertung des aktuellen Zustands der Web-Agenten | 进展幻影? 评估网络代理目前的状况 2504.01382v3 |
Authors: Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su
As digitalization and cloud technologies evolve, the web is becoming increasingly important in modern society. Autonomous web agents based on large language models (LLMs) hold great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.
nan
Article 1259
Title@2025-05-24 (6): LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges
Title: LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | LogicCat: Ein Chain-of-Thought-Text-to-SQL-Benchmark für Multi-Domain-Reasoning-Herausforderungen | LocicCat:多领域合理性挑战的 “ 探索链 “ 文本到SQL基准 2505.18744v1 |
Authors: Tao Liu, Hongying Zan, Yifan Li, Dixuan Zhang, Lulu Kong, Haixin Liu, Jiaming Hou, Aoze Zheng, Rui Li, Yiming Qiao, Zewei Luo, Qi Wang, Zhiqiang Zhang, Jiaxi Li, Supeng Liu, Kunli Zhang, Min Peng
Text-to-SQL is a fundamental task in natural language processing that seeks to translate natural language questions into meaningful and executable SQL queries. While existing datasets are extensive and primarily focus on business scenarios and operational logic, they frequently lack coverage of domain-specific knowledge and complex mathematical reasoning. To address this gap, we present a novel dataset tailored for complex reasoning and chain-of-thought analysis in SQL inference, encompassing physical, arithmetic, commonsense, and hypothetical reasoning. The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 step-by-step reasoning annotations, spanning 45 databases across diverse domains. Experimental results demonstrate that LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%. Incorporating our chain-of-thought annotations boosts performance to 33.96%. Benchmarking leading public methods on Spider and BIRD further underscores the unique challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.
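For readers unfamiliar with the execution accuracy metric quoted in this abstract, the following is a minimal sketch using Python's standard `sqlite3` module. It assumes an order-insensitive comparison of result rows and treats a query that fails to execute as incorrect; the benchmark's official evaluator may differ. The table, queries, and function names are invented for illustration.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries run and yield the same rows (any order)."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

# Toy in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE body (name TEXT, mass_kg REAL, accel REAL)")
conn.executemany(
    "INSERT INTO body VALUES (?, ?, ?)",
    [("cart", 2.0, 3.0), ("block", 5.0, 1.0)],
)

pairs = [
    # Same result rows, different column alias: counted as a match.
    ("SELECT name, mass_kg * accel FROM body",
     "SELECT name, mass_kg * accel AS force FROM body"),
    # Runs, but returns different rows: mismatch.
    ("SELECT name FROM body WHERE mass_kg > 10",
     "SELECT name FROM body WHERE mass_kg > 1"),
    # Refers to a column that does not exist: execution error.
    ("SELECT nope FROM body", "SELECT name FROM body"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

Comparing executed results rather than SQL strings means syntactically different but equivalent queries are scored as correct.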
nan
Article 1260
Title@2025-05-24 (6): Interpretable Company Similarity with Sparse Autoencoders
Title: Interpretable Company Similarity with Sparse Autoencoders | Interpretierbare Firmenähnlichkeit mit Sparse Autoencodern | 与Sparse Autoencolders 相似 2412.02605v3 |
Authors: Marco Molinari, Victor Shao, Luca Imeneo, Mateusz Mikolajczak, Vladimir Tregubiak, Abhimanyu Pandey, Sebastian Kuznetsov Ryder Torres Pereira
Determining company similarity is a vital task in finance, underpinning risk management, hedging, and portfolio diversification. Practitioners often rely on sector and industry classifications such as SIC and GICS codes to gauge similarity, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Since these classifications lack granularity and need regular updating, using clusters of embeddings of company descriptions has been proposed as a potential alternative, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders (SAEs) have shown promise in enhancing the interpretability of Large Language Models (LLMs) by decomposing LLM activations into interpretable features. Moreover, SAEs capture an LLM’s internal representation of a company description, as opposed to semantic similarity alone, as is the case with embeddings. We apply SAEs to company descriptions, and obtain meaningful clusters of equities. We benchmark SAE features against SIC codes, industry codes, and embeddings. Our results demonstrate that SAE features surpass sector classifications and embeddings in capturing fundamental company characteristics. This is evidenced by their superior performance in correlating logged monthly returns - a proxy for similarity - and generating higher Sharpe ratios in co-integration trading strategies, which underscores deeper fundamental similarities among companies. Finally, we verify the interpretability of our clusters, and demonstrate that sparse features form simple and interpretable explanations for our clusters.
nan
Article 1261
Title@2025-05-24 (6): Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Title: Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen | 语言模型中强化研究链理由的特征采掘和指南 2505.15634v2 |
Authors: Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
nan
Article 1262
Title@2025-05-24 (6): ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
Title: ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search | ReGUIDE: Dateneffizientes GUI Grounding über räumliche Vernunft und Suche | 数据高效界面:通过空间理性和搜索进行数据高效界面定位 2505.15259v2 |
Authors: Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo
Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% samples compared to the best open-sourced baselines).
nan
Article 1263
Title@2025-05-24 (6): Demonstration Selection for In-Context Learning via Reinforcement Learning
Title: Demonstration Selection for In-Context Learning via Reinforcement Learning | Demonstrationsauswahl für das In-Context-Lernen mittels Verstärkungs-Lernen | 通过强化学习,通过强化学习,选入内文学习的示范 2412.03966v2 |
Authors: Xubin Wang, Jianfei Wu, Yichen Yuan, Deyu Cai, Mingzhe Li, Weijia Jia
Diversity in demonstration selection is critical for enhancing model generalization by enabling broader coverage of structures and concepts. Constructing appropriate demonstration sets remains a key research challenge. This paper introduces the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning (RL) frameworks to optimize the selection of diverse reference demonstrations for tasks amenable to in-context learning (ICL), particularly text classification and reasoning, in few-shot prompting scenarios. RDES employs frameworks like Q-learning and a PPO-based variant to dynamically identify demonstrations that maximize both diversity (quantified by label distribution) and relevance to the task objective. This strategy ensures a balanced representation of reference data, leading to improved accuracy and generalization. Through extensive experiments on multiple benchmark datasets, including diverse reasoning tasks, and involving 14 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances performance compared to ten established baselines. Our evaluation includes analysis of performance across varying numbers of demonstrations on selected datasets. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further boosts predictive performance. The results highlight the potential of RL for adaptive demonstration selection and addressing challenges in ICL.
nan
Article 1264
Title@2025-05-24 (6): Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking
Title: Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking | Zuckerbeschichtetes Gift: Benign Generation entsperrt LLM Jailbreaking | 食糖毒物:善后一代解锁 LLM 监狱破解 2504.05652v2 |
Authors: Yu-Hang Wu, Yu-Jie Xiong, Hao Zhang, Jia-Chen Zhang, Zheng Zhou
With the increasingly deep integration of large language models (LLMs) across diverse domains, the effectiveness of their safety mechanisms is encountering severe challenges. Currently, jailbreak attacks based on prompt engineering have become a major safety threat. However, existing methods primarily rely on black-box manipulation of prompt templates, resulting in poor interpretability and limited generalization. To break through the bottleneck, this study first introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs’ benign generation: as benign content generation in LLMs increases, the model’s focus on input instructions progressively diminishes. Building on this insight, we propose the Sugar-Coated Poison (SCP) attack paradigm, which uses a “semantic reversal” strategy to craft benign inputs that are opposite in meaning to malicious intent. This strategy induces the models to generate extensive benign content, thereby enabling adversarial reasoning to bypass safety mechanisms. Experiments show that SCP outperforms existing baselines. Remarkably, it achieves an average attack success rate of 87.23% across six LLMs. For defense, we propose Part-of-Speech Defense (POSD), leveraging verb-noun dependencies for syntactic analysis to enhance safety of LLMs while preserving their generalization ability.
nan
Article 1265
Title@2025-05-24 (6): Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson’s Disease Classifiers
Title: Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson’s Disease Classifiers | Bewertung der Nützlichkeit nicht-diagnostischer Sprachdaten für die Entwicklung von Parkinson-Krankheitsklassifikatoren | 评价发展帕金森病分级器的非诊断性语音数据的用处 2505.18722v1 |
Authors: Terry Yi Zhong, Esther Janse, Cristian Tejedor-Garcia, Louis ten Bosch, Martha Larson
Speech-based Parkinson’s disease (PD) detection has gained attention for its automated, cost-effective, and non-intrusive nature. As research studies usually rely on data from diagnostic-oriented speech tasks, this work explores the feasibility of diagnosing PD on the basis of speech data not originally intended for diagnostic purposes, using the Turn-Taking (TT) dataset. Our findings indicate that TT can be as useful as diagnostic-oriented PD datasets like PC-GITA. We also investigate which specific dataset characteristics impact PD classification performance. The results show that concatenating audio recordings and balancing participants’ gender and status distributions can be beneficial. Cross-dataset evaluation reveals that models trained on PC-GITA generalize poorly to TT, whereas models trained on TT perform better on PC-GITA. Furthermore, we provide insights into the high variability across folds, which is mainly due to large differences in individual speaker performance.
nan
Article 1266
Title@2025-05-24 (6): Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Title: Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization | Optimales Transport-basiertes Token-Gewichtungssystem für verbesserte Preference-Optimierung | 增强优惠优化的优化运输托肯加权计划 2505.18720v1 |
Authors: Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO’s effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
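To make the idea of token weighting in a DPO-style loss concrete, here is a minimal sketch. It uses the standard DPO objective (negative log-sigmoid of the scaled difference in policy-versus-reference log-ratios) and multiplies each token's log-ratio by a weight. How the weights are obtained is the paper's contribution and is not reproduced here: the weights below are made-up numbers, and all function names are assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_logratio(policy_logps, ref_logps, weights):
    """Sum over tokens of weight * (policy log-prob - reference log-prob)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """chosen / rejected are (policy_logps, ref_logps, weights) triples.

    With all weights equal to 1 this reduces to the standard DPO loss,
    -log(sigmoid(beta * margin)), written here as softplus(-beta * margin).
    """
    margin = weighted_logratio(*chosen) - weighted_logratio(*rejected)
    return softplus(-beta * margin)

# Per-token log-probabilities for a two-token chosen and rejected response.
chosen_policy, chosen_ref = [-1.0, -2.0], [-1.5, -2.5]
rejected_policy, rejected_ref = [-1.0, -1.0], [-0.5, -0.5]

uniform_loss = token_weighted_dpo_loss(
    (chosen_policy, chosen_ref, [1.0, 1.0]),
    (rejected_policy, rejected_ref, [1.0, 1.0]),
)
# Hypothetical weights that emphasize the first chosen token.
weighted_loss = token_weighted_dpo_loss(
    (chosen_policy, chosen_ref, [2.0, 1.0]),
    (rejected_policy, rejected_ref, [1.0, 1.0]),
)
print(uniform_loss, weighted_loss)
```

Up-weighting a token where the policy already favors the chosen response widens the margin and lowers the loss, which is the sense in which weighting can make the reward difference more contrastive.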
nan
Article 1267
Title@2025-05-24 (6): Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
Title: Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer | Neurale Parameter Suche nach schlankeren Modellen und besserer Übertragung | 搜索细微精制模型和更好传输的神经参数 2505.18713v1 |
Authors: Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.
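The preprocessing step mentioned in this abstract, computing the difference between a fine-tuned model and the original model (the task vector), can be sketched as follows. The pruning step shown is plain magnitude pruning, swapped in purely for illustration; it is not the NPS-Pruning search over low-rank subspaces. Parameter names and values are invented.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }

def prune_by_magnitude(vector, keep_ratio):
    """Zero all entries below the magnitude threshold set by keep_ratio.

    Illustrative stand-in only; ties at the threshold are all kept.
    """
    flat = sorted((abs(v) for vals in vector.values() for v in vals), reverse=True)
    n_keep = max(1, int(len(flat) * keep_ratio))
    threshold = flat[n_keep - 1]
    return {
        name: [v if abs(v) >= threshold else 0.0 for v in vals]
        for name, vals in vector.items()
    }

def apply_task_vector(pretrained, vector, scale=1.0):
    """Rebuild a model as pre-trained weights plus a scaled task vector."""
    return {
        name: [p + scale * d for p, d in zip(pretrained[name], vector[name])]
        for name in pretrained
    }

pretrained = {"w": [1.0, 2.0, 3.0, 4.0]}
finetuned = {"w": [1.5, 2.0, 2.0, 4.25]}

delta = task_vector(finetuned, pretrained)          # {"w": [0.5, 0.0, -1.0, 0.25]}
pruned = prune_by_magnitude(delta, keep_ratio=0.5)  # keeps the two largest changes
compressed = apply_task_vector(pretrained, pruned)
print(compressed)
```

Only the sparse task vector needs to be stored alongside the shared pre-trained weights, which is where the storage savings for compressed fine-tuned models come from.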
nan
Article 1268
Title@2025-05-24 (6): Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models
Title: Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models | Dynamische Manifold Evolutionstheorie: Modellierung und Stabilitätsanalyse latenter Repräsentationen in großen Sprachmodellen | 动态操纵动态进化理论:大语言模型中前代代表的建模和稳定分析 2505.20340v1 |
Authors: Yukun Zhang, Qi Dong
We introduce Dynamic Manifold Evolution Theory (DMET), a unified framework that models large language model generation as a controlled dynamical system evolving on a low-dimensional semantic manifold. By casting latent-state updates as discrete-time Euler approximations of continuous dynamics, we map intrinsic energy-driven flows and context-dependent forces onto Transformer components (residual connections, attention, feed-forward networks). Leveraging Lyapunov stability theory, we define three empirical metrics (state continuity, clustering quality, topological persistence) that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence. Extensive experiments across decoding parameters validate DMET’s predictions and yield principled guidelines for balancing creativity and consistency in text generation.
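The phrase "discrete-time Euler approximations of continuous dynamics" refers to the update h_next = h + dt * f(h), which has the same additive form as a residual connection. The sketch below illustrates that update on a toy system and checks that a simple Lyapunov-style energy (the squared norm) decreases along the trajectory. The dynamics function, step size, and names are assumptions chosen for illustration and are not taken from the paper.

```python
def euler_step(h, f, dt):
    """One explicit Euler step: h_next = h + dt * f(h)."""
    return [hi + dt * fi for hi, fi in zip(h, f(h))]

def rollout(h0, f, dt, steps):
    """Iterate the Euler update and return the full state trajectory."""
    trajectory = [list(h0)]
    for _ in range(steps):
        trajectory.append(euler_step(trajectory[-1], f, dt))
    return trajectory

def pull_to_origin(h):
    """Toy dynamics dh/dt = -h, which has a stable fixed point at the origin."""
    return [-x for x in h]

def squared_norm(h):
    """Candidate Lyapunov function V(h) = ||h||^2."""
    return sum(x * x for x in h)

trajectory = rollout([1.0, -2.0], pull_to_origin, dt=0.1, steps=20)
energies = [squared_norm(h) for h in trajectory]

# For this system each step scales the state by (1 - dt), so V shrinks every step.
print(all(later < earlier for earlier, later in zip(energies, energies[1:])))
```

A monotonically decreasing energy along the trajectory is the kind of property a Lyapunov stability argument establishes; with too large a step size (here, dt above 2) the same Euler scheme would diverge, which is why step size matters in such discretizations.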
nan
Article 1269
Title@2025-05-24 (6): What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
Title: What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations | Worum geht es dabei? Ein Video-zu-Text-Zusammenfassungsdatensatz für wissenschaftliche Präsentationen | 这是在谈论什么?一个用于科学演示的视频到文字汇总数据集 2502.08279v4 |
Authors: Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg
Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
nan
Article 1270
Title@2025-05-24 (6): Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla
Title: Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla | Verbesserung der Bangla-Linguistik: Fortgeschrittene LSTM-, Bi-LSTM- und Seq2Seq-Modelle zur Übertragung von Sylheti auf moderne Bangla | 改进孟加拉语言:高级LSTM、Bi-LSTM和Seq2Seqeq 将Sylheti转换为现代孟加拉语的模式 2505.18709v1 |
Authors: Sourav Kumar Das, Md. Julkar Naeen, MD. Jahidul Islam, Md. Anisul Haque Sajeeb, Narayan Ranjan Chakraborty, Mayen Uddin Mojumdar
Bangla, or Bengali, is the national language of Bangladesh, but people from different regions do not speak standard Bangla. Every division of Bangladesh has its own local language, such as Sylheti or Chittagong. In recent years, some papers have been published on the Bangla language, covering topics such as sentiment analysis and fake news detection and classification, but only a few of them addressed local Bangla languages. This research targets local languages, and this particular paper focuses on the Sylheti language. It presents a comprehensive system using Natural Language Processing (NLP) techniques for translating Pure or Modern Bangla to the locally spoken Sylheti Bangla language. A total of 1,200 data samples were used to train three models (LSTM, Bi-LSTM, and Seq2Seq), and LSTM performed best with 89.3% accuracy. The findings of this research may help Bangla NLP researchers develop more advanced innovations in the future.
nan
Article 1271
Title@2025-05-24 (6): A General Knowledge Injection Framework for ICD Coding
Title: A General Knowledge Injection Framework for ICD Coding | Ein allgemeiner Wissenseinspritzrahmen für ICD Coding | ICD 编码一般知识输入框架 2505.18708v1 |
Authors: Xu Zhang, Kun Zhang, Wenxin Ma, Rongsheng Wang, Chenxu Wu, Yingtai Li, S. Kevin Zhou
ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance the ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves the state-of-the-art performance on most evaluation metrics. Code is available at https://github.com/xuzhang0112/GKI-ICD.
nan
Article 1272
Title@2025-05-24 (6): OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Title: OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis | OpenOmni: Advancing Open-Source Omnimodale große Sprachmodelle mit progressiver multimodaler Ausrichtung und Echtzeit-Self-Aware-Emotional Speech-Synthese | OpenOmni:推进开放源码全现代大语言模式,采用渐进式多模式调整和实时自觉情感言语合成 2501.04561v5 |
Authors: Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang
Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, enabling real-time emotional speech synthesis with high fidelity. Experiments show that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5x fewer training samples and a smaller model size (7B vs. 7x8B). Additionally, OpenOmni achieves real-time speech generation with <1s latency in non-autoregressive mode, reducing inference time by 5x compared to autoregressive methods, and improves emotion classification accuracy by 7.7%.
nan
Article 1273
Title@2025-05-24 (6): Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task
Title: Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task | Auf dem Weg zur semantischen Integration von Meinungen: Einheitliche Meinungskonzepte Ontologie und Extraktionsaufgabe | 争取在语义上综合各种意见:统一意见概念的本体学和采掘业任务 2505.18703v1 |
Authors: Gaurav Negi, Dhairya Dalal, Omnia Zayed, Paul Buitelaar
This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from the text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.
nan
Article 1274
Title@2025-05-24 (6): Assessing the Capability of LLMs in Solving POSCOMP Questions
Title: Assessing the Capability of LLMs in Solving POSCOMP Questions | Bewertung der Fähigkeit von LLM bei der Lösung von POSCOMP-Fragen | 评估LLLMs在解决POSCOMP问题方面的能力 2505.20338v1 |
Authors: Cayo Viegas, Rohit Gheyi, Márcio Ribeiro
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models’ proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
Article 1275
Title@2025-05-24 (6): Benchmarking and Rethinking Knowledge Editing for Large Language Models
Title: Benchmarking and Rethinking Knowledge Editing for Large Language Models | Benchmarking und Rethinking Knowledge Editing für große Sprachmodelle | 大语言模式知识编辑基准制定和重新思考 2505.18690v1 |
Authors: Guoxiu He, Xin Song, Futing Wang, Aixin Sun
Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
Article 1276
Title@2025-05-24 (6): A statistically consistent measure of semantic uncertainty using Language Models
Title: A statistically consistent measure of semantic uncertainty using Language Models | Ein statistisch konsistentes Maß semantischer Unsicherheit mittels Sprachmodellen | 使用语言模式统计一致的语义不确定性计量 2502.00507v3 |
Authors: Yi Liu
To address the challenge of quantifying uncertainty in the outputs generated by language models, we propose a novel measure of semantic uncertainty, semantic spectral entropy, that is statistically consistent under mild assumptions. This measure is implemented through a straightforward algorithm that relies solely on standard, pretrained language models, without requiring access to the internal generation process. Our approach imposes minimal constraints on the choice of language models, making it broadly applicable across different architectures and settings. Through comprehensive simulation studies, we demonstrate that the proposed method yields an accurate and robust estimate of semantic uncertainty, even in the presence of the inherent randomness characteristic of generative language model outputs.
Article 1277
Title@2025-05-24 (6): Large Language Models in the Task of Automatic Validation of Text Classifier Predictions
Title: Large Language Models in the Task of Automatic Validation of Text Classifier Predictions | Große Sprachmodelle in der Aufgabe der automatischen Validierung von Textklassifikatoren Vorhersagen | 文本分类自动验证任务中的大语言模型 2505.18688v1 |
Authors: Aleksandr Tsymbalov
Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model’s entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
Article 1278
Title@2025-05-24 (6): From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation
Title: From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation | Von der Generation zur Erkennung: Ein multimodaler Multi-Task-Datensatz zum Benchmarking von Gesundheitsmissinformationen | 从产生到检测:用于确定健康错误信息基准的多式联运多任务数据集 2505.18685v1 |
Authors: Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem
Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.
Article 1279
Title@2025-05-24 (6): TULUN: Transparent and Adaptable Low-resource Machine Translation
Title: TULUN: Transparent and Adaptable Low-resource Machine Translation | TULUN: Transparente und anpassungsfähige Maschinelle Übersetzung mit geringer Ressource | TULUN: 透明和可调适的低资源机器翻译 2505.18683v1 |
Authors: Raphaël Merx, Hanna Suominen, Lois Hong, Nick Thieberger, Trevor Cohn, Ekaterina Vylomova
Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.
Article 1280
Title@2025-05-24 (6): $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models
Title: $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models | $PD^3F$: Ein steckbares und dynamisches DoS-Defense-Framework gegen Angriffe auf den Ressourcenverbrauch | $PD3F$:针对大语言模式的针对资源消费攻击的可渗透和动态的多斯防御框架 2505.18680v1 |
Authors: Yuanhe Zhang, Xinyue Wang, Haoran Gao, Zhenhong Zhou, Fanyu Meng, Yuyao Zhang, Sen Su
Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework ($PD^3F$), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing resource usage induced by malicious attacks under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which terminates excessive malicious generation early. Experiments across six models demonstrate that $PD^3F$ significantly mitigates resource consumption attacks, improving users’ access capacity by up to 500% during adversarial load. $PD^3F$ represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.
Article 1281
Title@2025-05-24 (6): Safety in Large Reasoning Models: A Survey
Title: Safety in Large Reasoning Models: A Survey | Sicherheit in großen vernünftigen Modellen: Eine Umfrage | 大理由模型中的安全性:调查 2504.17704v3 |
Authors: Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhong-Zhi Li, Yingwei Ma, Yufei He, Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.
Article 1282
Title@2025-05-24 (6): Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts
Title: Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts | Sozialgut oder wissenschaftliche Neugier? Entdeckung der Forschung hinter NLP-Artefakten | 社会良好还是科学好奇? 发现NLP艺术作品背后的研究阵形 2505.18677v1 |
Authors: Eric Chamoun, Nedjma Ousidhoum, Michael Schlichtkrull, Andreas Vlachos
Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset, achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
Article 1283
Title@2025-05-24 (6): IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Title: IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery | IRIS: Interaktives Forschungs-Ideierungssystem zur Beschleunigung der wissenschaftlichen Entdeckung | IRIS:加速科学发现交互式研究标志系统 2504.16728v2 |
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. It is designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
Article 1284
Title@2025-05-24 (6): Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Title: Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | Kann MLLMs mich nach Hause führen? Eine Benchmark-Studie zur feinkörnigen visuellen Vernunft von Transit Maps | MLLMM MLLM 指导我回家吗? 关于过境地图的精美视觉依据基准研究 2505.18675v1 |
Authors: Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
Article 1285
Title@2025-05-24 (6): Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Title: Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models | Cross-Lingual Pitfalls: Automatisches Probieren von Cross-Lingual-Schwächen bei mehrsprachigen großen Sprachmodellen | 跨语言空洞:多种语言大语言模式的自动试探跨语言弱点 2505.18673v1 |
Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50\% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
Article 1286
Title@2025-05-24 (6): MOSLIM:Align with diverse preferences in prompts through reward classification
Title: MOSLIM:Align with diverse preferences in prompts through reward classification | MOSLIM: Mit verschiedenen Präferenzen in Aufforderungen durch Prämienklassifizierung ausrichten | MOSLIM:通过奖励分类与各种偏好保持一致 2505.20336v1 |
Authors: Yu Zhang, Wanli Jiang, Zhengyu Yang
The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them, and then optimizes the policy model with a scalar reward derived from a mapping function that converts the reward model's classification results into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.
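As an illustration only, the idea of turning per-objective classification results into a single scalar reward might be sketched as follows. The abstract does not specify the mapping function, so the averaging rule, the function name `scalar_reward`, and the objective and class names are all assumptions made for this example, not the authors' method.

```python
def scalar_reward(head_probs, targets):
    """Map multi-head classification results to one scalar reward.

    head_probs: {objective: {class_label: probability}} from a (hypothetical)
                multi-head reward classifier.
    targets:    {objective: class_label} requested in the prompt.
    Assumed mapping: mean probability assigned to the requested classes.
    """
    scores = [head_probs[name][label] for name, label in targets.items()]
    return sum(scores) / len(scores)


# Hypothetical classifier output for one question-answer pair.
head_probs = {
    "helpfulness": {"low": 0.1, "high": 0.9},
    "harmlessness": {"low": 0.3, "high": 0.7},
}
targets = {"helpfulness": "high", "harmlessness": "high"}
reward = scalar_reward(head_probs, targets)
```

Changing `targets` changes the reward for the same answer, which is one way prompt-level control over objectives could enter the training signal.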
Article 1287
Title@2025-05-24 (6): Language Model Distillation: A Temporal Difference Imitation Learning Perspective
Title: Language Model Distillation: A Temporal Difference Imitation Learning Perspective | Sprachmodell Destillation: Ein zeitlicher Unterschied Imitation Lernperspektive | 语言模型蒸馏:时间差异差异模拟学习视角 2505.20335v1 |
Authors: Zishun Yu, Shangzhe Li, Xinhua Zhang
Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.
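The reduced action space can be pictured with a small sketch. The abstract does not say how the vocabulary subset is chosen, so the top-k rule, the renormalization step, and the name `reduced_action_space` below are assumptions made for illustration.

```python
def reduced_action_space(teacher_probs, k):
    """Keep the k most probable tokens under the teacher and renormalize.

    teacher_probs: {token: probability} for one decoding step.
    Returns the renormalized distribution over the kept tokens and the
    fraction of the teacher's probability mass those tokens covered.
    """
    top = sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k]
    mass = sum(teacher_probs[t] for t in top)
    return {t: teacher_probs[t] / mass for t in top}, mass


# Toy next-token distribution in which two tokens carry most of the mass.
teacher_probs = {"the": 0.6, "a": 0.3, "cat": 0.05, "dog": 0.05}
actions, covered = reduced_action_space(teacher_probs, k=2)
```

With a sparse teacher distribution, a small `k` already covers most of the probability mass, which is the observation the abstract uses to motivate operating on a subset of the vocabulary.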
Article 1288
Title@2025-05-24 (6): Little Data, Big Impact: Privacy-Aware Visual Language Models via Minimal Tuning
Title: Little Data, Big Impact: Privacy-Aware Visual Language Models via Minimal Tuning | Little Data, Big Impact: Datenschutzerklärung Visual Language Models via Minimal Tuning | Little Data, Big impact: 通过最小图案生成的隐私-软件视觉语言模型 2405.17423v3 |
Authors: Laurens Samson, Nimrod Barazani, Sennay Ghebreab, Yuki M. Asano
As Visual Language Models (VLMs) become increasingly embedded in everyday applications, ensuring they can recognize and appropriately handle privacy-sensitive content is essential. We conduct a comprehensive evaluation of ten state-of-the-art VLMs and identify limitations in their understanding of visual privacy. Existing datasets suffer from label inconsistencies, limiting their reliability. To address this, we introduce two compact, high-quality benchmarks, PrivBench and PrivBench-H, that focus on commonly recognized privacy categories aligned with the General Data Protection Regulation (GDPR). Additionally, we present PrivTune, an instruction-tuning dataset specifically curated to improve privacy sensitivity. We obtain a Privacy VLM by fine-tuning an off-the-shelf VLM on only 100 samples from PrivTune, which leads to substantial gains on all benchmarks, surpassing GPT-4, while maintaining strong performance on other tasks. Our findings show that privacy-awareness in VLMs can be substantially improved with minimal data and careful dataset design, setting the stage for safer, more privacy-aligned AI systems.
Article 1289
Title@2025-05-24 (6): ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
Title: ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | ChartGalaxy: Ein Datensatz für Infografik Chart Verstehen und Generieren | 图表银河:用于了解和生成信息图表的数据集 2505.18668v1 |
Authors: Zhen Li, Yukai Guo, Duan Li, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu
Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.
Article 1290
Title@2025-05-24 (6): Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics
Title: Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics | Robustheit in großen Sprachmodellen: Eine Umfrage zu Mitigationsstrategien und Evaluationsmetrics | 大语言模式的强强力:减轻战略调查和评价 2505.18658v1 |
Authors: Pankaj Kumar, Subhankar Mishra
Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.
Article 1291
Title@2025-05-24 (6): Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change
Title: Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change | Climate-Eval: Ein umfassender Maßstab für NLP-Aufgaben im Zusammenhang mit dem Klimawandel | 气候 – – Eval:与气候变化有关的国家土地规划任务的综合基准 2505.18653v1 |
Authors: Murathan Kurfalı, Shorouq Zahra, Joakim Nivre, Gabriele Messori
Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.
Article 1292
Title@2025-05-24 (6): On the Emergence of Linear Analogies in Word Embeddings
Title: On the Emergence of Linear Analogies in Word Embeddings | Zur Entstehung linearer Analogien in Word-Embeddings | 单线模拟在文字嵌入中的出现 2505.18651v1 |
Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure – for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ – whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
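For readers who want to see the co-occurrence matrix concretely, the ratio $M(i,j) = P(i,j)/(P(i)P(j))$ and its logarithm can be computed from a count table as in the sketch below. The toy counts, the function names, and the choice to take the marginal $P(i)$ as the row sum of $P(i,j)$ are assumptions for this example; the eigenvector step that produces the embeddings is not shown.

```python
import math


def cooccurrence_ratio(counts):
    """Return M(i,j) = P(i,j) / (P(i) P(j)) for a symmetric count matrix.

    Assumes P(i,j) is the normalized count and P(i) = sum_j P(i,j).
    """
    total = sum(sum(row) for row in counts)
    n = len(counts)
    joint = [[c / total for c in row] for row in counts]
    marginal = [sum(row) for row in joint]
    return [
        [joint[i][j] / (marginal[i] * marginal[j]) for j in range(n)]
        for i in range(n)
    ]


def log_ratio(m, floor=1e-12):
    """Elementwise log M(i,j); the floor guards against log(0) for unseen pairs."""
    return [[math.log(max(v, floor)) for v in row] for row in m]


# Toy counts for two words that co-occur independently (a rank-one table),
# so every entry of M should equal 1 and every entry of log M should equal 0.
counts = [[4, 2], [2, 1]]
M = cooccurrence_ratio(counts)
logM = log_ratio(M)
```

Entries of $M$ above 1 mark word pairs that co-occur more often than independence would predict, and entries below 1 mark pairs that co-occur less often.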
Article 1293
Title@2025-05-24 (6): Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study
Title: Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study | Kann LLMs die Hate Speech Detection über Sprachen hinweg verhindern? Eine Null- und Wenige-Schuss-Studie | 能够跨语言探测出LMs Unlock仇恨言论吗? 2505.06149v3 |
Authors: Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser
Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.
nan
Article 1294
Title@2025-05-24 (6): Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data
Title: Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data | Dateneffiziente Hate Speech-Erkennung durch Cross-Lingual Nearchbor Retrieval mit limitierten beschrifteten Daten | 通过带有有限标签数据的跨近近邻检索检索数据有效仇恨言论检测 2505.14272v2 |
Authors: Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser
Although detecting hateful language is important, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
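The retrieval step with redundancy filtering can be sketched as a greedy maximal marginal relevance (MMR) selection over embedding vectors. This is an illustrative sketch, not the authors' pipeline: the embedding source, the cosine similarity, and the trade-off weight `lam` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query, pool, k, lam=0.7):
    """Greedy MMR: pick k pool indices that are relevant to the query
    but not too similar to items already picked."""
    selected = []
    candidates = list(range(len(pool)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, pool[i])
            redundancy = max((cosine(pool[i], pool[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this reduces to plain nearest-neighbor retrieval. Lower values penalize near-duplicates more strongly, which is the redundancy-filtering effect the abstract describes.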
nan
Article 1295
Title@2025-05-24 (6): SEW: Self-Evolving Agentic Workflows for Automated Code Generation
Title: SEW: Self-Evolving Agentic Workflows for Automated Code Generation | SEW: Selbst-evolvierende Agentische Workflows für die automatisierte Codegenerierung | SEW:自动代码生成的自演动态制剂工作流程 2505.18646v1 |
Authors: Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng
Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.
nan
Article 1296
Title@2025-05-24 (6): Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
Title: Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving | Verbesserung der Verallgemeinerung von sprachgroßen Sprachmodellen mit Multi-Task Behavior Imitation und Speech-Text Interleaving | 加强具有多任务行为模拟和语音文本互换功能的语音大语言模式的通用化 2505.18644v1 |
Authors: Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu
Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task ‘behavior imitation’ method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
nan
Article 1297
Title@2025-05-24 (6): Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster
Title: Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster | Skip-Thinking: Chain-of-Thought-Destillation ermöglicht kleinere Sprachmodelle besser und schneller zu begründen | 跳过思考: 切入式深思熟虑的蒸馏链让更小的语言模型更好、更快地使用 2505.18642v1 |
Authors: Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, Xinwang Liu
Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) over-smoothed, as they make up only a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve core reasoning tokens (e.g., summary and transitional chunks) from the SLM’s learning of reasoning chunks, increasing the fraction of core reasoning tokens in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning intermediate chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
nan
Article 1298
Title@2025-05-24 (6): Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Title: Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees | Multi-Step Alignment als Markov Games: Ein optimaler Online-Gradient-Abstieg mit Konvergenzgarantien | 作为Markov运动会的多步对齐:带有一致保障的乐观的在线逐渐递增人种方法 2502.12678v2 |
Authors: Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher
Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Optimistic Multi-step Preference Optimization (OMPO), is built upon the optimistic online mirror descent algorithm \citep{rakhlin2013online,joulani17a}. Theoretically, we provide a rigorous analysis for the convergence of OMPO and show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.
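To illustrate the optimistic update in the simplest possible setting, the sketch below runs optimistic multiplicative weights (optimistic mirror descent with an entropy regularizer) on a one-shot zero-sum matrix game. It is not OMPO itself. The rock-paper-scissors payoff matrix, the step size `eta`, and the step count are assumptions chosen for illustration, and the multi-step Markov-game structure of the paper is omitted.

```python
import math

# Rock-paper-scissors payoff for the row player (illustrative zero-sum game).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]


def row_payoffs(y):
    """Expected payoff of each row action against column strategy y."""
    return [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]


def col_payoffs(x):
    """Row player's expected payoff for each column action; the column player wants these low."""
    return [sum(A[i][j] * x[i] for i in range(3)) for j in range(3)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def duality_gap(x, y):
    """Non-negative; zero exactly at a Nash equilibrium."""
    return max(row_payoffs(y)) - min(col_payoffs(x))


def optimistic_mwu(x, y, eta=0.1, steps=2000):
    """Both players update simultaneously using the optimistic gradient 2*g_t - g_{t-1}."""
    gx_prev, gy_prev = row_payoffs(y), col_payoffs(x)
    for _ in range(steps):
        gx, gy = row_payoffs(y), col_payoffs(x)
        x = normalize([xi * math.exp(eta * (2 * g - gp))
                       for xi, g, gp in zip(x, gx, gx_prev)])
        y = normalize([yi * math.exp(-eta * (2 * g - gp))
                       for yi, g, gp in zip(y, gy, gy_prev)])
        gx_prev, gy_prev = gx, gy
    return x, y
```

The only difference from plain multiplicative weights is the correction term `2 * g - gp`, which uses the previous gradient as a prediction of the next one.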
nan
Article 1299
Title@2025-05-24 (6): Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
Title: Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query | Lookahead Q-Cache: Konsistentere KV-Cache-Eviktion durch Pseudo-Abfrage | LOSAhead Q-Cache : 通过 Pseudo 查询实现 KV 更一致的 CAche 切除 2505.20334v1 |
Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
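The importance-estimation step can be sketched as follows: score each cached position by the attention it receives from a window of queries, then keep the highest-scoring positions within the budget. How LAQ produces its pseudo lookahead queries is not described in the abstract, so this sketch simply takes `window_queries` as given. The single-head layout and the summed-softmax score are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv_to_keep(keys, window_queries, budget):
    """Return sorted indices of cached positions to keep.

    keys: one key vector per cached position.
    window_queries: query vectors used as the observation window.
    budget: number of positions that may stay in the cache.
    """
    dim = len(keys[0])
    importance = [0.0] * len(keys)
    for q in window_queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        for idx, weight in enumerate(softmax(logits)):
            importance[idx] += weight  # accumulate attention mass per position
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])
```

Swapping prefilling-stage queries for lookahead queries changes only what is passed as `window_queries`. The selection logic stays the same.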
nan
Article 1300
Title@2025-05-24 (6): DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation
Title: DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation | DDO: Dual-Decision-Optimierung durch Multi-Agent-Kollaboration für LLM-basierte medizinische Beratung | DDO:通过多方机构协作,优化基于LLM的医疗咨询的双重决定 2505.18630v1 |
Authors: Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang
Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling and independently optimizing the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.
nan
Article 1301
Title@2025-05-24 (6): Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models
Title: Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models | Multi-Scale Manifold Alignment: Ein einheitliches Framework zur besseren Erklärbarkeit großer Sprachmodelle | 多规模工作人员配置对齐:提高大语言模式解释性的统一框架 2505.20333v1 |
Authors: Yukun Zhang, Qi Dong
Recent advances in Large Language Models (LLMs) have achieved strong performance, yet their internal reasoning remains opaque, limiting interpretability and trust in critical applications. We propose a novel Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details. Our method introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints like MINE or VIB). We further incorporate curvature regularization and hyperparameter tuning for stable optimization. Theoretical analysis shows that alignment error, measured by KL divergence, can be bounded under mild assumptions. This framework offers a unified explanation of how LLMs structure multi-scale semantics, advancing interpretability and enabling applications such as bias detection and robustness enhancement.
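As a small worked instance of the geometric-alignment idea, the sketch below solves the rotation-only Procrustes problem in two dimensions, where the best angle has a closed form. This is a toy special case under stated assumptions (2-D points, no centering, scaling, or reflection). The abstract does not say how the paper applies Procrustes analysis to high-dimensional latent spaces.

```python
import math


def best_rotation(source, target):
    """Angle (radians) of the rotation R minimizing sum ||R s_i - t_i||^2 in 2-D."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, angle):
    """Rotate 2-D points counter-clockwise by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def alignment_error(source, target):
    """Residual squared error after the best rotation is applied."""
    angle = best_rotation(source, target)
    return sum(math.dist(p, q) ** 2 for p, q in zip(rotate(source, angle), target))
```

The closed form follows from maximizing $\sum_i t_i \cdot R s_i = \cos\theta \sum_i s_i \cdot t_i + \sin\theta \sum_i (s_i \times t_i)$ over $\theta$.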
nan
Article 1302
Title@2025-05-24 (6): HARP: Hesitation-Aware Reframing in Transformer Inference Pass
Title: HARP: Hesitation-Aware Reframing in Transformer Inference Pass | HARP: Hezitation-Aware Reframing in Transformer Inferenz Pass | HARP: 变压器推断通过中的偏移-软件重新配置 2412.07282v2 |
Authors: Romain Storaï, Seung-won Hwang
This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the “off-the-shelf” Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements up to +5.16%. Notably, HARP achieves these gains while keeping inference twice as fast as beam search. Simple and yet with significant gains, HARP provides insights into the potential of adaptive computation for enhancing the performance of Transformer-based language models.
nan
Article 1303
Title@2025-05-24 (6): Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models
Title: Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models | Empirische Bewertung der Wissensdestillation von Transformern zu subquadratischen Sprachmodellen | 从变异器到次赤道语言模式的知识提炼经验评估 2504.14366v2 |
Authors: Patrick Haller, Jonas Golde, Alan Akbik
Knowledge distillation is a widely used technique for compressing large language models (LLMs), in which a smaller student model is trained to mimic a larger teacher model. Typically, both the teacher and student models are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention during inference remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher model to eight subquadratic student architectures. Our study investigates which subquadratic model can most effectively approximate the teacher model’s learned representations through knowledge distillation, and how different architectural design choices influence the training dynamics. We further investigate the impact of initialization strategies, such as matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.
nan
Article 1304
Title@2025-05-24 (6): Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Title: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? | Können LLM-Wasserzeichen die unautorisierte Destillation von Wissen wirksam verhindern? | LLM Watermarks能否强有力地防止未经授权的知识蒸馏? 2502.11598v2 |
Authors: Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
nan
Article 1305
Title@2025-05-24 (6): Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Title: Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation | Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation | LLMs能否协助其模糊性? 量化评估关于 “ Word Sense Disanderation “ 的各种大语言模型。 2411.18337v3 |
Authors: T. G. D. K. Sumanathilaka, Nicholas Micallef, Julian Hough
Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-the-loop approach for prompt augmentation, where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (CoT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.
nan
Article 1306
Title@2025-05-24 (6): MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Title: MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation | MAVL: Ein mehrsprachiger Audio-Video-Text Datensatz für animierte Song-Übersetzung | MAVL: 动动歌曲翻译多语种视听歌词数据集 2505.18614v1 |
Authors: Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
nan
Article 1307
Title@2025-05-24 (6): PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs
Title: PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs | PM-KVQ: Progressive Mixed-Precision KV Cache Quantization für Long-CoT LLMs | PM-KVQ: 长 CoT LLMs 的渐进混合精度 KV 缓存量 2505.18610v1 |
Authors: Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two ways: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that applies positional interpolation to short calibration data to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at https://github.com/thu-nics/PM-KVQ.
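The intuition behind lowering the bit-width progressively can be sketched with a uniform quantizer and a simple budget rule. Both are assumptions made for illustration: the min-max quantizer, the candidate widths, and the "largest width that still fits" policy in `pick_bits` are not taken from the paper, and block-wise sensitivity is ignored.

```python
def quantize(values, bits):
    """Uniform min-max quantization; returns the dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1            # number of quantization steps
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def pick_bits(num_tokens, values_per_token, budget_bits, widths=(16, 8, 4, 2)):
    """Largest bit-width whose cache footprint fits the budget (assumed policy)."""
    for bits in widths:
        if num_tokens * values_per_token * bits <= budget_bits:
            return bits
    return widths[-1]                   # fall back to the smallest width


def max_error(values, bits):
    """Worst-case absolute error introduced by quantizing at the given width."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))
```

While the cache is short, `pick_bits` keeps a high bit-width and so uses memory that would otherwise sit idle. The width drops only as `num_tokens` grows, which delays the point at which the coarser quantization error is introduced.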
nan
Article 1308
Title@2025-05-24 (6): Flex-Judge: Think Once, Judge Anywhere
Title: Flex-Judge: Think Once, Judge Anywhere | Flex-Richter: Denken Sie einmal, Richter überall | 灵活法官:想一想,法官 2505.18601v1 |
Authors: Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
nan
Article 1309
Title@2025-05-24 (6): SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals
Title: SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals | SMI: Ein informationstheoretisches Metric zur Vorhersage von Modellwissen ausschließlich aus Vorschulungssignalen | SMI:从培训前信号中单独预测模型知识的信息理论计量方法 2502.04066v3 |
Authors: Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task indicative of a model’s internal knowledge. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response to these challenges, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Subsequently, we develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring any additional training. The experimental results demonstrate that SMI outperforms co-occurrence-based baselines, achieving $R^2 > 0.75$ on models with over one billion parameters. Theoretical analysis further reveals the marginal benefits of scaling model size and optimizing data, indicating that the upper limit of specific QA task accuracy is approximately 80%. Our project is available at https://github.com/yuhui1038/SMI.
nan
Article 1310
Title@2025-05-24 (6): Safety Alignment via Constrained Knowledge Unlearning
Title: Safety Alignment via Constrained Knowledge Unlearning | Sicherheitsausrichtung durch eingeschränktes Wissen Unlernen | 通过受限制的知识实现安全协调 2505.18588v1 |
Authors: Zesheng Shi, Yucheng Zhou, Jing Li
Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
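The gradient-pruning step can be sketched as a masked parameter update: neurons in the protected set U receive no update, while the rest take an ordinary gradient step. The scoring rule, the top-k selection, and the learning rate below are hypothetical placeholders. The abstract does not specify how CKU scores neurons or which unlearning loss produces the gradients.

```python
def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, used here as the protected set U."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def masked_update(weights, grads, protected, lr=0.1):
    """One gradient step that zeroes the gradient for protected neuron indices."""
    return [w if i in protected else w - lr * g
            for i, (w, g) in enumerate(zip(weights, grads))]
```

For example, with `scores = [0.9, 0.1, 0.5, 0.7]` and `k = 2`, neurons 0 and 3 are frozen and only neurons 1 and 2 move during unlearning.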
nan
Article 1311
Title@2025-05-24 (6): Model Extrapolation Expedites Alignment
Title: Model Extrapolation Expedites Alignment | Modell Extrapolation Expeditionen Ausrichtung | 模型外推快速调整 2404.16792v4 |
Authors: Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng
Given the high computational cost of preference alignment training of large language models (LLMs), exploring efficient methods to reduce the training overhead remains an important and compelling research problem. Motivated by the observation that alignment training typically involves only small parameter changes without injecting new knowledge into models, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs’ alignment with human preferences. Given a partially-trained model and its initial SFT checkpoint, ExPO improves the implicit optimization objective of alignment training by simply amplifying the parameter change based on a first-order approximation, without any additional training overhead. Through controlled experiments, we demonstrate that ExPO boosts a DPO model trained for only 20% of the steps to outperform the fully-trained one. Moreover, we show that ExPO notably improves existing open-source LLMs (ranging from 1.8B to 70B parameters) on the leading AlpacaEval 2.0 and MT-Bench benchmarks, which highlights ExPO’s broader utility in efficiently enhancing LLM alignment.
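Amplifying the parameter change can be sketched as a per-parameter extrapolation along the direction from the SFT checkpoint to the partially trained model, $\theta_{\text{new}} = \theta_{\text{aligned}} + \alpha\,(\theta_{\text{aligned}} - \theta_{\text{SFT}})$. The dictionary-of-lists checkpoint layout and the value of `alpha` are assumptions for illustration. The abstract does not state how the amplification factor is chosen.

```python
def extrapolate(sft, aligned, alpha=0.5):
    """Extrapolate each parameter past the aligned checkpoint.

    sft, aligned: dicts mapping parameter names to equal-length lists of floats.
    alpha: hypothetical amplification factor; alpha=0 returns the aligned weights.
    """
    return {
        name: [a + alpha * (a - s) for a, s in zip(aligned[name], sft[name])]
        for name in aligned
    }
```

Because the update is a single pass over the weights of two existing checkpoints, it needs no gradient computation, which matches the abstract's claim of no additional training overhead.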
nan
Article 1312
Title@2025-05-24 (6): Removal of Hallucination on Hallucination: Debate-Augmented RAG
Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG | Aufhebung der Halluzination auf Halluzination: Debatte-erweiterte RAG | 在幻觉中去除幻觉:辩论增强的RAG 2505.18581v1 |
Authors: Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li
Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.
nan
Article 1313
Title@2025-05-24 (6): Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Title: Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs | Steigerung der Effizienz und Exploration bei der Stärkung des Lernens für LLMs | 提高LLMM 强化学习的效率和探索 2505.18573v1 |
Authors: Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model’s exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
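One way to picture difficulty-based rollout allocation is the sketch below. It is an assumption-laden illustration, not the authors' mechanism: difficulty is approximated by the number of failed rollouts a question had in an earlier round, and the function name `allocate_rollouts` and its parameters are hypothetical.

```python
def allocate_rollouts(fail_counts, total_budget, min_rollouts=1):
    """Split a fixed rollout budget so harder questions get more rollouts.

    fail_counts[i] is a hypothetical difficulty proxy: how many earlier
    rollouts for question i were wrong. Every question keeps at least
    `min_rollouts`; the rest is shared in proportion to fail_counts.
    """
    n = len(fail_counts)
    spare = total_budget - min_rollouts * n
    if spare < 0:
        raise ValueError("budget is too small for the per-question minimum")

    total_fails = sum(fail_counts)
    if total_fails == 0:
        # No difficulty signal yet: fall back to a uniform split.
        weights, total_fails = [1] * n, n
    else:
        weights = list(fail_counts)

    # Integer arithmetic keeps the split exact and reproducible.
    budgets = [min_rollouts + spare * w // total_fails for w in weights]

    # Hand rollouts lost to rounding down to the hardest questions first.
    leftover = total_budget - sum(budgets)
    hardest_first = sorted(range(n), key=lambda i: weights[i], reverse=True)
    for i in hardest_first[:leftover]:
        budgets[i] += 1
    return budgets


budgets = allocate_rollouts([1, 5, 9], total_budget=18)
```

With the same total of 18 rollouts, a uniform scheme would give each question 6; here the easiest question gets 2 and the hardest gets 10.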
Article 1314
Title@2025-05-24 (6): ReflectDiffu:Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
Title: ReflectDiffu:Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework | ReflectDiffu: Reflect zwischen emotional-intent Ansteckung und Mimicry für Empathetic Response Generation über ein RL-Diffusion Framework | 反省:通过RL-扩散框架,对情感-情感内聚变和Mmimimicry之间的反射,以便产生同情性反应 2409.10289v3 |
Authors: Jiahao Yuan, Zixiang Di, Zhiqing Cui, Guisong Yang, Usman Naseem
Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice reflect mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.
Article 1315
Title@2025-05-24 (6): From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test
Title: From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test | Von Wort zu Welt: Bewertung und Mitigate Kultur Bias via Word Association Test | 从Word到世界:通过Word协会试验评价和消化文化偏见 2505.18562v1 |
Authors: Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate this cultural preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western (notably American) cultural schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
Article 1316
Title@2025-05-24 (6): TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation
Title: TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation | TAG-INSTRUCT: Controlled Instruction Complexity Enhancement durch strukturbasierte Augmentation | TAG-INSTRUCT:通过基于结构的增强增强控制性教学复杂度 2505.18557v1 |
Authors: He Zhu, Zhiwen Ruan, Junyou Su, Xingwei He, Wenjia Zhang, Yun Chen, Guanhua Chen
High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.
Article 1317
Title@2025-05-24 (6): Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Title: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | Erforschung der Vulnerabilität der Content Moderation Guardrail in großen Sprachmodellen durch Intent Manipulation | 通过意向操纵探索大语言模型中内容调节保护栏的脆弱性 2505.18556v1 |
Authors: Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang
Intent detection, a core component of natural language understanding, has evolved considerably into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs’ moderation guardrails, showing significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our “FSTR+SPIN” variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs’ safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
Article 1318
Title@2025-05-24 (6): Unraveling Misinformation Propagation in LLM Reasoning
Title: Unraveling Misinformation Propagation in LLM Reasoning | Nichtverbreitung von Fehlinformationen in LLM-Reasoning | 以LLM 理由解释方式进行错误信息传播 2505.18555v1 |
Authors: Yiyang Feng, Yichen Wang, Shaobo Cui, Boi Faltings, Mina Lee, Jiawei Zhou
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by misinformation, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs’ reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifying misinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02% - 72.20%). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation.
Article 1319
Title@2025-05-24 (6): MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors
Title: MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors | MSA bei BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning für die multidimensionale Bewertung von LLMs als Math Tutoren | BEA 2025年BEA管理事务管理事务协议 共同任务:对作为数学导师的LLMs进行多种不同类型评价的 2505.18549v1 |
Authors: Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi
We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
Article 1320
Title@2025-05-24 (6): Composable Cross-prompt Essay Scoring by Merging Models
Title: Composable Cross-prompt Essay Scoring by Merging Models | Composable Cross-prompt Essay Scoring by Merging Models | 通过合并模型进行可合成的跨速化 ESS Scay Scorporing 2505.18548v1 |
Authors: Sanwoo Lee, Kun Liang, Yunfang Wu
Recent advances in cross-prompt automated essay scoring (AES) typically train models jointly on all source prompts, often requiring additional access to unlabeled target prompt essays simultaneously. However, using all sources is suboptimal in our pilot study, and re-accessing source datasets during adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges individually trained source models’ parameters instead of datasets. In particular, we simulate joint training through linear combinations of task vectors, i.e., the parameter updates from fine-tuning. To optimize the combination’s coefficients, we propose Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes the model’s score discriminability regularized by priors pre-computed from the sources. We employ Bayesian optimization as an efficient optimizer of PIM. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms training jointly on all sources, (2) maintains superior robustness compared to other merging methods, and (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.
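The linear combination of task vectors mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: parameters are name-to-float dictionaries, the function name `merge_task_vectors` is hypothetical, and the coefficients are fixed by hand rather than found by PIM with Bayesian optimization as in the abstract.

```python
def merge_task_vectors(base, source_models, coefficients):
    """Add a weighted sum of task vectors to the base model.

    A task vector is (fine-tuned parameters - base parameters).
    The coefficients here are hand-picked placeholders; the paper
    optimizes them, which this sketch does not attempt.
    """
    merged = dict(base)
    for model, coeff in zip(source_models, coefficients):
        for name, value in model.items():
            merged[name] += coeff * (value - base[name])
    return merged


# One scalar "parameter" and two source models (illustrative values only).
base_model = {"w": 1.0}
source_models = [{"w": 2.0}, {"w": 0.0}]

merged = merge_task_vectors(base_model, source_models, [0.5, 0.25])
```

Only the source models' parameters are needed at merge time, not their training data, which is what makes the adaptation source-free.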
Article 1321
Title@2025-05-24 (6): B-score: Detecting biases in large language models using response history
Title: B-score: Detecting biases in large language models using response history | B-Score: Voreingenommenheit in großen Sprachmodellen anhand der Antworthistorie erkennen | B-序号:利用回应历史在大型语言模型中发现偏见 2505.18545v1 |
Authors: An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen
Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to “de-bias” themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases in answers to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.
Article 1322
Title@2025-05-24 (6): Unearthing Large Scale Domain-Specific Knowledge from Public Corpora
Title: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora | Großes Domain-Spezifisches Wissen aus der öffentlichen Corpora entschlüsseln | 从公共企业中挖掘出大型大型域域特定知识 2401.14624v4 |
Authors: Zhaoye Fei, Yunfan Shao, Linyang Li, Zhiyuan Zeng, Conghui He, Qipeng Guo, Hang Yan, Dahua Lin, Xipeng Qiu
Large language models (LLMs) have demonstrated remarkable potential in various tasks; however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl (CC), a large public corpus. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Retrieve-Pile, which covers four main domains, including the sciences, humanities, and other categories. Through the analysis of , Retrieve-from-CC can effectively retrieve relevant data from the covered knowledge domains and significantly improve the performance in tests of mathematical and knowledge-related reasoning abilities. We have released Retrieve-Pile at https://huggingface.co/datasets/Query-of-CC/Retrieve-Pile.
Article 1323
Title@2025-05-24 (6): Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Title: Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning | Verbesserung des Charakter-Level-Verständnisses in LLMs durch Token Internal Structure Learning | 通过 Token 内部结构学习加强LLM女士的品级理解 2411.17679v4 |
Authors: Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs’ ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models’ ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer’s vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
Article 1324
Title@2025-05-24 (6): NoveltyBench: Evaluating Language Models for Humanlike Diversity
Title: NoveltyBench: Evaluating Language Models for Humanlike Diversity | NoveltyBench: Sprachmodelle für die menschliche Vielfalt bewerten | 新闻:评价促进人类多样性的语言模式 2504.05228v3 |
Authors: Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, Daphne Ippolito
Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly suffer from mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize diversity alongside quality.
Article 1325
Title@2025-05-24 (6): Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models | Verstärkte Feinsteuerungskräfte, die die Fähigkeit multimodaler großer Sprachmodelle begründen | 多种多式大语言模式能力的理由 2505.18536v1 |
Authors: Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering the reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks, and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. A summary of work on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
Article 1326
Title@2025-05-24 (6): InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models | InftyThink: Die Längengrenzen der Langkontext-Reasoning in großen Sprachmodellen durchbrechen | 思考:在大语言模式中打破长句理由的长度限制 2503.06692v3 |
Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
Article 1327
Title@2025-05-24 (6): SMART: Self-Aware Agent for Tool Overuse Mitigation
Title: SMART: Self-Aware Agent for Tool Overuse Mitigation | SMART: Self-Aware Agent für Tool Overuse Mitigation | SMART: 减少工具过度使用自智能剂 2502.11435v2 |
Authors: Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent’s self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match their 70B counterparts and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These results highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
Article 1328
Title@2025-05-24 (6): metaTextGrad: Automatically optimizing language model optimizers
Title: metaTextGrad: Automatically optimizing language model optimizers | metaTextGrad: Sprachmodell-Optimierer automatisch optimieren | metaTextGrad: 自动优化语言模型优化器 2505.18524v1 |
Authors: Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Zou
Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.
Article 1329
Title@2025-05-24 (6): How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
Title: How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation | Wie beeinflusst Sequence-Modellierung Architektur Basisfähigkeiten von vortrainierten Sprachmodellen? Erforschen von Schlüsselarchitektur-Design-Prinzipien zur Vermeidung von Basisfähigkeiten Degradation | 如何按序列模拟结构模型模拟培训前语言模型的建筑影响基础能力? 探索重要建筑设计原则,以避免基础能力退化 2505.18522v1 |
Authors: Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Unlike work that proposes sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design work fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: A sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results support our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.
Article 1330
Title@2025-05-24 (6): AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
Title: AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking | AcuRank: Ungewissheits-Bewusst-Adaptive-Computation für Listwise-Reranking | AcuRank: 列表排序的不确定性- 软件适应性计算 2505.18512v1 |
Authors: Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Because of limited context size and the high inference cost of long contexts, reranking is typically performed over small subsets of fixed size, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
Article 1331
Title@2025-05-24 (6): EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
Title: EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents | EscapeBench: Auf dem Weg zu mehr kreativer Intelligenz von Sprachmodell-Agenten | 逃避:努力推进语言示范代理的创意智能 2412.13549v2 |
Authors: Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, Yunzhu Li, Heng Ji
Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current LMs, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies.
Article 1332
Title@2025-05-24 (6): Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection
Title: Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection | Gruppenadaptive Schwellenoptimierung für robuste KI-generierte Texterkennung | 强力AI-发光的文本探测的集团-适应性阈值优化 2502.04528v4 |
Authors: Minseok Jung, Cynthia Fuertes Panizo, Liam Dugan, Yi R. Fung, Pin-Yu Chen, Paul Pu Liang
The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $\theta = 0.5$) to classify machine-generated text. However, one universal threshold could fail to account for distributional variations by subgroups. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text, and more positive classifications of neurotic writing styles among long texts. These discrepancies can lead to misclassifications that disproportionately affect certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors. We partitioned data into subgroups based on attributes (e.g., text length and writing style) and implemented FairOPT to learn decision thresholds for each group to reduce discrepancy. In experiments with nine AI text classifiers on three datasets, FairOPT decreases overall balanced error rate (BER) discrepancy by 12% while minimally sacrificing accuracy by 0.003%. Our framework paves the way for more robust classification in AI-generated content detection via post-processing.
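To make the idea of group-specific thresholds concrete, here is a minimal sketch that picks one threshold per subgroup by grid search on the balanced error rate (BER). This is a simplification for illustration, not the FairOPT algorithm itself: the grid, the record layout `(group, score, label)`, and the per-group BER objective are assumptions, whereas the abstract describes optimizing thresholds to reduce the discrepancy across groups.

```python
def balanced_error_rate(scores, labels, threshold):
    """BER = (false positive rate + false negative rate) / 2; label 1 = AI-generated."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    neg, pos = labels.count(0), labels.count(1)
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return 0.5 * (fpr + fnr)

def fit_group_thresholds(records, grid=None):
    """records: iterable of (group, detector_score, label). Returns {group: threshold}."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    by_group = {}
    for group, score, label in records:
        scores, labels = by_group.setdefault(group, ([], []))
        scores.append(score)
        labels.append(label)
    return {g: min(grid, key=lambda t: balanced_error_rate(s, l, t))
            for g, (s, l) in by_group.items()}

records = [
    ("short", 0.58, 0), ("short", 0.62, 0), ("short", 0.80, 1), ("short", 0.90, 1),
    ("long", 0.20, 0), ("long", 0.32, 0), ("long", 0.45, 1), ("long", 0.70, 1),
]
print(fit_group_thresholds(records))
```

In this toy data a global threshold of 0.5 misclassifies every short human-written text, while the per-group thresholds separate both groups cleanly.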
Article 1333
Title@2025-05-24 (6): Knowledge Grafting of Large Language Models
Title: Knowledge Grafting of Large Language Models | Wissen Graften von großen Sprachmodellen | 大语言模式知识转让 2505.18502v1 |
Authors: Guodong Du, Xuanning Zhou, Junlin Li, Zhuo Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model’s intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
Article 1334
Title@2025-05-24 (6): UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Title: UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | UGPhysics: Umfassender Benchmark für Undergraduate Physics Reasoning mit großen Sprachmodellen | 动脉物理学:具有大语言模型的本科物理原因综合基准 2502.00334v3 |
Authors: Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, Yang Wang
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs’ abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy is 49.8% (achieved by OpenAI-o1-mini), which emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning. Codes and data are available at https://github.com/YangLabHKUST/UGPhysics .
Article 1335
Title@2025-05-24 (6): ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Title: ACECODER: Acing Coder RL via Automated Test-Case Synthesis | ACECODER: Acing Coder RL über automatisierte Test-Case-Synthese | 通过自动测试-案件综合合成检索编码器 RL 2502.01718v4 |
Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. With best-of-32 sampling, the resulting reward models yield an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins, making the 7B model on par with the 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting directly from Qwen2.5-Coder-base, and show that our RL training can improve the model on HumanEval-plus by over 25% and on MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
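The reward-model recipe above (pass rates, then preference pairs, then a Bradley-Terry loss) can be sketched as follows. The `min_gap` filter and the function names are hypothetical; the abstract does not state how pairs are filtered. The loss is the standard Bradley-Terry negative log-likelihood, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$.

```python
import math
from itertools import combinations

def build_preference_pairs(pass_rates, min_gap=0.4):
    """Turn per-program test pass rates into (chosen, rejected) pairs.

    pass_rates: dict mapping program id -> fraction of synthesized tests passed.
    min_gap: assumed minimum pass-rate difference for a pair to count as a
    reliable preference (illustrative value).
    """
    pairs = []
    for a, b in combinations(pass_rates, 2):
        gap = pass_rates[a] - pass_rates[b]
        if abs(gap) >= min_gap:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

pairs = build_preference_pairs({"p1": 1.0, "p2": 0.5, "p3": 0.0})
print(pairs)
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
```

The loss is small when the reward model already scores the higher-pass-rate program above the other one, and large when the order is reversed.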
Article 1336
Title@2025-05-24 (6): The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
Title: The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models | Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen | 机器的实用思维:追踪大语言模式中实用能力的出现 2505.18497v1 |
Authors: Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt
Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al., 2024) and theory-of-mind reasoning (Shapira et al., 2024), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. We systematically evaluate 22 LLMs across key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.
Article 1337
Title@2025-05-24 (6): FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Title: FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers | FuseGPT: Lernbare Ebenen Fusion generativer vortrainierter Transformer | FuseGPT: 训练前改造器的产生型先导变异器的可学习层融合 2411.14507v2 |
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains, largely due to the extensive scaling of model parameters. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. However, such direct removal often leads to irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model’s performance. Firstly, we introduce a new importance detection metric, Macro Influence (MI), which evaluates the long-term impact of each transformer block by quantifying the information loss incurred upon its removal. Next, we propose group-level layer fusion, which leverages the parameters from layers of less important blocks and integrates them into the corresponding layers of neighboring blocks. This fusion process is not a one-time operation but is refined through iterative parameter updates by lightweight group-level fine-tuning. Specifically, the injected parameters are frozen but are weighted with learnable rank decomposition matrices to reduce the computational overhead during fine-tuning. Our approach not only works well for large language models but also for large multimodal models. Experimental results indicate that, even with modest amounts of data, FuseGPT surpasses previous methods in both perplexity and zero-shot task performance.
Article 1338
Title@2025-05-24 (6): TextArena
Title: TextArena | TextArena | TextArenna 文本 2504.11442v2 |
Authors: Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples are available on https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
Article 1339
Title@2025-05-24 (6): AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Title: AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents | AgentOccam: Eine einfache, aber starke Basis für LLM-basierte Web-Agenten | AgentOccam:基于LLM的网络代理的简单而有力的基线 2410.13825v2 |
Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala
Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent’s observation/action representation and the pre-training data of the LLM it’s based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM-based web agent by simply refining its observation and action space to better align with the LLM’s capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in-context examples, new agent roles, online feedback or search strategies. AgentOccam’s simple design highlights LLMs’ impressive zero-shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM-based agents.
Article 1340
Title@2025-05-24 (6): ADEPT: A DEbiasing PrompT Framework
Title: ADEPT: A DEbiasing PrompT Framework | ADEPT: Ein abschreckendes PrompT-Framework | ADEPT: 减少偏见的促进促进框架 2211.05414v3 |
Authors: Ke Yang, Charles Yu, Yi Fung, Manling Li, Heng Ji
Several works have proven that finetuning is an applicable approach for debiasing contextualized word embeddings. Similarly, discrete prompts with semantic meanings have been shown to be effective in debiasing tasks. With unfixed mathematical representation at the token level, continuous prompts usually surpass discrete ones at providing a pre-trained language model (PLM) with additional task-specific information. Despite this, relatively few efforts have been made to debias PLMs by prompt tuning with continuous prompts compared to their discrete counterparts. Furthermore, for most debiasing methods that alter a PLM’s original parameters, a major problem is the need to not only decrease the bias in the PLM but also to ensure that the PLM does not lose its representation ability. Finetuning methods typically have a hard time maintaining this balance, as they tend to violently remove meanings of attribute words. In this paper, we propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability. To achieve this, we propose a new training criterion inspired by manifold learning and equip it with an explicit debiasing term to optimize prompt tuning. In addition, we conduct several experiments with regard to the reliability, quality, and quantity of a previously proposed attribute training corpus in order to obtain a clearer prototype of a certain attribute, which indicates the attribute’s position and relative distances to other words on the manifold. We evaluate ADEPT on several widely acknowledged debiasing benchmarks and downstream tasks, and find that it achieves competitive results while maintaining (and in some cases even improving) the PLM’s representation ability. We further visualize words’ correlation before and after debiasing a PLM, and give some possible explanations for the visible effects.
Article 1341
Title@2025-05-24 (6): Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications
Title: Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications | Synchronisieren und Anpassen von Fehlerkorrekturdaten für mobile Großsprachen-Modellanwendungen | 合成和调整移动大语言模型应用错误校正数据 2505.18488v1 |
Authors: Yanxiang Zhang, Zheng Xu, Shanshan Wu, Yuanbo Zhang, Daniel Ramage
Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
Article 1342
Title@2025-05-24 (6): AI Idea Bench 2025: AI Research Idea Generation Benchmark
Title: AI Idea Bench 2025: AI Research Idea Generation Benchmark | KI Idee Bank 2025: KI Forschung Idee Generation Benchmark | AI 2025年大赦国际思想座座:AI 研究思想的产生基准 2504.14191v3 |
Authors: Yansheng Qiu, Haoquan Zhang, Zhaopan Xu, Ming Li, Diping Song, Zheng Wang, Kaipeng Zhang
Large-scale Language Models (LLMs) have revolutionized human-AI interaction and achieved significant success in the generation of novel ideas. However, current assessments of idea generation overlook crucial factors such as knowledge leakage in LLMs, the absence of open-ended benchmarks with ground truth, and the limited scope of feasibility analysis constrained by prompt design. These limitations hinder the potential of uncovering groundbreaking research ideas. In this paper, we present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by LLMs within the domain of AI research from diverse perspectives. The framework comprises a comprehensive dataset of 3,495 AI papers and their associated inspired works, along with a robust evaluation methodology. This evaluation system gauges idea quality in two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material. AI Idea Bench 2025’s benchmarking system stands to be an invaluable resource for assessing and comparing idea-generation techniques, thereby facilitating the automation of scientific discovery.
Article 1343
Title@2025-05-24 (6): GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?
Title: GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data? | GeoGrid-Bench: Können Stiftungsmodelle multimodale gegrittete Geo-Raumdaten verstehen? | GeoGrid-Bench:基础模型能够理解多式网格地球空间数据吗? 2505.10714v2 |
Authors: Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Jiashu He, Joshua Bergerson, John K Hutchison, Jordan Branham, Camillo J Taylor, Tanwi Mallick
We present GeoGrid-Bench, a benchmark designed to evaluate the ability of foundation models to understand geo-spatial data in the grid structure. Geo-spatial datasets pose distinct challenges due to their dense numerical values, strong spatial and temporal dependencies, and unique multimodal representations including tabular data, heatmaps, and geographic visualizations. To assess how foundation models can support scientific research in this domain, GeoGrid-Bench features large-scale, real-world data covering 16 climate variables across 150 locations and extended time frames. The benchmark includes approximately 3,200 question-answer pairs, systematically generated from 8 domain expert-curated templates to reflect practical tasks encountered by human scientists. These range from basic queries at a single location and time to complex spatiotemporal comparisons across regions and periods. Our evaluation reveals that vision-language models perform best overall, and we provide a fine-grained analysis of the strengths and limitations of different foundation models in different geo-spatial tasks. This benchmark offers clearer insights into how foundation models can be effectively applied to geo-spatial data analysis and used to support scientific research.
Article 1344
Title@2025-05-24 (6): Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark
Title: Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark | Pädagogik-R1: Pädagogisch ausgerichtetes Reasoning-Modell mit ausgewogenem Bildungs-Benchmark | 教育-R1:具有平衡教育基准的教学统一理由模型 2505.18467v1 |
Authors: Unggi Lee, Jaeyong Lee, Jiyeong Bae, Yeil Jeong, Junbo Koh, Gyeonggeon Lee, Gunho Lee, Taekyung Ahn, Hyeoncheol Kim
Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs’ pedagogical strengths and limitations.
Article 1345
Title@2025-05-24 (6): Measuring South Asian Biases in Large Language Models
Title: Measuring South Asian Biases in Large Language Models | Messung südasiatischer Biasen in großen Sprachmodellen | 衡量大语言模式中的南亚偏见 2505.18466v1 |
Authors: Mamnuya Rinki, Chahat Raj, Anjishnu Mukherjee, Ziwei Zhu
Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.
Article 1346
Title@2025-05-24 (6): From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data
Title: From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data | Von Reddit zur Generativen KI: Bewertung großer Sprachmodelle für Angstunterstützung Feinabstimmung auf Social Media-Daten | 从改编到创创AI:评估社会支助大语言模式,对社会媒体数据进行微调 2505.18464v1 |
Authors: Ugur Kursuncu, Trilok Padhi, Gaurav Sinha, Abdulkadir Erol, Jaya Krishna Mandivarapu, Christopher R. Larrison
The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.
Article 1347
Title@2025-05-24 (6): Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning
Title: Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning | Selbst-GIVE: assoziatives Denken aus begrenztem strukturiertem Wissen für erweiterte Großsprachenmodell-Reasoning | 自用自用:从有限的结构化知识中进行联合思考,以强化大语言模式解释理由 2505.15062v2 |
Authors: Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro
When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate “hormones helping mental disorders” with “melatonin being a hormone and insomnia a mental disorder” to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE’s key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning with Self-GIVE on a 135-node UMLS KG, the performance of the Qwen2.5 3B and 7B models improves by up to 28.5% $\rightarrow$ 71.4% and 78.6% $\rightarrow$ 90.5%, respectively, on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.
Article 1348
Title@2025-05-24 (6): Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales
Title: Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales | Verbesserte multimodale Aspect-Based-Sentiment-Analyse durch LLM-generierte Rationale | 由LLM-Generered Rationsales公司进行的增强型多式多式频谱感应分析 2505.14499v2 |
Authors: Jun Cao, Jiyi Li, Ziwei Yang, Renjie Zhou
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs’ ability to identify aspects and sentiments. We evaluate our method using two baseline models; extensive experiments on three widely used benchmarks highlight the superiority of our approach, indicating its generalizability and applicability to most pre-trained models for MABSA.
Article 1349
Title@2025-05-24 (6): Accelerating Large Language Model Reasoning via Speculative Search
Title: Accelerating Large Language Model Reasoning via Speculative Search | Beschleunigen des Large Language Model Reasoning durch spekulative Suche | 通过投机搜索加速大语言示范理由 2505.02865v2 |
Authors: Zhihai Wang, Jie Wang, Jilai Pan, Xilin Xia, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Feng Wu
Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model’s outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12$\times$ speedup with comparable reasoning quality.
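The quality-preserving rejection idea can be sketched as a filter over thoughts drafted by the small model: a draft is kept only if its score reaches a bar estimated from the large model's own earlier thoughts, and rejected drafts fall back to the large model. Everything named here (`quality_threshold`, `filter_drafts`, the mean-based bar, the `slack` parameter, and the scoring callable) is a hypothetical stand-in; the abstract does not give SpecSearch's actual estimator or scoring function.

```python
from statistics import mean

def quality_threshold(large_model_scores, slack=0.0):
    """Estimate the quality bar from scores of thoughts the large model produced earlier.

    Using the mean (optionally relaxed by `slack`) is an assumption of this sketch.
    """
    return mean(large_model_scores) - slack

def filter_drafts(drafts, score_fn, threshold, regenerate_fn):
    """Keep drafted thoughts that meet the bar; regenerate the rest with the large model.

    Returns the final thoughts and the number of fallbacks to the large model.
    """
    kept, n_regenerated = [], 0
    for draft in drafts:
        if score_fn(draft) >= threshold:
            kept.append(draft)
        else:
            kept.append(regenerate_fn(draft))
            n_regenerated += 1
    return kept, n_regenerated

# Toy stand-ins: a lookup table as the scorer and upper-casing as "regeneration".
scores = {"a": 0.9, "b": 0.4, "c": 0.75}
bar = quality_threshold([0.5, 0.7])
print(filter_drafts(["a", "b", "c"], scores.get, bar, str.upper))
```

The fewer drafts are rejected, the more of the expensive large-model generation is avoided, which is where the speedup would come from.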
Article 1350
Title@2025-05-24 (6): TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Title: TokenSkip: Controllable Chain-of-Thought Compression in LLMs | TokenSkip: Steuerbare Ketten-of-Thought-Kompression in LLMs | TokenSkip: LLMM 中可控制的尝试链压缩 2502.12067v2 |
Authors: Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI’s o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop.
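As a rough illustration of importance-based token skipping, the sketch below keeps the most important fraction of CoT tokens while preserving their original order. The importance scores are taken as given, and `keep_ratio` is a hypothetical name for the controllable compression knob; how TokenSkip actually scores tokens and trains the model to skip them is not specified in the abstract. As a side note, the reported drop from 313 to 181 tokens works out to roughly a 42% reduction.

```python
import math

def compress_cot(tokens, importance, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by importance, in original order."""
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    # Rank positions by importance (stable sort keeps earlier tokens first on ties).
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [tokens[i] for i in keep]

tokens = ["So", "2", "+", "3", "is", "equal", "to", "5"]
importance = [0.1, 0.9, 0.8, 0.9, 0.2, 0.3, 0.1, 1.0]
print(compress_cot(tokens, importance, keep_ratio=0.5))

# Reduction implied by the reported token counts on GSM8K.
print(round(1 - 181 / 313, 3))
```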
Article 1351
Title@2025-05-24 (6): Anchored Diffusion Language Model
Title: Anchored Diffusion Language Model | Verankertes Diffusions-Sprachenmodell | 原成品的传播语言模式 2505.18456v1 |
Authors: Litu Rout, Constantine Caramanis, Sanjay Shakkottai
Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.
Article 1352
Title@2025-05-24 (6): Hybrid Latent Reasoning via Reinforcement Learning
Title: Hybrid Latent Reasoning via Reinforcement Learning | Hybride Latent Reasoning durch Stärkungslernen | 通过强化学习找出原因 2505.18454v1 |
Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs’ generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
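To make the gating idea concrete, here is a minimal sketch of blending a sampled token's embedding with the previous hidden state. It assumes a single scalar gate and a convex combination; the abstract does not give the actual gating formula, so `hybrid_input` and `gate_logit` are hypothetical names chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_input(token_emb, prev_hidden, gate_logit):
    """Blend a sampled token's embedding with the previous hidden state.

    gate = sigmoid(gate_logit) is the weight on the token embedding, so a
    large positive logit yields an input that is almost purely the embedding.
    """
    if len(token_emb) != len(prev_hidden):
        raise ValueError("vectors must have the same dimension")
    gate = sigmoid(gate_logit)
    return [gate * e + (1.0 - gate) * h for e, h in zip(token_emb, prev_hidden)]


emb = [1.0, 0.0, -1.0]
hidden = [0.0, 2.0, 1.0]
early = hybrid_input(emb, hidden, gate_logit=6.0)  # mostly token embedding
later = hybrid_input(emb, hidden, gate_logit=0.0)  # equal mix of both
print(early, later)
```

Starting with a large gate logit mirrors the described initialization with predominantly token embeddings; lowering it over training would admit more of the hidden features.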
Article 1353
Title@2025-05-24 (6): MedScore: Factuality Evaluation of Free-Form Medical Answers
Title: MedScore: Factuality Evaluation of Free-Form Medical Answers | MedScore: Faktizitätsbewertung von Freiform-medizinischen Antworten | 医疗核心:对免费形式医疗答案的实情评估 2505.18452v1 |
Authors: Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, diverse in sentence structure, and subjective, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts. The resulting factuality score varies significantly by decomposition method, verification corpus, and backbone LLM used, highlighting the importance of customizing each step for reliable factuality evaluation.
Article 1354
Title@2025-05-24 (6): $μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
Title: $μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts | $μ$-MoE: Test-Time Pruning als Mikro-Grained Mixture-of-Experts | 美元-MoE:作为微粒混合剂专家进行试验时休整 2505.18451v1 |
Authors: Toshiaki Koike-Akino, Jing Liu, Ye Wang
To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
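The idea of recomputing a pruning mask for each prompt can be sketched as follows. The scoring rule (weight magnitude times the L2 norm of the matching input feature over the prompt's tokens) is a common activation-aware heuristic assumed here, not one confirmed by the abstract, and the per-row top-k mask is unstructured for brevity; `prompt_mask` and `keep_per_row` are hypothetical names.

```python
import math


def prompt_mask(weights, prompt_acts, keep_per_row):
    """Return a 0/1 mask keeping the top-scoring weights in each output row.

    weights: rows are output units, columns are input features.
    prompt_acts: one activation vector per prompt token.
    """
    n_in = len(weights[0])
    # L2 norm of each input feature over the prompt's tokens.
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in prompt_acts)) for j in range(n_in)]
    mask = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        top = sorted(range(n_in), key=lambda j: (-scores[j], j))[:keep_per_row]
        mask.append([1 if j in top else 0 for j in range(n_in)])
    return mask


W = [[0.5, -2.0, 0.1],
     [1.0, 0.2, -0.3]]
prompt_a = [[1.0, 0.0, 3.0], [1.0, 0.0, 4.0]]  # feature 1 is inactive
prompt_b = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.1]]  # feature 0 is inactive
mask_a = prompt_mask(W, prompt_a, keep_per_row=1)
mask_b = prompt_mask(W, prompt_b, keep_per_row=1)
print(mask_a, mask_b)
```

The two prompts select different weights from the same matrix, which is the sense in which each prompt activates its own set of "micro-experts" in this toy setting.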
Article 1355
Title@2025-05-24 (6): BRIT: Bidirectional Retrieval over Unified Image-Text Graph
Title: BRIT: Bidirectional Retrieval over Unified Image-Text Graph | BRIT: Bidirektionale Retrieval über Unified Image-Text Graph | BRIT: 统一图像文字图的双向检索 2505.18450v1 |
Authors: Ainulla Khan, Yamada Moyuru, Srinidhi Akella
Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.
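A bidirectional traversal over a text-image graph can be sketched with a bounded breadth-first search. This assumes seed nodes have already been chosen by some retriever and that edges are followed in both directions; the node labels, `retrieve_subgraph`, and `max_hops` are hypothetical and only illustrate the general idea, not BRIT's actual procedure.

```python
from collections import deque


def retrieve_subgraph(edges, seeds, max_hops):
    """Return {node: hop distance} for nodes within max_hops of the seeds."""
    neighbors = {}
    for a, b in edges:
        # Store each edge in both directions (text-to-image and image-to-text).
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


edges = [
    ("text:intro", "img:fig1"),
    ("img:fig1", "text:caption1"),
    ("text:caption1", "img:fig2"),
    ("text:methods", "img:fig3"),
]
found = retrieve_subgraph(edges, seeds=["text:intro"], max_hops=2)
print(found)
```

Here a text seed reaches a second text node only through an image, which is the kind of cross-modal multi-hop path the abstract refers to.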
Article 1356
Title@2025-05-24 (6): Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model
Title: Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model | Nutzung von Online-Daten zur Verbesserung des medizinischen Wissens in einem kleinen persischen Sprachmodell | 在小型波斯语言模式中利用在线数据加强医疗知识 2505.16000v2 |
Authors: Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli
The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data, including a crawled corpus from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and provides better responses compared to its baseline. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments.
Article 1357
Title@2025-05-24 (6): Efficient Long CoT Reasoning in Small Language Models
Title: Efficient Long CoT Reasoning in Small Language Models | Effiziente Long CoT-Reasoning in kleinen Sprachmodellen | 低语言模式中有效的长期计算成本理由 2505.18440v1 |
Authors: Zhaoyang Wang, Jinqi Jiang, Tian Qiu, Hui Liu, Xianfeng Tang, Huaxiu Yao
Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs such reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn, given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.
Article 1358
Title@2025-05-24 (6): Voice of a Continent: Mapping Africa’s Speech Technology Frontier
Title: Voice of a Continent: Mapping Africa’s Speech Technology Frontier | Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren | 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v1 |
Authors: AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed
Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.